Microsoft has no intention of basing its success in AI on its partnership with OpenAI.
No, quite the opposite. Instead, the company often referred to as Redmond after its Washington state headquarters came out swinging today, releasing three new models in its expanding Phi series of AI language/multimodal models.
The three new Phi 3.5 models are the 3.82 billion parameter Phi-3.5-mini-instruct, the 41.9 billion parameter Phi-3.5-MoE-instruct, and the 4.15 billion parameter Phi-3.5-vision-instruct, designed for basic/fast reasoning, more powerful reasoning, and vision (image and video analysis) tasks, respectively.
All three models are available for developers to download, use, and fine-tune on Hugging Face under a Microsoft-branded MIT License that allows for commercial use and modification without restrictions.
Remarkably, all three models boast near-state-of-the-art performance on a number of third-party benchmarks, outperforming offerings from other AI vendors including Google’s Gemini 1.5 Flash, Meta’s Llama 3.1, and in some cases even OpenAI’s GPT-4o.
Such achievements, combined with a permissive open license, have drawn praise for Microsoft on the social network X.
Let’s take a quick look at each of the new models, based on their release notes published on Hugging Face.
Phi-3.5 Mini Instruct: Optimized for compute-constrained environments
The Phi-3.5 Mini Instruct model is a lightweight AI model with 3.8 billion parameters, designed for instruction following and supporting a 128k token context length.
This model is ideal for scenarios requiring strong reasoning abilities in environments with limited memory or computational power, including tasks such as code generation, mathematical problem solving, and logic-based reasoning.
Despite its compact size, the Phi-3.5 Mini Instruct delivers competitive performance on multilingual and multi-turn conversational tasks, a significant improvement over previous models.
On many benchmarks it delivers near-state-of-the-art performance, and on the RepoQA benchmark, which measures “long-context code understanding,” it outperforms other models of similar size, such as Llama-3.1-8B-instruct and Mistral-7B-instruct.
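For developers who want to try it, here is a minimal sketch of loading Phi-3.5-mini-instruct with the Hugging Face transformers library. The model ID matches the published repo, but the prompt and generation settings below are illustrative, not Microsoft’s recommended configuration.

```python
# Minimal sketch: running Phi-3.5-mini-instruct via Hugging Face transformers.
# Generation settings are illustrative; consult the model card for tuned defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 3.8B model within a single consumer GPU
    device_map="auto",
    trust_remote_code=True,      # the Phi repos ship custom model code
)

# Phi-3.5 is an instruct model, so format the prompt with its chat template.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```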
Phi-3.5 MoE: Microsoft’s “Mixture of Experts”
The Phi-3.5 MoE (Mixture of Experts) model appears to be the first in this model class from the company, combining multiple model types in one, each specializing in different tasks.
This model uses an architecture with 42 billion parameters in total and supports a 128k token context length, providing scalable AI performance for demanding applications. However, only 6.6 billion of those parameters are active at any one time, according to the Hugging Face documentation.
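To make the total-versus-active distinction concrete, here is a toy PyTorch sketch of top-k expert routing, the general mechanism behind mixture-of-experts models. This is not Microsoft’s actual architecture; it only illustrates why every expert counts toward total parameters while each token activates just the few experts the router selects.

```python
# Toy mixture-of-experts layer (illustrative only, not Phi-3.5 MoE's design).
# Total parameters grow with num_experts; active parameters per token grow with top_k.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # gating network scores all experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top-k experts only.
        gate_logits = self.router(x)
        weights, indices = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the selected experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():  # only the chosen experts run, so most weights stay idle
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```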
Designed to excel at a variety of reasoning tasks, Phi-3.5 MoE offers strong performance in coding, math, and multilingual language understanding, often outperforming larger models on specific benchmarks, including RepoQA.
Impressively, it also beats GPT-4o mini on the five-shot MMLU (Massive Multitask Language Understanding) benchmark, which covers subjects such as STEM, the humanities, and the social sciences at multiple levels of expertise.
The MoE model’s unique architecture allows it to maintain efficiency while handling complex AI tasks across multiple languages.
Phi-3.5 Vision Instruct: Advanced Multimodal Reasoning
Rounding out the trio is Phi-3.5 Vision Instruct, which integrates text and image processing capabilities.
This multimodal model is particularly useful for tasks such as general image understanding, optical character recognition, chart and table comprehension, and video summarization.
Like the other models in the Phi-3.5 series, Vision Instruct supports a 128k token context length, enabling it to handle complex, multi-frame visual tasks.
Microsoft emphasizes that the model was trained on a combination of synthetic and filtered publicly available datasets, with a focus on high-quality, reasoning-dense data.
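A minimal sketch of using Phi-3.5-vision-instruct for image understanding follows, based on the published Hugging Face repo. The image URL and prompt are placeholders, and the generation settings are illustrative rather than Microsoft’s recommended defaults.

```python
# Minimal sketch: image question-answering with Phi-3.5-vision-instruct.
# The image URL below is a placeholder; substitute any accessible image.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Phi-3.5-vision expects numbered <|image_N|> placeholders in the prompt text.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
messages = [{"role": "user", "content": "<|image_1|>\nSummarize this chart."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the generated answer, skipping the prompt tokens.
print(processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```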
Training the new Phi trio
The Phi-3.5 Mini Instruct model was trained on 3.4 trillion tokens using 512 H100-80G GPUs in 10 days, while the Vision Instruct model was trained on 500 billion tokens using 256 A100-80G GPUs in 6 days.
The Phi-3.5 MoE model, with its mixture-of-experts architecture, was trained on 4.9 trillion tokens using 512 H100-80G GPUs over 23 days.
Open source software under the MIT license
All three Phi-3.5 models are available under the MIT License, reflecting Microsoft’s commitment to supporting the open source community.
This license allows developers to freely use, modify, merge, publish, distribute, sublicense, and sell copies of the software.
The license also includes a disclaimer that the software is provided “as is,” without warranty of any kind. Microsoft and other copyright holders are not responsible for any claims, damages, or other liabilities that may arise from the use of the software.
The launch of the Phi-3.5 series by Microsoft represents a significant step forward in the development of multilingual and multimodal AI.
By making these models open source, Microsoft enables developers to integrate cutting-edge AI capabilities into their applications, driving innovation in both commercial and research settings.