Microsoft’s Phi-4-multimodal AI model handles speech, text, and video
Thursday, February 27, 2025, 20:06, by InfoWorld
Microsoft has introduced a new AI model that, it says, can process speech, vision, and text locally on-device using less compute capacity than previous models.
Innovation in generative artificial intelligence isn't all about large language models (LLMs) running in big data centers: there's also a lot of work going on around small language models (SLMs) that can run on more resource-constrained devices such as mobile phones, laptops, and other edge computing devices. Microsoft's contribution is a suite of small models called Phi, whose fourth generation it introduced in December. Now it's adding two new members to the Phi family: Phi-4-multimodal and Phi-4-mini. Like their siblings, they will be available through Azure AI Foundry, Hugging Face, and the Nvidia API Catalog under the MIT license.

Phi-4-multimodal is a 5.6-billion-parameter model that uses the mixture-of-LoRAs technique to process speech, vision, and language simultaneously. LoRA, or Low-Rank Adaptation, is a way of improving the performance of a large language model on specific tasks without fine-tuning all of its parameters. Instead, model developers insert a small number of new weights into the model and train only those, making the process faster and more memory-efficient and resulting in more lightweight models that are easier to store and share. This makes Phi-4-multimodal capable of low-latency inference while being optimized for on-device execution with reduced computational overhead. Use cases include running the model locally on smartphones, in cars, and in lightweight enterprise applications such as a multilingual financial services app.

Analysts said Phi-4-multimodal expands the horizon for developers, especially those looking to build AI-based applications for mobile or otherwise resource-constrained devices. "Phi-4-multimodal integrates text, image, and audio processing with strong reasoning capabilities, enhancing AI applications for developers and enterprises with versatile, efficient, and scalable solutions," said Charlie Dai, vice president and principal analyst at Forrester. Yugal Joshi, partner at Everest Group, said that although the model can be deployed across compute-constrained environments, mobile devices are not ideal for implementing most generative AI use cases. But he does see the new SLMs as a sign of Microsoft taking inspiration from DeepSeek, which also reduces the need for large-scale compute infrastructure to run its models.

On the benchmark front, Phi-4-multimodal shows a performance gap compared with Gemini-2.0-Flash and GPT-4o-realtime-preview on speech question answering (QA) tasks. Microsoft said the smaller size of the Phi-4 models results in less capacity to retain factual question-answering knowledge, but work is under way to improve this capability in future iterations. Phi-4-multimodal does, though, outperform popular LLMs including Gemini-2.0-Flash-Lite and Claude-3.5-Sonnet in mathematical and science reasoning, as well as in optical character recognition (OCR) and visual science reasoning.

Phi-4-mini is a 3.8-billion-parameter model based on a dense decoder-only transformer that supports sequences of up to 128,000 tokens. "Despite its compact size, it continues outperforming larger models in text-based tasks, including reasoning, math, coding, instruction-following, and function-calling," Weizhu Chen, VP of generative AI at Microsoft, wrote in a blog post describing the two new Phi-4 models.
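The LoRA mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the general technique, not Microsoft's implementation: the layer sizes, rank, and scaling factor are arbitrary choices for the example. The base weight stays frozen and only the two small low-rank matrices are trained.

```python
# Minimal LoRA sketch (hypothetical sizes): the frozen base projection is
# augmented by a trainable low-rank update delta_W = B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen "pretrained" weight (randomly initialized here for illustration).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False
        # Trainable low-rank factors; rank is much smaller than in/out features.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")
```

Running this reports roughly 16,000 trainable parameters against more than a million in total, which is the kind of saving in training memory and adapter size the article is referring to.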
IBM updates Granite model family too
Separately, IBM has released an update to its Granite family of foundation models in the form of the Granite 3.2 2B and 8B models. Big Blue said the new models come with improved chain-of-thought capabilities for enhanced reasoning, which helps them improve on the performance of their predecessors. In addition, IBM has released a new vision language model (VLM) for document understanding tasks that matches or exceeds the performance of significantly larger models, such as Llama 3.2 11B and Pixtral 12B, on benchmarks including DocVQA, ChartQA, AI2D, and OCRBench.
https://www.infoworld.com/article/3834988/microsofts-phi-4-multimodal-ai-model-handles-speech-text-a...