Alibaba's open-source Mobile Neural Network (MNN) project has released the latest version of its on-device multimodal large-model application, MnnLlmApp, adding support for the Qwen-2.5-Omni 3B and 7B models. The application is fully open source, runs entirely on the device, and handles modalities including text-to-text, image-to-text, audio-to-text, and text-to-image generation; its efficient performance and low resource consumption have drawn significant attention from developers. AIbase observes that this MNN update further advances the spread of multimodal AI on mobile devices.
Project Address:
https://github.com/alibaba/MNN/blob/master/apps/Android/MnnLlmChat/README.md
Core Highlights: Comprehensive Enhancement of Multimodal Capabilities
The new version of MnnLlmApp integrates the Qwen-2.5-Omni-3B and 7B models and uses the Alibaba Cloud Qwen team's Thinker-Talker architecture to process text, images, audio, and video. AIbase learns that the application supports the following functions (a minimal usage sketch follows the list):
Text-to-text: Generates high-quality dialogues, reports, or code, comparable to cloud models.
Image-to-text: Identifies text in images or describes scene content, suitable for document scanning and visual question answering.
Audio-to-text: Efficiently transcribes speech, supporting multi-language voice recognition.
Text-to-image: Generates high-quality images through diffusion models, meeting creative design needs.
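For developers who want to drive these modalities programmatically rather than through the app's UI, MnnLlmApp builds on MNN's C++ LLM engine. The sketch below is a minimal, hedged example: the Llm::createLLM / load / response calls follow the pattern of MNN's llm_demo sample, but the exact namespace, header path, model config path, and the <img>...</img> prompt marker for image input are assumptions that may differ across MNN versions and model configurations.

```cpp
// Minimal sketch of local multimodal inference with MNN's LLM engine.
// API names follow MNN's llm_demo sample; the namespace, header path,
// model path, and the <img> prompt convention are assumptions.
#include <iostream>
#include <memory>
#include <string>
#include "llm/llm.hpp"   // header location as used in MNN's llm demo (assumption)

using namespace MNN::Transformer;  // namespace per recent MNN releases (assumption)

int main() {
    // Path to a locally exported Qwen-2.5-Omni model config (hypothetical path).
    std::unique_ptr<Llm> llm(Llm::createLLM("/sdcard/models/qwen2.5-omni-3b/config.json"));
    llm->load();  // loads weights fully on-device; no network access needed

    // Text-to-text: plain prompt in, generated text out.
    std::string reply = llm->response("Summarize the MNN project in two sentences.");
    std::cout << reply << std::endl;

    // Image-to-text: multimodal prompt with an inline image marker
    // (marker syntax is an assumption based on MNN's Qwen-VL examples).
    std::string caption = llm->response(
        "<img>/sdcard/docs/receipt.jpg</img>What does this receipt say?");
    std::cout << caption << std::endl;
    return 0;
}
```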
Feedback on social media shows that developers are particularly pleased with how Qwen-2.5-Omni-3B performs on a 24GB consumer GPU: it retains over 90% of the 7B model's multimodal performance on the OmniBench benchmark while cutting memory usage by more than 50% (from 60.2GB to 28.2GB).
Technical Advantages: Local Inference and Extreme Optimization
The MNN framework is known for being lightweight and fast, with optimizations targeted at mobile and edge devices. The AIbase editorial team notes that the new MnnLlmApp performs strongly in CPU inference: prefill is 8.6 times faster and decoding 2.3 times faster than llama.cpp. The application runs entirely on the device and handles multimodal tasks without an internet connection, so no data is uploaded to external servers and privacy is preserved. The range of supported models is broad, covering mainstream open-source families such as Qwen, Gemma, Llama, and Baichuan, and developers can download and build the application directly from GitHub. In addition, MNN provides FlashAttention-2 support, further improving the efficiency of long-context processing.
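To relate the quoted prefill and decode figures to a specific device, a rough self-timed measurement can be wrapped around the same hypothetical response call from the earlier sketch. The example below is illustrative only: it measures end-to-end generation time with a crude word-count proxy for tokens, and is not MNN's official benchmark harness.

```cpp
// Rough on-device timing sketch (not MNN's official benchmark tooling).
// Reuses the hypothetical Llm API assumed in the previous example.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include "llm/llm.hpp"

using namespace MNN::Transformer;
using Clock = std::chrono::steady_clock;

int main() {
    std::unique_ptr<Llm> llm(Llm::createLLM("/sdcard/models/qwen2.5-omni-3b/config.json"));
    llm->load();

    const std::string prompt = "Explain what the prefill phase is in LLM inference.";
    auto t0 = Clock::now();
    std::string out = llm->response(prompt);   // prefill + decode, measured together
    auto t1 = Clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Crude proxy: whitespace-split word count stands in for the decoded token count.
    size_t words = 1 + std::count(out.begin(), out.end(), ' ');
    std::cout << "Generated ~" << words << " words in " << seconds << " s ("
              << words / seconds << " words/s end-to-end)" << std::endl;
    return 0;
}
```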
Application Scenarios: From Development to Production
MnnLlmApp’s multimodal capabilities showcase its potential in various scenarios:
Education and Office: Scan documents using the image-to-text function or transcribe meeting records with audio-to-text.
Creative Design: Generate promotional materials or art pieces using text-to-image.
Intelligent Assistants: Build localized voice interaction applications, such as offline navigation or customer service assistants (see the sketch after this list).
Developer Learning: Open-source code and detailed documentation provide reference examples for developing mobile large models.
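As a concrete illustration of the offline assistant scenario above, the following sketch chains audio-to-text transcription into a text-to-text reply using the same hypothetical Llm API as the earlier examples. The <audio>...</audio> prompt marker is an assumption modeled on the image marker and may not match MNN's actual audio input convention.

```cpp
// Hypothetical offline voice-assistant loop: transcribe a recorded query,
// then answer it, all on-device. API names and the <audio> marker are
// assumptions carried over from the earlier sketches, not confirmed MNN interfaces.
#include <iostream>
#include <memory>
#include <string>
#include "llm/llm.hpp"

using namespace MNN::Transformer;

int main() {
    std::unique_ptr<Llm> llm(Llm::createLLM("/sdcard/models/qwen2.5-omni-3b/config.json"));
    llm->load();

    // Step 1: audio-to-text. The recording path is hypothetical.
    std::string transcript = llm->response(
        "<audio>/sdcard/recordings/query.wav</audio>Transcribe this audio verbatim.");

    // Step 2: text-to-text. Answer the transcribed question without any network call.
    std::string answer = llm->response(
        "Answer the user's question briefly: " + transcript);

    std::cout << "User asked: " << transcript << "\n"
              << "Assistant:  " << answer << std::endl;
    return 0;
}
```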
AIbase's analysis suggests that MNN's open-source nature and its support for Qwen-2.5-Omni make it an ideal platform for developers exploring on-device multimodal AI. Social media users report that although MnnLlmApp's inference speed (Llama 3.1 8B prefill at about 28 tokens/s) is not top-tier, its multimodal integration and ease of use are sufficient for prototype development.
Industry Background: Open Source Craze in Mobile AI
MNN's update arrives amid intensifying competition in mobile AI. Open-source offerings such as DeepSeek's R1 and Baichuan-Omni have also recently emphasized local deployment and cost-effectiveness. However, MNN holds an advantage in performance and compatibility thanks to Alibaba's ecosystem support and hardware optimization, such as deep adaptation to Android devices. AIbase notes that Alibaba Cloud has already open-sourced more than 200 generative AI models, with Qwen-series downloads on Hugging Face exceeding 80 million, demonstrating its global influence. An iOS version of MnnLlmApp has also been released, further extending its cross-platform coverage.
Future of Mobile Multimodal AI
This MnnLlmApp update marks an accelerating migration of multimodal AI from the cloud to edge devices. The AIbase editorial team expects that as the Qwen-2.5-Omni models continue to be optimized (for example, supporting longer videos or lower-latency speech generation), MNN will play a larger role in smart homes, in-vehicle systems, and offline assistants. However, social media feedback also points out that the model-loading workflow, which currently requires building external models from source, still needs to be simplified to improve usability.