The OpenBMB team recently announced the official open-source release of MiniCPM-V 4.0, a new multimodal large model. With its lightweight architecture and strong performance, it has been dubbed "GPT-4V on a phone" and is expected to bring breakthrough AI capabilities to mobile devices.
The core of MiniCPM-V 4.0 lies in its design: it combines the SigLIP2-400M vision encoder with the MiniCPM4-3B language model, totaling only 4.1B parameters, yet it delivers powerful image, multi-image, and video understanding. It can not only handle single images with ease but also reason over relationships across multiple images and over video clips, giving users a smarter interaction experience.
Despite its small parameter count, MiniCPM-V 4.0's performance is impressive. Across eight mainstream OpenCompass benchmarks, the model achieved an average score of 69.0, surpassing competitors such as GPT-4.1-mini and Qwen2.5-VL-3B. This result demonstrates solid visual-understanding capability, especially in complex scenarios, where its accuracy and depth of analysis stand out.
Another major highlight of MiniCPM-V 4.0 is its deep optimization for mobile devices. In real-world testing on the latest iPhone 16 Pro Max, first-token latency was under 2 seconds and decoding speed exceeded 17 tokens per second, while device heating stayed well controlled during operation, ensuring a smooth and stable user experience. It can also handle concurrent requests, making it practical for phones, tablets, and other edge devices.
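To put those numbers in perspective, here is a rough back-of-the-envelope estimate of how long a reply of a given length would take, assuming the reported ~2-second first-token latency and 17 tokens/second decode rate hold steady (real on-device timings vary with prompt length, quantization, and thermal state):

```python
def estimate_reply_time(num_tokens: int,
                        first_token_latency_s: float = 2.0,
                        decode_tokens_per_s: float = 17.0) -> float:
    """Rough estimate: time to first token plus steady-state decoding.

    The 2 s and 17 tokens/s defaults are the on-device figures reported
    for iPhone 16 Pro Max; they are assumptions for illustration only.
    """
    return first_token_latency_s + num_tokens / decode_tokens_per_s

# A ~100-token answer would take roughly 8 seconds end to end.
print(round(estimate_reply_time(100), 1))  # → 7.9
```

Even under these simplified assumptions, a typical short answer arrives in well under ten seconds, which is consistent with the "smooth interaction" claim.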
To lower the barrier for developers, the OpenBMB team provides rich ecosystem support. MiniCPM-V 4.0 is compatible with mainstream inference frameworks such as llama.cpp, Ollama, and vLLM, giving developers flexible deployment options. The team has also built a dedicated iOS app that runs directly on iPhone and iPad, and released a detailed Cookbook with complete tutorials and code examples.
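As one illustration of local deployment, the sketch below builds a request payload for Ollama's HTTP generate endpoint, which accepts base64-encoded images for multimodal models. The model tag `minicpm-v` is an assumption here; check the Ollama model library for the exact name and version available for MiniCPM-V 4.0.

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON payload for Ollama's POST /api/generate endpoint.

    Ollama expects images as base64 strings in the "images" list.
    The model tag passed in (e.g. "minicpm-v") is an assumption and
    may differ from the tag actually published for this release.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

# With a local Ollama server running, this payload would be POSTed to
# http://localhost:11434/api/generate (e.g. via requests.post(url, json=payload)).
payload = build_vision_request("minicpm-v", "Describe this image.", b"...image bytes...")
print(json.dumps(payload)[:40])
```

The same payload shape works for any vision-capable model served by Ollama, which is part of what makes framework compatibility valuable: swapping models is a one-string change.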
The release of MiniCPM-V 4.0 opens up new possibilities for applying multimodal technology. Its main application scenarios include:
Image Analysis and Multi-turn Dialogue: Users can upload an image, have the model analyze its content, and continue the conversation based on it.
Video Understanding: It can analyze video content, providing solutions for scenarios that require processing video information.
OCR and Mathematical Reasoning: The model can recognize text in images and solve mathematical problems, greatly enhancing its practicality in everyday work and study.
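The multi-turn image dialogue described above can be sketched as a growing message history in the role/content format common to multimodal chat APIs. The exact structure MiniCPM-V expects may differ; the field names below are illustrative:

```python
def add_turn(history: list, role: str, content) -> list:
    """Append one chat turn; content may mix text and image references."""
    history.append({"role": role, "content": content})
    return history

history = []
# First turn: an image plus a question about it.
add_turn(history, "user", [{"type": "image", "path": "photo.jpg"},
                           {"type": "text", "text": "What is in this picture?"}])
add_turn(history, "assistant", [{"type": "text", "text": "A cat on a sofa."}])
# The follow-up question reuses the accumulated history,
# so the model can resolve "the cat" from the earlier turns.
add_turn(history, "user", [{"type": "text", "text": "What color is the cat?"}])
print(len(history))  # → 3
```

Multi-turn understanding falls out of this pattern naturally: each new request carries the full history, and the model grounds follow-up questions in the previously uploaded image.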
The open-source release of MiniCPM-V 4.0 not only showcases the capabilities of Chinese AI teams in lightweight model development but also gives developers worldwide a powerful tool for exploring on-device multimodal technology, a solid step toward making AI accessible to everyone.