Xiaomi recently released MiMo-VL, a multimodal model that picks up the baton from MiMo-7B and shows strong capabilities across multiple fields. The model significantly outperforms its peers on tasks such as general question answering and on understanding and reasoning over images, videos, and language. It even rivals specialized models on the GUI Grounding task, laying groundwork for the Agent era.

The MiMo-VL-7B model has achieved remarkable results in multimodal reasoning tasks. Despite having only 7 billion parameters, it surpasses Alibaba's Qwen-2.5-VL-72B and QVQ-72B-Preview, which have roughly ten times as many parameters, on OlympiadBench and on several math benchmarks (MathVision, MathVerse). It also outperforms the closed-source model GPT-4o. In an internal large-model arena evaluation based on real user experience, MiMo-VL-7B surpassed GPT-4o, making it a standout among open-source models. In practical applications, the model excels at complex image reasoning and question answering, and it shows great potential in GUI operations spanning more than ten steps, for example adding a Xiaomi SU7 to a user's wishlist.

MiMo-VL-7B’s comprehensive visual perception capabilities stem from high-quality pre-training data and an innovative Mixed On-policy Reinforcement Learning (MORL) algorithm. During multi-stage pre-training, Xiaomi collected, cleaned, and synthesized high-quality multimodal data totaling 2.4 trillion tokens, covering types such as image-text pairs, video-text pairs, and GUI operation sequences. Adjusting the proportions of the different data types from stage to stage strengthened the model's long-range multimodal reasoning. MORL combines feedback signals from text reasoning, multimodal perception and reasoning, and RLHF; its on-policy training is reported to stabilize and accelerate learning, improving the model’s reasoning, perception performance, and user experience across the board.
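To make the idea of mixing feedback signals concrete, here is a minimal sketch in Python. It is not Xiaomi's implementation: the task tags, the `Sample` class, the reward functions, and the mean-centered advantage are all illustrative assumptions. It only shows how samples with rule-checkable answers and samples scored by a preference-style reward can share one training batch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    task: str            # hypothetical tag: "text_reasoning", "multimodal_reasoning", "rlhf"
    response: str        # model output to be scored
    reference: str = ""  # ground-truth answer, used only by verifiable tasks


def exact_match_reward(sample: Sample) -> float:
    """Rule-based reward: 1.0 if the response equals the reference answer."""
    return 1.0 if sample.response.strip() == sample.reference.strip() else 0.0


def toy_preference_reward(sample: Sample) -> float:
    """Stand-in for a learned reward model (assumption): capped word-count score."""
    return min(len(sample.response.split()) / 10.0, 1.0)


# Each task type is routed to its own reward function (hypothetical mapping).
REWARD_ROUTER: Dict[str, Callable[[Sample], float]] = {
    "text_reasoning": exact_match_reward,
    "multimodal_reasoning": exact_match_reward,
    "rlhf": toy_preference_reward,
}


def score_batch(batch: List[Sample]) -> List[float]:
    """Score a mixed batch so all feedback signals land on one [0, 1] scale."""
    return [REWARD_ROUTER[s.task](s) for s in batch]


def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the batch mean, a simple baseline for policy-gradient updates."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


batch = [
    Sample("text_reasoning", "42", "42"),
    Sample("multimodal_reasoning", "a cat", "a dog"),
    Sample("rlhf", "Sure, here is a short and polite answer."),
]
rewards = score_batch(batch)
advantages = centered_advantages(rewards)
print(rewards)
print(advantages)
```

Keeping every reward on the same scale before computing advantages is one simple way to stop a single signal from dominating the update; a real system would use a trained reward model and a full policy-optimization step in place of the toy functions here.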

Related link: https://huggingface.co/XiaomiMiMo