Xiaomi has announced the open source of a new version of its multimodal large model - Xiaomi MiMo-VL-7B-2508, and simultaneously released two model versions: SFT and RL. This upgrade not only optimizes the output mode but also improves the stability of RL training, achieving significant progress in various capability evaluations. At the same time, users can flexibly switch between "thinking mode" and "non-thinking mode" to adapt to different scenario requirements.

Compared to the MiMo-VL-7B-RL released in May this year, the new version has achieved breakthroughs on multiple authoritative benchmarks:

Subject Reasoning Test MMMU: Increased from 66.7 to 70.6, the first time breaking 70 points

Document Understanding Test ChartQA: Increased from 91.7 to 94.4

GUI Localization Test ScreenSpot-v2: Increased from 90.5 to 92.5

Video Understanding Test VideoMME: Increased from 67.4 to 70.8

In terms of interaction experience, the new version introduces an autonomous control thinking mode switching function. The default "thinking mode" displays the complete reasoning process, with more comprehensive performance and a 100% control success rate; while the "non-thinking mode" skips the reasoning process, offering faster response speed and a 99.84% control success rate, suitable for tasks requiring high real-time performance.

According to Xiaomi's internal VLM Arena score, the new version of MiMo-VL-7B-RL-2508 received 1131.2 points, significantly higher than the previous generation's 1093.9 points. The evaluation results show that the model comprehensively surpasses the previous generation in most benchmark tests. Even in non-thinking mode, it can maintain excellent performance in perceptual tasks. Compared to other similar multimodal open-source models that support thinking functions, MiMo-VL-7B-RL-2508 still leads the industry.