Recently, Xiaomi's MiMo-VL multimodal model has taken up the baton from MiMo-7B, showcasing strong capabilities across multiple domains. The model significantly outperforms Qwen2.5-VL-7B, a benchmark multimodal model of the same size, on tasks such as general question answering and comprehension and reasoning over images, video, and language. Its performance on GUI grounding tasks is on par with that of specialized models, laying the groundwork for the coming era of agents.