Alibaba International has officially released and open-sourced its next-generation multimodal large model, Ovis2.5. The model is designed around native-resolution visual perception, deep reasoning, and cost-effective deployment. Its overall score on the mainstream multimodal benchmark suite OpenCompass improves significantly over the previous generation, Ovis2, keeping the series at the state-of-the-art (SOTA) level among comparable open-source models.
Ovis2.5 ships in two parameter scales. Ovis2.5-9B scores 78.3 on OpenCompass, surpassing many models with larger parameter counts and ranking first among open-source models under 40B parameters. Ovis2.5-2B scores 73.9, continuing the series' "small size, big power" philosophy, which makes it especially well suited to edge and resource-constrained deployments.
According to the official release, Ovis2.5 introduces systematic innovations in three areas: model architecture, training strategy, and data engineering. Architecturally, it continues the series' structured embedding alignment design, built from three core components: dynamic, native-resolution visual feature extraction; a visual vocabulary module that structurally aligns vision with text; and a Qwen3-based language backbone.
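The article does not detail how the visual vocabulary works internally, but the core idea published for earlier Ovis models is that each visual patch feature is mapped to a probability distribution over a learnable visual embedding table, mirroring how a text token indexes the LLM's embedding table. Below is a minimal PyTorch sketch of that idea; all dimensions, sizes, and names are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Sketch of Ovis-style structured embedding alignment.

    A ViT patch feature is turned into a probability distribution over a
    learnable visual vocabulary; the visual token embedding is the
    probability-weighted average of the vocabulary's embedding rows.
    Sizes here are hypothetical, chosen only for illustration.
    """

    def __init__(self, feat_dim=1152, vocab_size=65536, llm_dim=4096):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, vocab_size)  # visual head
        self.vte = nn.Embedding(vocab_size, llm_dim)      # visual embedding table

    def forward(self, patch_features):  # (B, N, feat_dim)
        probs = self.to_logits(patch_features).softmax(dim=-1)  # (B, N, vocab)
        return probs @ self.vte.weight                          # (B, N, llm_dim)

# The resulting visual embeddings would be concatenated with text
# embeddings and fed into the Qwen3 language backbone.
emb = VisualEmbedding()
vis_tokens = emb(torch.randn(1, 256, 1152))
print(vis_tokens.shape)  # torch.Size([1, 256, 4096])
```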
On training strategy, Ovis2.5 adopts a more refined five-stage pipeline, covering basic visual pre-training, multimodal pre-training, and large-scale instruction fine-tuning, among other stages. Algorithms such as DPO (Direct Preference Optimization) and GRPO (Group Relative Policy Optimization) are then used to strengthen preference alignment and reasoning, effectively improving model performance. In addition, the training stack achieves a 3-4x end-to-end speedup.
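The article does not spell out how DPO is applied in Ovis2.5, but for reference, the standard DPO objective it names looks like the following minimal sketch: the policy is pushed to prefer the chosen response over the rejected one, relative to a frozen reference model. Shapes and the beta value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss. Each input is the summed log-probability of a
    full response under the policy or the frozen reference model (one
    scalar per example); beta controls deviation from the reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```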
On data engineering, Ovis2.5's training data grows by 50% over the previous generation, with emphasis on key areas such as visual reasoning, charts, OCR (optical character recognition), and grounding. In particular, a large amount of "thinking" data deeply adapted to Qwen3 was synthesized, substantially strengthening the model's reflection and reasoning abilities.
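The actual data schema is not published in this article; purely for illustration, a synthesized "thinking" sample might take a shape like the following, where the `<think>` delimiters follow Qwen3's thinking-mode convention and every field name is a hypothetical assumption.

```python
# Hypothetical shape of a synthesized "thinking" training sample.
# Field names and file paths are invented for illustration; only the
# <think>...</think> convention is borrowed from Qwen3's thinking mode.
sample = {
    "image": "chart_001.png",
    "conversations": [
        {"role": "user",
         "content": "<image>\nWhich quarter had the highest revenue?"},
        {"role": "assistant",
         "content": "<think>The bars rise from Q1 to Q3, then drop in Q4; "
                    "Q3 is the tallest at roughly 4.2M.</think>\n"
                    "Q3 had the highest revenue."},
    ],
}
```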
The code and models of Ovis2.5 are now available on GitHub and Hugging Face, where users can access the relevant resources and further explore the model's application potential.
Code: https://github.com/AIDC-AI/Ovis
Model: https://huggingface.co/AIDC-AI/
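As a starting point, here is a minimal loading sketch, assuming Ovis2.5 follows the same `transformers` + `trust_remote_code` pattern as earlier Ovis releases. The repo id below is an assumption (the Hugging Face link above points only to the organization page); check the AIDC-AI organization for the exact model name.

```python
import torch
from transformers import AutoModelForCausalLM

# "AIDC-AI/Ovis2.5-9B" is an assumed repo id; earlier Ovis releases load
# via transformers with trust_remote_code=True, and Ovis2.5 presumably
# follows the same pattern.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```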
Key Points:
🌟 Ovis2.5-9B scored 78.3 on the OpenCompass evaluation, ranking first among open-source models under 40B parameters and maintaining the SOTA level.
🔧 It includes two versions: Ovis2.5-9B is suitable for large-scale applications, while Ovis2.5-2B focuses on resource-constrained scenarios.
📊 It adopts an innovative architecture and training strategy, with a 50% increase in data volume, focusing on key areas such as visual reasoning.