The competitive landscape of global multimodal large models has shifted again. The evaluation platform SuperCLUE recently released its SuperCLUE-VLM comprehensive ranking of multimodal vision-language models for December 2025. Google's Gemini-3-Pro led with 83.64 points, demonstrating a clear advantage in visual understanding and reasoning. SenseTime's SenseNova V6.5Pro took second place with 75.35 points, and ByteDance's Douyin large model followed in third with 73.15 points. Overall, domestic models performed strongly, reflecting China's rapid progress in the multimodal field.
Evaluation Dimensions: Three Capabilities Fully Measure a Model's "Vision"
SuperCLUE-VLM evaluates a model's real visual understanding ability from three core dimensions:
- Basic Cognition: Identify basic elements such as objects, text, and scenes in images;
- Visual Reasoning: Understand the logic, causal relationships, and implicit information in images;
- Visual Application: Complete tasks such as image-text generation, cross-modal Q&A, and tool invocation.
Gemini-3-Pro Dominates, Domestic Models Catch Up
Google's Gemini-3-Pro leads in all three indicators:
- Basic Cognition: 89.01 points
- Visual Reasoning: 82.82 points
- Visual Application: 79.09 points
Its overall performance far surpasses that of its competitors, reinforcing Google's dominant position in the multimodal field.
Domestic models also showed strong performance:
- SenseTime's SenseNova V6.5Pro ranks second with 75.35 points, showing balanced reasoning and application capabilities;
- ByteDance's Douyin large model ranks third with 73.15 points, achieving an impressive 82.70 points in basic cognition (surpassing some international models), though it is slightly weaker in visual reasoning;
- Baidu's ERNIE-5.0-Preview and Alibaba's Qwen3-VL follow closely, both entering the top five.
Notably, Qwen3-VL became the first open-source multimodal model on the list to exceed 70 points in total, providing global developers with a high-performance, commercializable open foundation.

International Giants Show Divergence: Claude Performs Steadily, GPT-5.2 Falls Behind Unexpectedly
In the international group, Anthropic's Claude-opus-4-5 scored 71.44 points, placing it in the upper middle of the list and continuing its strength in language understanding. OpenAI's GPT-5.2 (high configuration), however, scored only 69.16 points, a lower-than-expected placement that has sparked discussion about the direction of its multimodal optimization.
AIbase Observation: The Multimodal Competition Enters a New "Practical" Stage
The SuperCLUE-VLM list is not just a technical ranking but also reflects industry trends:
- Rise of Open Source Models: Qwen3-VL proves that the open source approach can also achieve high performance, promoting the democratization of technology;
- Focus on Scenario Implementation by Domestic Models: Models like Douyin and SenseTime perform well in basic cognition, aligning with frequent needs such as Chinese internet image-text understanding and short video analysis;
- Visual Reasoning Remains a Bottleneck: Most models still have gaps in advanced tasks like complex logic and causal inference, which is a key factor behind Gemini's continued leadership.

