Recently, SuperCLUE released its 2025 Annual Chinese Large Model Benchmark Report, drawing wide attention from tech enthusiasts. The evaluation covered 23 domestic and international large models across six core dimensions, including mathematical reasoning, scientific reasoning, and code generation. The results show that overseas closed-source models still hold the lead: Anthropic's Claude-Opus-4.5-Reasoning scored 68.25 points and ranked first, making it the standout performer of this round.
Close behind are Google's Gemini-3-Pro-Preview and OpenAI's GPT-5.2 (high), which scored 65.59 and 64.32 points to take second and third place. The strength of these overseas giants remains evident. Notably, however, domestic large models also turned in strong performances: the open-source Kimi-K2.5-Thinking and the closed-source Qwen3-Max-Thinking took fourth and sixth place with 61.50 and 60.61 points, respectively.
In specific subtasks, domestic models stood out. Kimi-K2.5-Thinking topped the code generation task with 53.33 points, while Qwen3-Max-Thinking tied with Gemini-3-Pro-Preview for first place in mathematical reasoning at 80.87 points. These results suggest that domestic models are moving from "following" to "running side by side," showing strong catch-up momentum.
Overall, overseas closed-source models still lead their domestic counterparts, but domestic open-source models performed well, holding a dominant position among the Top 5 open-source entries and demonstrating both their strength and their development potential. With continued technological progress and accelerating domestic research, the Chinese large-model field may soon deliver more surprises and challenges.