Recently, the China Academy of Information and Communications Technology (CAICT) officially launched version 3.0 of its "Fangsheng" benchmarking system, marking another major step forward for AI evaluation in China. The new version is a comprehensive upgrade over its predecessors: it adds tests of basic model attributes, systematically evaluating underlying characteristics such as parameter scale and inference efficiency. The system also lays groundwork for future frontier-capability testing, targeting ten advanced capabilities including full-modal understanding, long-term memory, and self-learning, and it provides deeper scenario-based evaluations for key industries such as industrial manufacturing, fundamental science, and finance.
To support "Fangsheng" 3.0, CAICT is strengthening its evaluation infrastructure on several fronts. First, it plans to expand its high-quality test data resources, adding 3 million new entries to meet the evaluation needs of multi-language, multi-task, and multi-scenario models. Second, it will systematically research and apply advanced testing methods, targeting key technical challenges in large-model evaluation such as high-quality test data synthesis and quality assessment. Finally, it will build a new generation of intelligent evaluation infrastructure, adding simulated environments for multi-agent interaction and environmental perception to support the evaluation of agent collaboration and dynamic-environment adaptation in complex scenarios.
Since 2024, CAICT has run a large-model benchmark test every two months. The latest round evaluated 141 large models and 7 agents, covering basic capabilities, reasoning, code application, and multimodal understanding. The results showed OpenAI's GPT-5 continuing to lead in overall capability, while domestic models such as Alibaba's Qwen3-Max-Preview and Moonshot's Kimi K2 performed strongly. Among multimodal models, image understanding has made clear breakthroughs, but complex logical reasoning tasks still leave room for improvement.
The code-application results likewise showed that while models perform well on simple function-level tasks, they still fall short in real-world project development. This indicates that the technological race between domestic and international players remains close, and that agents still need to improve in multimodal understanding and complex information processing.
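The report does not describe how "Fangsheng" actually scores these rounds, but the general shape of such a harness can be sketched: a task suite tagged by capability category, run against each model and averaged per category. The following minimal Python sketch is purely illustrative, not CAICT's methodology; every name in it (Task, exact_match, run_benchmark, the stand-in model) is hypothetical, and the exact-match scorer is a deliberately simple placeholder for real grading methods.

```python
# Illustrative sketch only: CAICT has not published "Fangsheng" internals.
# All names below are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    category: str   # e.g. "basic", "reasoning", "code", "multimodal"
    prompt: str
    reference: str  # expected answer for scoring


def exact_match(answer: str, reference: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def run_benchmark(model: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Run every task against one model; return mean score per category."""
    scores: dict[str, list[float]] = {}
    for task in tasks:
        scores.setdefault(task.category, []).append(
            exact_match(model(task.prompt), task.reference)
        )
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


if __name__ == "__main__":
    tasks = [
        Task("basic", "Capital of France?", "Paris"),
        Task("reasoning", "2 + 2 * 3 = ?", "8"),
    ]
    stub_model = lambda prompt: "Paris"  # stand-in for a real model API call
    print(run_benchmark(stub_model, tasks))  # {'basic': 1.0, 'reasoning': 0.0}
```

In practice, a suite at "Fangsheng" scale would replace the exact-match scorer with task-specific graders and loop this over all 141 models, but the per-category aggregation shown here is the basic pattern.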
CAICT will continue to strengthen R&D in large-model evaluation technology, enhance the credibility and authority of its evaluations, and support cutting-edge innovation in artificial intelligence and the development of new industrialization.