On June 23, the "Gaokao AI Assessment Benchmark" was released, the first report to assess AI capability in the college entrance examination (Gaokao) application scenario. The report was completed independently by Yousong Lab, with the Qwen Gaokao application agent as the evaluation subject. The results show that Qwen's performance has reached the level of human application consultants and that it has advantages in stability, accuracy, structured expression, and efficiency.

Yousong Lab is an independent research team focused on artificial intelligence and educational decision-making. It has long studied large model capability assessment, AI applications in education, and questions of information, cognition, and decision-making in students' academic advancement choices. Its research findings have been adopted by many universities and research institutions. The benchmark aims to establish a public, reproducible, and scalable evaluation framework for the rapidly emerging category of Gaokao application AI products, and to clarify which tasks AI can take on at this stage.
The report selected the Qwen Gaokao Agent as its first evaluation subject because the agent is built on Alibaba's 8 years of Gaokao service data and experience and is representative of the industry in product form, data accumulation, and user coverage. The human control group consisted of 53 application consultants with an average of 4.6 years of working experience.
The assessment covered four stages: basic facts and rules of Gaokao applications, simulated application filling, open consultation, and application recommendation reports. These correspond to the main steps students and parents go through when filling out applications, from researching information and understanding the rules to planning options and making decisions.
The results showed that Qwen answered all 44 objective questions correctly, a 100% accuracy rate, while human consultants averaged 89.3%. In simulated application filling, Qwen's plan included 6 acceptable applications, contained no explicit preference violations, and achieved the optimal result in the post-evaluation. Human consultants provided 5.3 acceptable applications on average. In open consultation, experts preferred the Qwen version in 58 of 100 anonymous comparisons, and its "directly displayable" rate was 56.0%, higher than the 33.0% for human consultants. Experts considered Qwen more stable in breaking down professional paths, giving risk warnings, and expressing itself clearly.
The report concluded that, within the scope of the assessment tasks, Qwen's performance has reached the level of senior human consultants, with particular advantages in stability, accuracy, structured expression, and response efficiency.
However, the report also pointed out that the value of human consultants remains irreplaceable. On topics such as income expectations and employment prospects, which require careful calibration to individual circumstances, consultants can provide more practical advice. In scenarios such as parent-child communication and value trade-offs, even a completely structured AI answer cannot replace human communication and judgment.
The report suggests that AI is better suited to efficiently handling information verification, data organization, and initial screening of plans, while consultants can focus more on family communication, value trade-offs, and personalized judgment. Only when the two complement each other, it says, can the application process become more accurate and better meet the actual needs of students and their families.