Recently, the ModelScope community announced the release of UGMathBench, a dynamic benchmark dataset designed to comprehensively evaluate the mathematical reasoning capabilities of language models across a wide range of undergraduate mathematics subjects. The dataset fills a gap in current assessments of language models' reasoning abilities in undergraduate mathematics and gives researchers a richer and more challenging testing platform.

With the rapid development of artificial intelligence, language models have demonstrated tremendous potential in fields such as machine translation, intelligent customer service, healthcare, and finance. However, accurately assessing these models' performance, especially their reasoning and problem-solving abilities, remains a central concern for researchers. Several benchmark datasets have been developed in recent years to evaluate mathematical reasoning, but as models have rapidly improved, these datasets have become progressively easier to solve and less challenging.


Against this backdrop, the UGMathBench dataset was created. Its problems were carefully collected, extracted, and organized from an online homework grading system and cover 16 subjects, including basic arithmetic, single-variable calculus, multivariable calculus, differential equations, and probability, for a total of 5,062 problems. Unlike previous datasets, UGMathBench provides three randomized versions of each problem, dynamically varying the numbers in the problem statement, which allows a more realistic evaluation of language models' reasoning capabilities.
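For readers who want to inspect the data, a minimal sketch of loading it with the Hugging Face datasets library might look as follows. The split and field access shown here are assumptions; the actual configuration and schema should be checked on the dataset pages linked at the end of this article.

```python
# Minimal sketch, assuming the dataset can be loaded directly from the Hugging Face Hub.
# The split name and record layout below are illustrative assumptions;
# consult the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("UGMathBench/ugmathbench", split="test")  # split name is an assumption

# Inspect one record (field names depend on the published schema)
print(ds[0])
```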

To ensure the accuracy and fairness of the assessment, the research team also proposed three key metrics: Effective Accuracy (EAcc), Reasoning Gap (Δ), and Robustness Efficiency (RE). Effective Accuracy measures the proportion of problems a model answers correctly in every randomized version; the Reasoning Gap reflects how consistent a model is across the different randomized versions of the same problem; Robustness Efficiency further captures how well a model adapts when the same problem appears in different randomized forms.
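The announcement does not spell out the exact formulas. A plausible reading, and only an assumption here, is that EAcc counts a problem as solved only if all versions are answered correctly, the Reasoning Gap is the difference between average per-version accuracy and EAcc, and Robustness Efficiency is their ratio. Under those assumptions, the metrics could be computed roughly as follows; see the technical report for the official definitions.

```python
# Hedged sketch: compute EAcc, reasoning gap, and robustness efficiency
# from per-version correctness results. The exact definitions are assumptions
# inferred from the metric names, not taken from the paper.

def ugmath_metrics(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps a problem id to a list of booleans, one per randomized version."""
    n = len(results)
    # Effective Accuracy: a problem counts only if every version is answered correctly.
    eacc = sum(all(versions) for versions in results.values()) / n
    # Average accuracy over all (problem, version) pairs.
    aacc = sum(sum(versions) / len(versions) for versions in results.values()) / n
    return {
        "EAcc": eacc,
        "reasoning_gap": aacc - eacc,                            # assumed: Δ = AAcc − EAcc
        "robustness_efficiency": eacc / aacc if aacc else 0.0,   # assumed: RE = EAcc / AAcc
    }

# Toy usage with three randomized versions per problem
print(ugmath_metrics({"p1": [True, True, True], "p2": [True, False, True]}))
```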

Using UGMathBench, the research team conducted a comprehensive evaluation of 23 advanced language models, including both commercial closed-source models and open-source models. The results show that even models with strong reasoning capabilities face significant challenges on UGMathBench. This not only reveals the current limitations of language models but also provides an important reference for developing models with stronger reasoning capabilities in the future.

The release of UGMathBench provides new tools and methods for evaluating the mathematical reasoning capabilities of language models and helps researchers gain a deeper understanding of how these models reason internally. The dataset is now publicly available for download; researchers and developers can access it and the accompanying technical report through the links below to further explore the potential of language models in mathematical reasoning.

Dataset download links:

https://www.modelscope.cn/datasets/xinxu02/UGMathBench

https://huggingface.co/datasets/UGMathBench/ugmathbench

Technical report:

https://arxiv.org/abs/2501.13766