A new evaluation benchmark called RBench-V has recently been released by research teams from Tsinghua University, Tencent Hunyuan, Stanford University, and Carnegie Mellon University. The benchmark is specifically designed to test the visual reasoning capabilities of multimodal large models, and it aims to fill a gap in current assessment systems: measuring a model's ability to produce visual outputs, enabling a more comprehensive picture of how existing models actually perform.


The RBench-V benchmark includes 803 questions spanning multiple fields, including geometry and graph theory, mechanics and electromagnetism, multi-object recognition, and path planning. Unlike previous evaluations that required only text responses, this benchmark specifically requires models to generate or modify image content to support their reasoning. A model must not only understand the problem but also think the way humans do, for example by drawing auxiliary lines or examining the structure of a figure.

The test results show that even the best-performing model, o3, achieved an accuracy of only 25.8%, far below the 82.3% achieved by human experts. Google's Gemini 2.5 followed, scoring just 20.2%. More worryingly, many open-source models scored between 8% and 10%, with some performing close to random guessing.


The study shows that current multimodal large models often fall back on simplified strategies when handling complex geometric problems. Whereas humans reason through intuitive visualization, most models tend to abstract graphical problems into algebraic expressions, substituting textual reasoning for genuine operations on images. This reflects their shortcomings in deeply understanding visual information.

The research team points out that future models will need to generate images autonomously during reasoning to assist their own thinking if they are to achieve truly "human-like intelligence." They note that emerging approaches such as multimodal reasoning chains and agent-based reasoning may be important paths for the development of artificial intelligence.

For more information, please visit the project homepage: [RBench-V Project Homepage](https://evalmodels.github.io/rbenchv/).

Key Points:  

🔍 Research teams jointly released RBench-V to evaluate the visual reasoning capabilities of multimodal large models.  

📉 The best-performing o3 model scored only 25.8%, far below the human accuracy rate of 82.3%.  

🧩 Current models lack deep understanding when handling visual problems and need improved reasoning methods to advance toward human-like intelligence.