With the rapid development of artificial intelligence, and of large models in particular, benchmarks face unprecedented challenges in assessing AI capabilities. To address this, Sequoia China announced the launch of a new AI benchmarking tool, xbench, on May 26th. Beyond evaluating model capabilities, the tool introduces a dynamic update mechanism intended to keep its tests effective and fair.
The launch of xbench stems from Sequoia China's attention to progress toward AGI (Artificial General Intelligence) following the release of ChatGPT in 2022. As agents are deployed across more and more fields, traditional static benchmarks are proving inadequate, failing to reflect models' actual capabilities. xbench therefore adopts a dual-track evaluation system: on one hand, it builds multi-dimensional evaluation datasets to track the theoretical ceiling of model capabilities; on the other, it focuses on the practical, real-world value of agents, thus achieving a comprehensive assessment of AI technology.
In terms of evaluation method, xbench adopts an "evergreen" assessment mechanism: its test sets are dynamically updated to keep pace with rapid technical iteration. This improves the reliability of the results and guards against problems such as question leakage, preserving the fairness of the assessment. In the past, many models have been accused of "gaming the rankings" through leaked question banks, and xbench was designed precisely to eliminate that risk.
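The article does not describe how xbench implements its evergreen mechanism. As a purely illustrative sketch of the general idea (all names and structure are hypothetical, not xbench's actual design), a dynamic benchmark can regenerate concrete test items from templates each round, so any leaked instance quickly goes stale:

```python
import hashlib
import random

# Hypothetical sketch of an "evergreen" benchmark: each evaluation round
# instantiates fresh questions from the same templates, so memorizing a
# leaked instance does not help on the next round.

def make_item(template: str, seed: int) -> dict:
    """Instantiate one question template with seed-dependent parameters."""
    rng = random.Random(seed)
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    question = template.format(a=a, b=b)
    answer = str(a * b)
    # A content hash lets graders detect verbatim reuse of old instances.
    item_id = hashlib.sha256(question.encode()).hexdigest()[:12]
    return {"id": item_id, "question": question, "answer": answer}

def build_round(templates: list[str], round_no: int) -> list[dict]:
    """Derive a fresh item set for a given evaluation round."""
    return [make_item(t, seed=round_no * 1000 + i)
            for i, t in enumerate(templates)]

templates = ["What is {a} * {b}?"]
round_1 = build_round(templates, round_no=1)
round_2 = build_round(templates, round_no=2)
# Different rounds yield different concrete questions from one template.
assert round_1[0]["question"] != round_2[0]["question"]
```

Real agent benchmarks rotate far richer material than arithmetic templates, but the principle is the same: the scored instances are ephemeral while the underlying capability being measured stays fixed.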
Beyond the core evaluation system, Sequoia China has also built evaluation methodologies for vertical-domain agents into xbench, particularly for recruitment and marketing applications. As AI agents mature, deep search, information gathering, and reasoning and analysis capabilities have become key to progress toward AGI. To evaluate these capabilities effectively, xbench will pay particular attention to how multimodal models with chain-of-thought reasoning perform at generating commercial videos, and to the trustworthiness of GUI agents operating in dynamically updated applications.
The launch of xbench not only sets a new standard for evaluating AI agents but also gives the industry a sustainable assessment tool for keeping up with the continuing evolution of AI technology.