Recently, OpenAI launched a new benchmark that compares the performance of AI models with that of human professionals across a range of industries. The benchmark, named GDPval, is OpenAI's attempt to measure how close its AI systems are to outperforming humans at economically valuable work. According to OpenAI, its GPT-5 model and Anthropic's Claude Opus 4.1 already produce work that approaches the quality of industry experts in certain areas.
However, OpenAI also pointed out that these models will not immediately replace human workers. Although some executives predict that artificial intelligence will take over human jobs within a few years, OpenAI acknowledges that GDPval currently covers only a small portion of the tasks people perform at work. It is therefore just one of several ways to gauge the progress of artificial intelligence.
GDPval covers the nine industries that contribute most to U.S. Gross Domestic Product (GDP), including healthcare, finance, manufacturing, and government. It evaluates AI performance on tasks drawn from 44 occupations in these industries, ranging from software engineers to nurses and journalists. For the initial version of the test, OpenAI asked professionals to compare AI-generated reports with reports written by other professionals and to pick the better one. For example, investment bankers were asked to write a competitor analysis of the last-mile delivery industry, and those reports were then compared with AI-generated ones. OpenAI then calculated how often the AI models "won" across the 44 occupations.
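The tally described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the verdicts are invented sample data, the function name `win_or_tie_rate` is hypothetical, and averaging the per-occupation rates with equal weight is an assumed aggregation, not OpenAI's published method.

```python
from collections import defaultdict

# Invented sample verdicts, NOT GDPval data. Each entry is
# (occupation, verdict), with the verdict given from the AI model's side.
verdicts = [
    ("investment banker", "win"),
    ("investment banker", "loss"),
    ("nurse", "tie"),
    ("nurse", "loss"),
    ("software engineer", "win"),
    ("software engineer", "win"),
]


def win_or_tie_rate(verdicts):
    """Return per-occupation win-or-tie rates and their unweighted mean."""
    counts = defaultdict(lambda: [0, 0])  # occupation -> [wins_or_ties, total]
    for occupation, verdict in verdicts:
        counts[occupation][1] += 1
        if verdict in ("win", "tie"):
            counts[occupation][0] += 1
    per_occupation = {
        occupation: good / total for occupation, (good, total) in counts.items()
    }
    # Assumption: every occupation carries equal weight in the overall score.
    overall = sum(per_occupation.values()) / len(per_occupation)
    return per_occupation, overall


per_occupation, overall = win_or_tie_rate(verdicts)
print(per_occupation)
print(f"overall win-or-tie rate: {overall:.1%}")
```

Counting ties together with wins mirrors the "better than or on par with" wording used for the reported figures; a stricter wins-only rate would need just a one-line change to the condition.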
According to the reported results, GPT-5-high, an enhanced version of GPT-5, was rated better than or on par with industry experts on 40.6% of tasks. Anthropic's Claude Opus 4.1 reached 49% on the same measure. OpenAI believes Claude's higher score owes much to its tendency to produce visually appealing graphics, rather than to stronger underlying performance.
Notably, the responsibilities of most professionals go far beyond submitting research reports, so the scope of GDPval-v0 is relatively limited. OpenAI stated that it plans to develop more comprehensive tests in the future that cover more industries and interactive workflows. Nevertheless, the company remains optimistic about the progress its models have shown on GDPval.
Aaron Chatterji, OpenAI's Chief Economist, said in an interview that the GDPval results suggest people in these professions can now use AI models to free up time for more meaningful tasks. As model capabilities improve, professionals will be able to hand off part of their workload to these tools and focus on higher-value work.
Blog: https://openai.com/index/gdpval/
Key Points:
🌟 OpenAI released a new benchmark called GDPval to evaluate the performance of AI models across multiple industries; the results show model capabilities gradually approaching those of human experts.
🤖 GPT-5-high performed better than or on par with industry experts on 40.6% of tasks across 44 occupations, while Claude Opus 4.1 reached 49%.
📈 OpenAI plans to launch a more comprehensive test in the future to more accurately assess the capabilities and performance of AI in real-world work.