The giants of the AI world have just suffered an unexpected setback. When models such as GPT-5, Claude Opus 4.1, and Gemini 2.5, often described as the crown jewels of artificial intelligence, faced Scale AI's newly released SWE-BENCH PRO programming benchmark, all of them fell short, with none breaking the 25% solve-rate threshold.

The news hit the AI industry like a heavy blow. GPT-5 managed only 23.3%, followed closely by Claude Opus 4.1 at 22.7%, while Google's Gemini 2.5 fell to a dismal 13.5%. These numbers carry a chilling message: even today's most advanced models struggle when facing genuinely complex programming tasks.


However, looking beyond the headline numbers, the picture is more complicated. Neil Chowdhury, a former OpenAI researcher, offered an analysis that reveals another dimension of the story. He found that GPT-5 achieved an accuracy of 63% on the tasks it chose to attempt, far surpassing Claude Opus 4.1's 31%. In other words, although GPT-5 looks mediocre in overall score, it retains a clear edge on the problems it is actually willing to tackle.
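As a rough sanity check (a back-of-envelope sketch, not a figure from the report: the ~37% attempt rate below is inferred from the two published numbers rather than reported directly), the relationship between the overall score and the accuracy on attempted tasks works out cleanly:

```python
# Back-of-envelope: overall solve rate = attempt rate x accuracy on attempted tasks.
# The 23.3% and 63% figures come from the article; the attempt rate is inferred, not reported.

overall_score = 0.233          # GPT-5's reported SWE-BENCH PRO solve rate
accuracy_on_attempted = 0.63   # accuracy on tasks GPT-5 chose to attempt (Chowdhury's analysis)

implied_attempt_rate = overall_score / accuracy_on_attempted
print(f"Implied attempt rate: {implied_attempt_rate:.1%}")        # ~37.0%
print(f"Implied unanswered rate: {1 - implied_attempt_rate:.1%}")  # ~63.0%
```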

So what caused these former champions to stumble so badly on the new test? The answer lies in the design philosophy of SWE-BENCH PRO. The benchmark, recently released by Scale AI, acts like a sharp scalpel, built specifically to probe the true capabilities of current AI models.


Compared with earlier benchmarks such as SWE-Bench Verified, where leading models routinely scored around 70%, the difficulty of SWE-BENCH PRO is not just a matter of harsher numbers. The team deliberately excluded data that might have appeared in model training sets, tackling the long-standing problem of data contamination in AI evaluations. As a result, models can no longer coast on memorized answers; they have to demonstrate genuine reasoning and problem-solving ability.

SWE-BENCH PRO spans a wide range of problems, comprising 1,865 real-world issues drawn from commercial applications and developer tools. The problems are divided into a public set, a commercial set, and a held-out set, ensuring that every model faces genuinely unseen challenges during evaluation, as sketched below. More notably, the research team introduced a human augmentation step into the process, further increasing the complexity and realism of the tasks.
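For a concrete picture of what scoring such a partitioned benchmark might look like, here is a minimal sketch; the field names and subset labels are illustrative assumptions, not Scale AI's actual schema:

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative only: field names and subset labels are assumptions, not Scale AI's schema.
@dataclass
class TaskResult:
    task_id: str
    subset: str        # e.g. "public", "commercial", or "held_out"
    language: str      # e.g. "javascript", "typescript", "python"
    resolved: bool     # did the model's patch pass the tests?

def solve_rate_by_subset(results: list[TaskResult]) -> dict[str, float]:
    """Report resolved/total per subset so contamination-resistant splits are scored separately."""
    totals, solved = defaultdict(int), defaultdict(int)
    for task in results:
        totals[task.subset] += 1
        solved[task.subset] += task.resolved
    return {subset: solved[subset] / totals[subset] for subset in totals}
```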


The results lay bare the weaknesses of current AI models: their ability to resolve real-world commercial issues remains clearly limited. In mainstream languages such as JavaScript and TypeScript in particular, solve rates varied dramatically from model to model. Deeper analysis showed that different models handle similar tasks very differently, reflecting fundamental differences in their technical approaches and training strategies.


More importantly, GPT-5's unanswered rate of 63.1% holds up a mirror to the current state of AI development; the figure is consistent with the numbers above, since attempting roughly 37% of tasks at 63% accuracy works out to about 23% overall. Even the most advanced models often choose to stay silent rather than risk a wrong answer when a challenge gets hard. That caution reflects a degree of self-awareness, but it also sounds a warning bell for the industry's technical progress.

This benchmark is not just another technical assessment; it is a probing examination of where the AI industry actually stands. It reminds us that, despite remarkable achievements in certain domains, artificial intelligence still has a long way to go in complex, real-world application scenarios.