The giants of the AI world have just suffered an unexpected setback. When models such as GPT-5, Claude Opus 4.1, and Gemini 2.5, often described as the crown jewels of artificial intelligence, faced Scale AI's newly released SWE-BENCH PRO programming benchmark, all of them fell short, with none breaking the 25% solve-rate threshold.

The news hit the AI industry like a heavy blow. GPT-5 managed only 23.3%, followed closely by Claude Opus 4.1 at 22.7%, while Google's Gemini 2.5 fell to a dismal 13.5%. These numbers carry a chilling message: even today's most advanced models struggle when facing genuinely complex programming tasks.


However, looking beyond the headline numbers, the picture is more complicated. Neil Chowdhury, a former OpenAI researcher, offered an analysis that reveals another dimension of the story. He found that GPT-5 achieved an accuracy of 63% on the tasks it chose to attempt, far surpassing Claude Opus 4.1's 31%. In other words, although GPT-5 looks mediocre in overall score, it retains a clear edge on the problems it is actually willing to tackle.
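As a rough sanity check (a back-of-envelope sketch, not a figure from the report: the ~37% attempt rate below is inferred from the two published numbers rather than reported directly), the relationship between the overall score and the accuracy on attempted tasks works out cleanly:

```python
# Back-of-envelope: overall solve rate = attempt rate x accuracy on attempted tasks.
# The 23.3% and 63% figures come from the article; the attempt rate is inferred, not reported.

overall_score = 0.233          # GPT-5's reported SWE-BENCH PRO solve rate
accuracy_on_attempted = 0.63   # accuracy on tasks GPT-5 chose to attempt (Chowdhury's analysis)

implied_attempt_rate = overall_score / accuracy_on_attempted
print(f"Implied attempt rate: {implied_attempt_rate:.1%}")        # ~37.0%
print(f"Implied unanswered rate: {1 - implied_attempt_rate:.1%}")  # ~63.0%
```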

So what caused these former champions to stumble so badly on the new test? The answer lies in the design philosophy of SWE-BENCH PRO. The benchmark, recently released by Scale AI, acts like a sharp scalpel, built specifically to probe the true capabilities of current AI models.


Compared with earlier benchmarks such as SWE-Bench Verified, where leading models routinely scored around 70%, the difficulty of SWE-BENCH PRO is not just a matter of harsher numbers. The team deliberately excluded data that might have appeared in model training sets, tackling the long-standing problem of data contamination in AI evaluations. As a result, models can no longer coast on memorized answers; they have to demonstrate genuine reasoning and problem-solving ability.

SWE-BENCH PRO spans a wide range of problems, comprising 1,865 real-world issues drawn from commercial applications and developer tools. The problems are divided into a public set, a commercial set, and a held-out set, ensuring that every model faces genuinely unseen challenges during evaluation, as sketched below. More notably, the research team introduced a human augmentation step into the process, further increasing the complexity and realism of the tasks.
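For a concrete picture of what scoring such a partitioned benchmark might look like, here is a minimal sketch; the field names and subset labels are illustrative assumptions, not Scale AI's actual schema:

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative only: field names and subset labels are assumptions, not Scale AI's schema.
@dataclass
class TaskResult:
    task_id: str
    subset: str        # e.g. "public", "commercial", or "held_out"
    language: str      # e.g. "javascript", "typescript", "python"
    resolved: bool     # did the model's patch pass the tests?

def solve_rate_by_subset(results: list[TaskResult]) -> dict[str, float]:
    """Report resolved/total per subset so contamination-resistant splits are scored separately."""
    totals, solved = defaultdict(int), defaultdict(int)
    for task in results:
        totals[task.subset] += 1
        solved[task.subset] += task.resolved
    return {subset: solved[subset] / totals[subset] for subset in totals}
```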


The results lay bare the weaknesses of current AI models: their ability to resolve real-world commercial issues remains clearly limited. In mainstream languages such as JavaScript and TypeScript in particular, solve rates varied dramatically from model to model. Deeper analysis showed that different models handle similar tasks very differently, reflecting fundamental differences in their technical approaches and training strategies.


More importantly, GPT-5's unanswered rate of 63.1% holds up a mirror to the current state of AI development; the figure is consistent with the numbers above, since attempting roughly 37% of tasks at 63% accuracy works out to about 23% overall. Even the most advanced models often choose to stay silent rather than risk a wrong answer when a challenge gets hard. That caution reflects a degree of self-awareness, but it also sounds a warning bell for the industry's technical progress.

This benchmark is not just another technical assessment; it is a probing examination of where the AI industry actually stands. It reminds us that, despite remarkable achievements in certain domains, artificial intelligence still has a long way to go in complex, real-world application scenarios.