MetaGPT Launches RealDevWorld: 92% Accuracy Outperforms Claude, End-to-End Testing Reimagines the Future of AI Development!

AIbase基地

Published in AI News · 6 min read · Sep 3, 2025

The MetaGPT team recently launched RealDevWorld, an end-to-end automated testing tool that has sparked discussion in the AI-driven software development field. RealDevWorld achieved a 92% accuracy rate on the RealDevBench benchmark, and its evaluation consistency surpassed that of advanced models such as Claude.

RealDevWorld: A Revolutionary Breakthrough in End-to-End Automated Testing

RealDevWorld is a new automated testing tool that MetaGPT built on its multi-agent framework, aiming for full-process autonomy from code generation to quality assurance. Its AppEvalPilot module simulates the systematic workflow of professional testers, performing acceptance testing against the product design and scenario boundaries, and it supports continuous, comprehensive testing around the clock (24/7).

Unlike traditional testing tools, RealDevWorld uses a dynamic evaluation mechanism that overcomes the limitations of static benchmark testing and adapts in real time to complex development scenarios. It can assess 15-20 functional components of an application in 8-9 minutes on average, at a cost of about $0.26 per test, which significantly reduces testing costs for development teams.

92% Accuracy, Exceeding Claude's Evaluation Consistency

On the RealDevBench benchmark, RealDevWorld achieved a 92% accuracy rate and exceeded Anthropic's Claude model in evaluation consistency. This result was made possible by optimizations to MetaGPT's multi-agent collaboration framework, which combines GPT-4o and Claude 3.5 Sonnet.
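The article does not say how the 92% figure is computed. As a minimal sketch, assuming accuracy means the share of per-test-case pass/fail verdicts that match a human-labeled reference, the calculation might look like the following. The function name `verdict_accuracy` and the sample verdicts are hypothetical and are not taken from RealDevWorld or RealDevBench.

```python
from typing import Sequence


def verdict_accuracy(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Fraction of automated verdicts that agree with the reference verdicts.

    Assumption: each entry is one test case, True = pass, False = fail.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one test case is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical data: 25 test cases where the automated evaluator
# disagrees with the human label on two of them.
reference = [True] * 15 + [False] * 10
predicted = list(reference)
predicted[3] = not predicted[3]    # evaluator marked a passing case as failed
predicted[20] = not predicted[20]  # evaluator marked a failing case as passed

print(verdict_accuracy(predicted, reference))  # 23 of 25 agree -> 0.92
```

Under this reading, a 92% accuracy rate would mean the evaluator's verdict matched the reference on 92 of every 100 test cases; the benchmark's actual metric may be defined differently.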

Through intelligent task decomposition and collaboration mechanisms, RealDevWorld can accurately identify potential issues in code and generate high-quality test reports. AIbase's analysis suggests that this advantage lets it handle complex software engineering tasks such as code generation, debugging, and verification, and makes it especially suitable for enterprise-level applications that require high reliability.

Full-Process Autonomy: From Code Generation to Quality Assurance

Unified Code Base Supporting Three Platforms

A major highlight of RealDevWorld is its unified code base, which supports desktop, mobile, and web platforms. Developers do not need to write separate test scripts for each platform, which greatly simplifies cross-platform testing. Whether the task is UI validation for web applications, interaction testing for mobile applications, or functional evaluation for desktop software, RealDevWorld provides a consistent testing experience.

Through deep integration with MetaGPT's multi-agent architecture, RealDevWorld can automatically generate test cases, execute regression tests, and provide detailed diagnostic feedback. Its dynamic evaluation mechanism adjusts testing strategies in real time as the application is updated, so that test results stay closely aligned with actual requirements.

Low Cost, High Efficiency: Redefining Testing Economics

Beyond its performance, RealDevWorld is also cost-effective. According to official data, the tool can evaluate 15-20 functional components in 8-9 minutes, with each test costing about $0.26. This combination of speed and low cost makes it a fit for small and medium-sized development teams as well as large enterprises.
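To put the quoted figures in per-component terms, the arithmetic can be sketched as below. Only the three constants come from the article; the helper names and the example of 10 runs per day are illustrative assumptions.

```python
# Figures quoted in the article.
COST_PER_RUN_USD = 0.26
COMPONENTS_PER_RUN = (15, 20)  # (fewest, most) components covered by one run
MINUTES_PER_RUN = (8, 9)       # (shortest, longest) average run time


def per_component_cost_range(cost: float, components: tuple[int, int]) -> tuple[float, float]:
    """Return (lowest, highest) cost per component for one run."""
    fewest, most = components
    return cost / most, cost / fewest


def per_component_minutes_range(minutes: tuple[int, int], components: tuple[int, int]) -> tuple[float, float]:
    """Return (lowest, highest) minutes per component for one run."""
    shortest, longest = minutes
    fewest, most = components
    return shortest / most, longest / fewest


def monthly_cost(runs_per_day: int, days: int = 30, cost: float = COST_PER_RUN_USD) -> float:
    """Total cost of running the evaluation a fixed number of times per day."""
    return runs_per_day * days * cost


low_cost, high_cost = per_component_cost_range(COST_PER_RUN_USD, COMPONENTS_PER_RUN)
low_min, high_min = per_component_minutes_range(MINUTES_PER_RUN, COMPONENTS_PER_RUN)

print(round(low_cost, 3), round(high_cost, 3))  # 0.013 0.017
print(round(low_min, 2), round(high_min, 2))    # 0.4 0.6
print(round(monthly_cost(10), 2))               # hypothetical 10 runs/day -> 78.0
```

At these rates, each functional component costs roughly one to two cents and takes well under a minute to evaluate.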

AIbase believes that RealDevWorld will significantly lower the barriers to testing in AI-driven development, helping developers deliver high-quality software more quickly.

Future Outlook: A New Industry Benchmark for AI Testing

The release of RealDevWorld marks a major breakthrough for MetaGPT in AI automated testing. Compared with traditional testing frameworks such as Selenium or Cypress, RealDevWorld offers greater flexibility and intelligence through AI-driven dynamic evaluation and multi-agent collaboration. Industry experts predict that the tool may become a benchmark for the software testing field in 2025, especially in agile development environments with rapid iterations.

The MetaGPT team stated that RealDevWorld will continue to be optimized, with support for more programming languages and more complex testing scenarios.

Project Homepage: https://realdevworld.metadl.com/

Paper: https://arxiv.org/pdf/2508.14104

This article is from AIbase Daily
