Microsoft's 14B-Parameter Model Challenges a 671B Giant: Agentic Reinforcement Learning Redefines Mathematical Reasoning

AIbase

Published in AI News · 4 min read · Sep 8, 2025

Microsoft Research's open-source rStar2-Agent model has drawn attention in AI mathematical reasoning. Trained with agentic reinforcement learning, the 14-billion-parameter model outperforms the 671-billion-parameter DeepSeek-R1 on several mathematical benchmarks.

The core innovation of rStar2-Agent is that it moves beyond pure chain-of-thought (CoT) reasoning to an agentic interaction loop. The model plans its own reasoning, calls a Python code-execution tool to verify intermediate results, and revises its steps based on the tool's feedback, which limits the error accumulation common in long CoT traces.
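The plan, execute, and revise loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft's implementation: `scripted_policy` is a hypothetical stand-in for the model, `run_python` is a hypothetical tool that executes code in-process (a real system would isolate execution), and `solve` is a hypothetical driver.

```python
import contextlib
import io
from typing import Optional


def run_python(code: str) -> str:
    """Hypothetical tool: execute a snippet and return its stdout or an error line."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; real systems sandbox tool calls
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def scripted_policy(history: list) -> dict:
    """Hypothetical stand-in for the model: chooses the next action from tool feedback."""
    if not history:
        # First attempt contains a deliberate bug: n is never defined.
        return {"type": "code", "content": "print(sum(k * k for k in range(1, n + 1)))"}
    if history[-1].startswith("ERROR"):
        # The tool reported a failure, so revise the step instead of building on it.
        return {"type": "code", "content": "n = 10\nprint(sum(k * k for k in range(1, n + 1)))"}
    # The last tool call succeeded; report its output as the final answer.
    return {"type": "answer", "content": history[-1]}


def solve(policy, max_turns: int = 5) -> Optional[str]:
    """Alternate between policy actions and tool feedback until an answer is given."""
    history = []
    for _ in range(max_turns):
        action = policy(history)
        if action["type"] == "answer":
            return action["content"]
        history.append(run_python(action["content"]))
    return None  # turn budget exhausted without a final answer
```

Here `solve(scripted_policy)` returns `"385"`: the first tool call fails with a `NameError`, the policy sees the error in its history and corrects the code, and only the verified result is reported as the answer.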

rStar2-Agent performed strongly on competition benchmarks such as the American Invitational Mathematics Examination (AIME). On AIME24, its pass@1 accuracy reached 80.6%, ahead of DeepSeek-R1 (79.8%), o3-mini (79.6%), and Claude Opus 4.0 (77.0%). It scored 69.8% on AIME25 and 52.7% on HMMT25.
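For readers unfamiliar with the metric, pass@1 is commonly estimated as the fraction of sampled responses that are correct, averaged over problems. The sketch below assumes that convention; the article does not state how many samples per problem were used, and the function name and example data are hypothetical. Since an AIME year comprises 30 problems, a score such as 80.6% presumably reflects averaging over several samples per problem.

```python
def pass_at_1(results: dict) -> float:
    """Average, over problems, of the fraction of sampled answers that were correct.

    `results` maps a problem id to a list of booleans, one per sampled response.
    """
    per_problem = [sum(samples) / len(samples) for samples in results.values()]
    return sum(per_problem) / len(per_problem)


# Hypothetical data: three problems, four sampled responses each.
samples = {
    "problem_1": [True, True, False, True],
    "problem_2": [False, False, False, False],
    "problem_3": [True, True, True, True],
}
print(round(pass_at_1(samples), 3))  # 0.583
```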

Notably, rStar2-Agent's responses are also significantly shorter: about 9,340 tokens on average on AIME24 and about 10,943 on AIME25, roughly half the length of DeepSeek-R1's, which points to more efficient reasoning.

Training is efficient as well: the model completed 510 reinforcement learning steps in just one week on 64 MI300X GPUs. Its reinforcement learning infrastructure supports up to 45,000 concurrent tool calls per step, with an average latency of only 0.3 seconds.
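A quick back-of-the-envelope calculation puts these figures in perspective. It assumes the week was exactly seven full days of wall-clock time and treats the 45,000 calls per step as a ceiling reached at every step, so the total is an upper bound rather than a reported number.

```python
MINUTES_PER_WEEK = 7 * 24 * 60        # 10,080 minutes
RL_STEPS = 510                        # from the article
MAX_TOOL_CALLS_PER_STEP = 45_000      # peak concurrency, from the article

# Average wall-clock budget per RL step, assuming a full seven-day week.
minutes_per_step = MINUTES_PER_WEEK / RL_STEPS

# Upper bound on tool calls over the whole run if every step hit the peak.
max_total_tool_calls = RL_STEPS * MAX_TOOL_CALLS_PER_STEP

print(f"{minutes_per_step:.1f} minutes per step")    # 19.8 minutes per step
print(f"{max_total_tool_calls:,} tool calls at most")  # 22,950,000 tool calls at most
```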

The model introduces the GRPO-RoC algorithm to handle environment noise from code execution. Its Resample-on-Correct (RoC) strategy keeps the higher-quality reasoning trajectories among the correct rollouts, which improves training effectiveness.
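One plausible reading of this strategy is sketched below: oversample rollouts, keep the correct ones with the fewest tool errors, sample the incorrect ones uniformly, and then compute group-normalized advantages as in standard GRPO. The `Rollout` fields, the rule that preserves the correct-to-incorrect ratio, and the use of the population standard deviation are all assumptions made for illustration; the released code may differ.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool      # final answer matched the reference
    tool_errors: int   # failed code executions within the trajectory


def resample_on_correct(rollouts, group_size, rng):
    """Downsample an oversampled group, preferring clean correct trajectories."""
    positives = [r for r in rollouts if r.correct]
    negatives = [r for r in rollouts if not r.correct]
    # Assumption: keep roughly the same correct/incorrect ratio as the oversampled group.
    n_pos = min(len(positives), round(group_size * len(positives) / len(rollouts)))
    n_neg = min(len(negatives), group_size - n_pos)
    # Correct rollouts: keep those with the fewest tool errors.
    kept_pos = sorted(positives, key=lambda r: r.tool_errors)[:n_pos]
    # Incorrect rollouts: uniform sample, so varied failure modes remain.
    kept_neg = rng.sample(negatives, n_neg)
    return kept_pos + kept_neg


def group_advantages(rollouts):
    """GRPO-style advantage: reward minus group mean, divided by group std."""
    rewards = [1.0 if r.correct else 0.0 for r in rollouts]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all rewards equal: no learning signal
    return [(x - mean) / std for x in rewards]


# Hypothetical oversampled group: 4 correct (with varying tool errors), 4 incorrect.
oversampled = [Rollout(True, e) for e in (0, 3, 1, 5)] + [Rollout(False, e) for e in (2, 0, 4, 1)]
group = resample_on_correct(oversampled, group_size=4, rng=random.Random(0))
advantages = group_advantages(group)
```

In this example the two retained correct rollouts are those with 0 and 1 tool errors, so a correct answer reached through many failed tool calls is not reinforced as strongly as a clean one.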

rStar2-Agent also generalizes beyond mathematics: it outperforms DeepSeek-V3 on the GPQA-Diamond scientific reasoning benchmark and performs well on the BFCL v3 tool-use benchmark and on general benchmarks such as IFEval and Arena-Hard, suggesting that agentic reinforcement learning benefits general capabilities too.

Microsoft has open-sourced the code and training recipe for rStar2-Agent, which implements multi-stage reinforcement learning on top of the VERL framework. The result indicates that, with well-designed training strategies, small models can match much larger models on specific tasks, opening new possibilities for researchers and developers with limited resources.

rStar2-Agent · AI Mathematical Reasoning · Agent Interaction Mechanism · Microsoft Research

This article is from AIbase Daily

—— Created by the AIbase Daily Team
