Amid the rapid development of large language models (LLMs), Alibaba's Tongyi Qwen team recently introduced a reinforcement learning method called Soft Adaptive Policy Optimization (SAPO). Its core goal is to address the instability of policy optimization when large language models are trained with reinforcement learning.

Traditional reinforcement learning methods such as GRPO and GSPO rely on hard clipping to constrain the importance ratio and keep updates stable. This approach has inherent drawbacks. First, overly strict clipping discards useful learning signals; in GSPO in particular, a few poorly behaved tokens can cause the gradient of the entire sequence to be dropped. Second, tuning the clipping range is delicate: if it is too narrow, many samples contribute no gradient at all; if it is too wide, noise leaks in and undermines the very stability the clipping was meant to provide. These problems are especially pronounced in large-scale mixture-of-experts (MoE) models.
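To make the problem concrete, here is a minimal sketch (not the exact GRPO/GSPO implementation) of a hard-clipped, PPO-style token objective; the epsilon value and the tensor values are purely illustrative. Tokens whose importance ratio leaves the clip range in the direction that would increase the objective contribute no gradient at all:

```python
import torch

def clipped_token_objective(log_ratio, advantage, eps=0.2):
    """Per-token hard-clipped surrogate (illustrative values, not the papers' code)."""
    ratio = log_ratio.exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum means that once the ratio leaves the trust region in
    # the "favorable" direction, the token's term becomes a constant and its
    # gradient with respect to the policy is exactly zero.
    return torch.minimum(unclipped, clipped)

log_ratio = torch.tensor([0.05, 0.8, -0.9], requires_grad=True)  # per-token log(pi_new / pi_old)
advantage = torch.tensor([1.0, 1.0, -1.0])
clipped_token_objective(log_ratio, advantage).sum().backward()
print(log_ratio.grad)  # zero for the two tokens that drifted outside the clip range
```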
To address these challenges, the Qwen team proposed SAPO, a reinforcement learning method designed to improve both the stability and the performance of large language models. SAPO replaces traditional hard clipping with a smooth, temperature-controlled gate function, retaining more useful gradients while preserving stability. Its design has three key properties:
1. Continuous trust region: Avoids the discontinuity issues caused by hard clipping.
2. Sequence-level consistency: Ensures no entire sequences are discarded, preserving more information.
3. Token-level adaptability: Reduces the impact of abnormal tokens on the overall learning.
Furthermore, SAPO uses an asymmetric temperature design for positive and negative tokens, smoothing the two cases differently and further improving learning. Experiments show that SAPO delivers consistent gains across dense and MoE models of various scales.
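SAPO's exact gate function is defined in the paper and is not reproduced here; the sketch below is only a hypothetical illustration of the idea, using an assumed sigmoid-shaped gate and made-up temperature values (tau_pos, tau_neg), so that tokens far from the trust region are smoothly down-weighted rather than cut off, and positive and negative tokens are treated asymmetrically:

```python
import torch

def soft_gated_objective(log_ratio, advantage, tau_pos=0.3, tau_neg=0.6):
    """Soft, temperature-controlled gate; the gate shape and temperatures are assumptions."""
    ratio = log_ratio.exp()
    # Asymmetric temperatures: tokens with positive vs. negative advantage get
    # a different gate sharpness (the paper's actual choice may differ).
    tau = torch.where(advantage >= 0,
                      torch.full_like(advantage, tau_pos),
                      torch.full_like(advantage, tau_neg))
    # Smooth gate: close to 1 while the ratio stays near 1, decaying gradually
    # (but never to exactly zero) as it drifts away, so every token keeps a
    # damped gradient instead of being cut off at a hard boundary.
    gate = torch.sigmoid((1.0 - (ratio - 1.0).abs()) / tau)
    return gate * ratio * advantage

log_ratio = torch.tensor([0.05, 0.8, -0.9], requires_grad=True)
advantage = torch.tensor([1.0, 1.0, -1.0])
soft_gated_objective(log_ratio, advantage).sum().backward()
print(log_ratio.grad)  # all tokens still contribute a (down-weighted) gradient
```

Unlike the hard-clipped objective above, this kind of gate is continuous and differentiable everywhere, which is what the "continuous trust region" and "token-level adaptability" properties refer to.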
To verify the method's effectiveness, the Qwen team conducted a comprehensive evaluation. On mathematical reasoning, code generation, logical reasoning, and multimodal mathematical reasoning tasks, SAPO outperformed GRPO and GSPO. The result marks an innovation from Alibaba Tongyi in large language model training and points to new directions for future AI research.
Paper link: https://arxiv.org/abs/2511.20347
