Amid the rapid development of large language models (LLMs), Alibaba's Tongyi Qwen team recently introduced a reinforcement learning method called Soft Adaptive Policy Optimization (SAPO). Its core goal is to address the instability of policy optimization when large language models are trained with reinforcement learning.

Traditional reinforcement learning methods such as GRPO and GSPO rely on hard clipping to constrain the importance ratio and keep updates stable. This approach has inherent drawbacks. First, overly strict clipping discards useful learning signals; in GSPO in particular, a few poorly behaved tokens can cause the gradient of the entire sequence to be dropped. Second, tuning the clipping range is delicate: if it is too narrow, many samples contribute no gradient at all; if it is too wide, noise leaks in and undermines the very stability the clipping was meant to provide. These problems are especially pronounced in large-scale mixture-of-experts (MoE) models.
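To make the problem concrete, here is a minimal sketch (not the exact GRPO/GSPO implementation) of a hard-clipped, PPO-style token objective; the epsilon value and the tensor values are purely illustrative. Tokens whose importance ratio leaves the clip range in the direction that would increase the objective contribute no gradient at all:

```python
import torch

def clipped_token_objective(log_ratio, advantage, eps=0.2):
    """Per-token hard-clipped surrogate (illustrative values, not the papers' code)."""
    ratio = log_ratio.exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum means that once the ratio leaves the trust region in
    # the "favorable" direction, the token's term becomes a constant and its
    # gradient with respect to the policy is exactly zero.
    return torch.minimum(unclipped, clipped)

log_ratio = torch.tensor([0.05, 0.8, -0.9], requires_grad=True)  # per-token log(pi_new / pi_old)
advantage = torch.tensor([1.0, 1.0, -1.0])
clipped_token_objective(log_ratio, advantage).sum().backward()
print(log_ratio.grad)  # zero for the two tokens that drifted outside the clip range
```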
To address these challenges, the Qwen team proposed SAPO, a reinforcement learning method designed to improve both the stability and the performance of large language models. SAPO replaces traditional hard clipping with a smooth, temperature-controlled gate function, retaining more useful gradients while preserving stability. Its design has three key properties:
1. Continuous trust region: Avoids the discontinuity issues caused by hard clipping.
2. Sequence-level consistency: Ensures no entire sequences are discarded, preserving more information.
3. Token-level adaptability: Reduces the impact of abnormal tokens on the overall learning.
Furthermore, SAPO uses an asymmetric temperature design for positive and negative tokens, smoothing the two cases differently and further improving learning. Experiments show that SAPO delivers consistent gains across dense and MoE models of various scales.
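SAPO's exact gate function is defined in the paper and is not reproduced here; the sketch below is only a hypothetical illustration of the idea, using an assumed sigmoid-shaped gate and made-up temperature values (tau_pos, tau_neg), so that tokens far from the trust region are smoothly down-weighted rather than cut off, and positive and negative tokens are treated asymmetrically:

```python
import torch

def soft_gated_objective(log_ratio, advantage, tau_pos=0.3, tau_neg=0.6):
    """Soft, temperature-controlled gate; the gate shape and temperatures are assumptions."""
    ratio = log_ratio.exp()
    # Asymmetric temperatures: tokens with positive vs. negative advantage get
    # a different gate sharpness (the paper's actual choice may differ).
    tau = torch.where(advantage >= 0,
                      torch.full_like(advantage, tau_pos),
                      torch.full_like(advantage, tau_neg))
    # Smooth gate: close to 1 while the ratio stays near 1, decaying gradually
    # (but never to exactly zero) as it drifts away, so every token keeps a
    # damped gradient instead of being cut off at a hard boundary.
    gate = torch.sigmoid((1.0 - (ratio - 1.0).abs()) / tau)
    return gate * ratio * advantage

log_ratio = torch.tensor([0.05, 0.8, -0.9], requires_grad=True)
advantage = torch.tensor([1.0, 1.0, -1.0])
soft_gated_objective(log_ratio, advantage).sum().backward()
print(log_ratio.grad)  # all tokens still contribute a (down-weighted) gradient
```

Unlike the hard-clipped objective above, this kind of gate is continuous and differentiable everywhere, which is what the "continuous trust region" and "token-level adaptability" properties refer to.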
To verify the method's effectiveness, the Qwen team conducted a comprehensive evaluation. On mathematical reasoning, code generation, logical reasoning, and multimodal mathematical reasoning tasks, SAPO outperformed GRPO and GSPO. The result marks an innovation from Alibaba Tongyi in large language model training and points to new directions for future AI research.
Paper link: https://arxiv.org/abs/2511.20347
