The Tongyi Lab's Intelligent Computing team today officially released FIPO (Future-KL Influenced Policy Optimization), a new post-training algorithm for large models. The algorithm introduces a novel "Future-KL" mechanism that targets "inference length stagnation," a common technical bottleneck in pure reinforcement learning (Pure RL) training.

In training for long-text reasoning and complex logical alignment, traditional reinforcement learning often struggles to assign credit accurately to the key decision points in long sequences. FIPO, developed by the Tongyi team, allocates reward differentially across key tokens, guiding the model to look further ahead while generating its chain of thought (CoT).
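The announcement does not spell out FIPO's formulation, so the following is only a minimal illustrative sketch of the general idea of per-token reward weighting driven by a "future" KL term. Everything here is an assumption for illustration: the function names, the definition of a token's future KL as a suffix sum of per-token policy-vs-reference log-probability gaps, and the `beta` scaling factor are not taken from the published algorithm.

```python
def future_kl_weights(logp_policy, logp_ref):
    """Illustrative only: for each position t, accumulate the per-token KL
    proxy (logp_policy - logp_ref) over the suffix t..T-1, so that tokens
    whose *remaining* generation diverges more from the reference policy
    carry a larger 'future-KL' value."""
    per_token_kl = [p - r for p, r in zip(logp_policy, logp_ref)]
    future_kl = [0.0] * len(per_token_kl)
    acc = 0.0
    for t in reversed(range(len(per_token_kl))):  # suffix sum, right to left
        acc += per_token_kl[t]
        future_kl[t] = acc
    return future_kl


def weighted_advantages(advantage, future_kl, beta=0.1):
    """Spread a sequence-level advantage across tokens non-uniformly:
    tokens with larger future divergence get more credit (or penalty).
    The multiplicative (1 + beta * fk) form is a hypothetical choice."""
    return [advantage * (1.0 + beta * fk) for fk in future_kl]


# Toy usage: two tokens, shared sequence-level advantage of 2.0.
fk = future_kl_weights([-1.0, -2.0], [-1.5, -2.5])
adv = weighted_advantages(2.0, fk, beta=0.1)
```

The point of the sketch is only the credit-assignment shape: a uniform sequence-level reward becomes a per-token signal, so early tokens that commit the model to a long divergent continuation are weighted differently from late ones.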

Experimental results show that, in a pure reinforcement learning setting at the 32B scale, the FIPO-trained model already surpasses similarly sized models such as DeepSeek-Zero-MATH and OpenAI's o1-mini, marking substantial progress for domestic large models in logical reasoning and mathematical computation.

Currently, the focus of competition among large models is shifting from pre-training scale to deep alignment on the inference side. The release of FIPO not only offers a new way to assess the quality of a reasoning model's "thinking process," but also signals that the open-source community and leading domestic laboratories are building an independent technological path in their pursuit of world-class reasoning models.