Large language models (LLMs) have made significant progress on complex reasoning tasks by combining task prompts with large-scale reinforcement learning (RL). DeepSeek-R1-Zero, for example, applies RL directly to a base model and demonstrates strong reasoning capabilities. However, this success is difficult to replicate across other base model families, especially the Llama series. This raises a core question: what factors cause different base models to behave so inconsistently during reinforcement learning?

Limits of Scaling Reinforcement Learning on Llama Models

Models such as OpenAI's o1 and o3 and DeepSeek's R1 have achieved breakthroughs on competition-level math problems through large-scale reinforcement learning, spurring exploration of RL for smaller models below 100 billion parameters. However, these advances are mostly limited to the Qwen model family and are difficult to reproduce on families such as Llama. The lack of transparency around pre-training pipelines makes it hard to understand how pre-training shapes RL scalability. Some unconventional studies have found that one-shot prompting can improve Qwen's reasoning ability yet has little effect on Llama. And although projects such as OpenWebMath and MathPile aim to compile high-quality mathematical pre-training corpora, their scale remains limited to under 100 billion tokens.

Exploring a Stable-Decay Strategy for Mid-Training

Researchers from Shanghai Jiao Tong University conducted an in-depth study of how mid-training strategies shape reinforcement learning dynamics, using Qwen and Llama as their research subjects, and drew the following insights:

First, high-quality math corpora such as MegaMath-Web-Pro boost the performance of both the base model and subsequent reinforcement learning. Second, adding question-answer-style data, especially examples with long chain-of-thought (CoT) reasoning, further improves RL effectiveness. Third, long CoT also introduces verbosity and instability into RL training. Finally, scaling up mid-training improves downstream reinforcement learning performance.

The researchers proposed a two-phase mid-training strategy called "Stable-Decay": first train the base model on 200 billion tokens, then train three CoT-focused branches on 20 billion tokens. This strategy ultimately produced the OctoThinker family of models, which show strong compatibility with reinforcement learning.
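
To make the token budgets concrete, here is a minimal Python sketch of that two-phase schedule. The dictionary layout, the branch names, the constant-learning-rate assumption for the stable phase, and the per-branch 20-billion-token split are illustrative assumptions rather than details confirmed above.

```python
# Minimal sketch of the two-phase "Stable-Decay" mid-training schedule described above.
# Branch names, learning-rate schedules, and the per-branch token split are assumptions.

STABLE_PHASE = {
    "token_budget": 200_000_000_000,  # ~200B tokens of high-quality math corpus
    "lr_schedule": "constant",        # assumed: learning rate held steady in the stable phase
}

# Decay phase: three CoT-focused branches (assumed ~20B tokens of budget each).
DECAY_BRANCHES = {
    "long":  {"token_budget": 20_000_000_000, "data_mix": "long-CoT question-answer data"},
    "short": {"token_budget": 20_000_000_000, "data_mix": "short-CoT question-answer data"},
    "mixed": {"token_budget": 20_000_000_000, "data_mix": "mixed long/short CoT data"},
}


def total_mid_training_tokens() -> int:
    """Total token budget if every decay branch is trained from the stable checkpoint."""
    return STABLE_PHASE["token_budget"] + sum(
        branch["token_budget"] for branch in DECAY_BRANCHES.values()
    )


if __name__ == "__main__":
    print(f"{total_mid_training_tokens() / 1e9:.0f}B tokens across stable and decay phases")
```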

RL Configuration and Benchmark Evaluation

The researchers drew reinforcement learning (RL) training prompts from the MATH8K dataset, with a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64. Experiments were conducted on the Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, base language models were prompted few-shot, while the RL-tuned models were prompted zero-shot on benchmarks such as GSM8K, MATH500, OlympiadBench, and AMC23.
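
For reference, the snippet below gathers these reported hyperparameters into a single illustrative configuration dictionary; the field names are hypothetical and do not correspond to any particular RL framework's API.

```python
# Illustrative RL setup mirroring the hyperparameters reported above.
# Field names are hypothetical, not tied to a specific training framework.

rl_config = {
    "prompt_dataset": "MATH8K",          # source of RL training prompts
    "global_batch_size": 128,            # global training batch size
    "rollouts_per_query": 16,            # sampled responses per prompt
    "ppo_mini_batch_size": 64,           # PPO mini-batch size
    "base_models": ["Llama-3.2-3B-Base", "Qwen2.5-3B-Base"],
    "eval_benchmarks": ["GSM8K", "MATH500", "OlympiadBench", "AMC23"],
    "eval_prompting": {
        "base_models": "few-shot",       # base LMs are evaluated with few-shot prompts
        "rl_tuned_models": "zero-shot",  # RL-tuned models are evaluated zero-shot
    },
}

if __name__ == "__main__":
    for key, value in rl_config.items():
        print(f"{key}: {value}")
```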

During RL training, the Qwen model's response length grew steadily while remaining within a reasonable range; the Llama model, by contrast, behaved abnormally, with its average response length ballooning to 4,096 tokens. Evaluation results further show that the RL-tuned Qwen2.5-3B improved across all benchmarks, whereas the gains for Llama-3.2-3B were minimal.

OctoThinker Outperforms Llama in RL Compatibility

Across 13 mathematical benchmarks, every OctoThinker branch outperformed the original Llama base model by 10%-20%, with consistent improvements over the stable-stage models. The OctoThinker-Zero series exhibited diverse thinking behaviors during RL scaling, with the OctoThinker-Long variant performing especially well. In a comparison of three 3B-scale base models under RL training, OctoThinker-Long-3B outperformed the original Llama-3.2-3B and approached the performance of Qwen2.5-3B, a model known for its strong reasoning and extensive pre-training. The mixed and short branches performed slightly worse, particularly on the more challenging benchmarks.

Conclusion and Future Work: Toward RL-Ready Base Models

This study examined in depth why base models such as Llama and Qwen behave so differently during reinforcement learning for reasoning, and it underscored the importance of mid-training for RL scalability. The two-phase mid-training strategy successfully turned Llama into a base model far better suited to reinforcement learning, ultimately yielding the OctoThinker models.