Recently, the Google DeepMind team, in collaboration with the LIT AI Lab at Johannes Kepler University Linz, published a new study on the decision-making capabilities of AI language models. The researchers applied reinforcement learning fine-tuning (RLFT) to strengthen these capabilities, tackling critical weaknesses in the models' decision-making by training on their self-generated chains of reasoning.


Trained on large-scale data, today's language models excel at processing text and can even make knowledge-based decisions in interactive environments. In real-world decision-making, however, they often prove to be "all talk and no action": they can derive the correct strategy yet fail to execute it. They also tend to act greedily, favoring choices that yield higher short-term rewards, and smaller models frequently exhibit frequency bias, repeating whatever actions occur most often regardless of their payoff.

Traditional reinforcement learning methods, such as the UCB algorithm, balance exploration and exploitation to some extent, but they still cannot close the disconnect between a model's reasoning and its actions. To address this, the DeepMind team introduced reinforcement learning fine-tuning that uses the model's self-generated chains of reasoning as training signals: the system evaluates the reward associated with each reasoning step, encouraging the model to prefer action plans that are logically consistent and effective in practice.
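For reference, below is a minimal sketch of the classic UCB1 rule on a toy Bernoulli bandit; the arm count, reward probabilities, and exploration constant are illustrative assumptions, not values from the study.

```python
import math
import random

def ucb1_select(counts, sums, t, c=2.0):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(c * ln(t) / pulls), trading off
    exploitation (first term) against exploration (second term)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a]),
    )

# Toy 10-arm Bernoulli bandit with made-up reward probabilities.
probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70, 0.80]
counts, sums = [0] * len(probs), [0.0] * len(probs)
for t in range(1, 1001):
    arm = ucb1_select(counts, sums, t)
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
```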

In practical implementation, the model receives the task instruction along with its history of past actions and rewards, and generates a sequence containing both a reasoning trace and an action. Training is optimized using a Monte Carlo baseline and generalized advantage estimation, while invalid or ineffective actions incur penalties. This reward shaping enforces well-formed outputs without shutting down exploration.
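The following is a minimal sketch of two ideas in that pipeline: a penalty for invalid actions and a plain Monte Carlo baseline for advantage estimation. The penalty value, discount factor, and function names are illustrative assumptions, not the paper's exact implementation (which uses generalized advantage estimation).

```python
def shaped_reward(env_reward, action_is_valid, penalty=-5.0):
    """Reward-shaping sketch: an unparsable or illegal action receives a
    fixed penalty, nudging the model toward well-formed outputs while
    leaving valid actions free to explore. The penalty value is illustrative."""
    return env_reward if action_is_valid else penalty

def mc_advantages(rewards, gamma=0.99):
    """Compute discounted returns and subtract their mean as a simple
    Monte Carlo baseline; the study's GAE variant would blend value
    estimates instead of this plain average."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

# Example episode: one invalid action (penalized) followed by two valid steps.
rewards = [shaped_reward(0.0, False), shaped_reward(1.0, True), shaped_reward(0.0, True)]
advantages = mc_advantages(rewards)
```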

In experiments, the research team tested the approach on multi-armed bandit tasks. On the 10-arm version, the 2B-parameter model's action coverage improved by 12 percentage points. On the 20-arm version the gain was smaller, but the frequency-bias rate dropped from 70% to 35%, underscoring the method's effectiveness. In tic-tac-toe, the model's win rate against a random opponent increased fivefold, and its average return against an optimal Monte Carlo tree search agent rose from -0.95 to 0. In addition, the 27B model generated correct reasoning 87% of the time, yet without fine-tuning it executed the optimal action in only 21% of cases. Together, these results show that reinforcement learning fine-tuning narrows the gap between reasoning and execution.
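As a rough illustration of how such a reasoning-versus-execution gap can be measured, the sketch below compares the fraction of episodes with a correct rationale against the fraction where the optimal action was actually taken; the function and episode format are hypothetical, not the paper's evaluation code.

```python
def knowing_doing_gap(episodes):
    """`episodes` is a list of (rationale_correct, action_optimal) booleans.
    Returns the share of correct rationales, the share of optimal actions,
    and their difference (the 'knowing-doing' gap)."""
    knows = sum(r for r, _ in episodes) / len(episodes)
    does = sum(a for _, a in episodes) / len(episodes)
    return knows, does, knows - does

# Hypothetical tally: the model often explains the right move but does not play it.
episodes = [(True, True), (True, False), (True, False), (False, False)]
print(knowing_doing_gap(episodes))  # (0.75, 0.25, 0.5)
```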

Key Takeaways:

📊 The study uses reinforcement learning fine-tuning (RLFT) technology to enhance AI language models' decision-making capabilities.  

🧩 Training on self-generated chains of reasoning improves the model's logical reasoning and action selection.

🏆 Experiments show significant performance gains on multi-armed bandits and tic-tac-toe, narrowing the gap between reasoning and execution.