Recently, the Google DeepMind team, in collaboration with the LIT AI Lab at Johannes Kepler University Linz, published a new study on the decision-making capabilities of AI language models. The researchers applied reinforcement learning fine-tuning (RLFT) to strengthen these capabilities, tackling critical weaknesses in the models' decision-making by training on their self-generated chains of reasoning.


Trained on large-scale data, today's language models excel at processing text and can even make knowledge-based decisions in interactive environments. In real-world decision-making, however, they often prove to be "all talk and no action": they can derive the correct strategy yet fail to execute it. They also tend to act greedily, favoring choices that yield higher short-term rewards, and smaller models frequently exhibit frequency bias, repeating whatever actions occur most often regardless of their payoff.

Traditional reinforcement learning methods, such as the UCB algorithm, balance exploration and exploitation to some extent, but they still cannot close the disconnect between a model's reasoning and its actions. To address this, the DeepMind team introduced reinforcement learning fine-tuning that uses the model's self-generated chains of reasoning as training signals: the system evaluates the reward associated with each reasoning step, encouraging the model to prefer action plans that are logically consistent and effective in practice.
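For reference, below is a minimal sketch of the classic UCB1 rule on a toy Bernoulli bandit; the arm count, reward probabilities, and exploration constant are illustrative assumptions, not values from the study.

```python
import math
import random

def ucb1_select(counts, sums, t, c=2.0):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(c * ln(t) / pulls), trading off
    exploitation (first term) against exploration (second term)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a]),
    )

# Toy 10-arm Bernoulli bandit with made-up reward probabilities.
probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70, 0.80]
counts, sums = [0] * len(probs), [0.0] * len(probs)
for t in range(1, 1001):
    arm = ucb1_select(counts, sums, t)
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
```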

In practical implementation, the model receives the task instruction along with its history of past actions and rewards, and generates a sequence containing both a reasoning trace and an action. Training is optimized using a Monte Carlo baseline and generalized advantage estimation, while invalid or ineffective actions incur penalties. This reward shaping enforces well-formed outputs without shutting down exploration.
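The following is a minimal sketch of two ideas in that pipeline: a penalty for invalid actions and a plain Monte Carlo baseline for advantage estimation. The penalty value, discount factor, and function names are illustrative assumptions, not the paper's exact implementation (which uses generalized advantage estimation).

```python
def shaped_reward(env_reward, action_is_valid, penalty=-5.0):
    """Reward-shaping sketch: an unparsable or illegal action receives a
    fixed penalty, nudging the model toward well-formed outputs while
    leaving valid actions free to explore. The penalty value is illustrative."""
    return env_reward if action_is_valid else penalty

def mc_advantages(rewards, gamma=0.99):
    """Compute discounted returns and subtract their mean as a simple
    Monte Carlo baseline; the study's GAE variant would blend value
    estimates instead of this plain average."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

# Example episode: one invalid action (penalized) followed by two valid steps.
rewards = [shaped_reward(0.0, False), shaped_reward(1.0, True), shaped_reward(0.0, True)]
advantages = mc_advantages(rewards)
```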

In experiments, the research team tested the approach on multi-armed bandit tasks. On the 10-arm version, the 2B-parameter model's action coverage improved by 12 percentage points. On the 20-arm version the gain was smaller, but the frequency-bias rate dropped from 70% to 35%, underscoring the method's effectiveness. In tic-tac-toe, the model's win rate against a random opponent increased fivefold, and its average return against an optimal Monte Carlo tree search agent rose from -0.95 to 0. In addition, the 27B model generated correct reasoning 87% of the time, yet without fine-tuning it executed the optimal action in only 21% of cases. Together, these results show that reinforcement learning fine-tuning narrows the gap between reasoning and execution.
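As a rough illustration of how such a reasoning-versus-execution gap can be measured, the sketch below compares the fraction of episodes with a correct rationale against the fraction where the optimal action was actually taken; the function and episode format are hypothetical, not the paper's evaluation code.

```python
def knowing_doing_gap(episodes):
    """`episodes` is a list of (rationale_correct, action_optimal) booleans.
    Returns the share of correct rationales, the share of optimal actions,
    and their difference (the 'knowing-doing' gap)."""
    knows = sum(r for r, _ in episodes) / len(episodes)
    does = sum(a for _, a in episodes) / len(episodes)
    return knows, does, knows - does

# Hypothetical tally: the model often explains the right move but does not play it.
episodes = [(True, True), (True, False), (True, False), (False, False)]
print(knowing_doing_gap(episodes))  # (0.75, 0.25, 0.5)
```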

Key Takeaways:

📊 The study uses reinforcement learning fine-tuning (RLFT) technology to enhance AI language models' decision-making capabilities.  

🧩 Training on self-generated chains of reasoning improves the model's logical reasoning and action selection.

🏆 Experiments show significant performance gains on multi-armed bandits and tic-tac-toe, narrowing the gap between reasoning and execution.