Recently, Professor Zhou Zhihua's team at Nanjing University released an important study that, for the first time, theoretically proves that an endogenous reward model can be found inside large language models, and shows how reinforcement learning (RL) can be applied to it to improve model performance.

Currently, many alignment methods rely on reinforcement learning from human feedback (RLHF), which requires a large amount of high-quality human preference data to train a reward model. Building such a dataset is not only time-consuming and labor-intensive but also expensive. Researchers have therefore begun exploring alternatives, among which reinforcement learning from AI feedback (RLAIF) has attracted attention. This approach uses reward signals generated by powerful large language models themselves, reducing reliance on human annotation.
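To make the contrast concrete, the following is a minimal sketch of the kind of reward-model training that RLHF relies on: a model with a scalar output is fit so that human-preferred responses score higher than rejected ones, using a Bradley-Terry-style objective. The architecture, sizes, and random data below are illustrative placeholders, not the setup used in the study.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: a scalar head on top of pooled token embeddings.

    In practice the backbone would be a pretrained transformer; a plain
    embedding layer with mean pooling keeps this sketch self-contained.
    """
    def __init__(self, vocab_size: int = 32000, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)    # (batch, hidden)
        return self.head(pooled).squeeze(-1)          # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: the preferred response should score above the rejected one."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One hypothetical training step on a batch of annotated preference pairs.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

chosen_ids = torch.randint(0, 32000, (8, 128))    # stand-in for tokenized preferred responses
rejected_ids = torch.randint(0, 32000, (8, 128))  # stand-in for tokenized rejected responses

loss = preference_loss(model(chosen_ids), model(rejected_ids))
loss.backward()
optimizer.step()
```

Every preference pair in this loop must come from a human annotator, which is exactly the bottleneck that motivates RLAIF and the endogenous-reward approach described below.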


Image source note: The image is AI-generated, and the image licensing service provider is Midjourney.

The research team's findings are encouraging: a strong, general reward model is effectively hidden inside every large language model trained with standard next-token prediction. The concept of "endogenous reward" proposed by the team means that an effective reward signal can be extracted from the model itself, without relying on external evaluation sources. The theory not only offers a new way to construct reward models, but also shows how a model's own endogenous reward can be used for fine-tuning, significantly improving performance.
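The article does not spell out how the endogenous reward is computed, so the snippet below is a hedged illustration only: one simple internal signal of this flavor is the average log-probability the base model itself assigns to a candidate response given the prompt, which requires no separately trained reward head. The model name "gpt2" and the scoring rule are assumptions for the sketch, not the paper's actual construction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the study's exact construction is not reproduced here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def endogenous_reward(prompt: str, response: str) -> float:
    """Score a response by the average log-probability the base model assigns
    to its tokens given the prompt -- an internal signal that needs no
    separately trained reward model."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits                     # (1, seq_len, vocab)

    # Log-probability of each token, predicted from the preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the tokens belonging to the response.
    n_prompt = prompt_ids.shape[1]
    response_log_probs = token_log_probs[:, n_prompt - 1:]
    return response_log_probs.mean().item()

print(endogenous_reward("Q: What is 2 + 2?\nA:", " 4"))
```

A score like this could then stand in for the learned reward model inside an RL fine-tuning loop, which is the general direction the study explores.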

The research results show that fine-tuning with endogenous rewards can outperform traditional baseline models, with the advantage most pronounced on complex tasks. The team conducted extensive experiments, and the results show that the new method surpasses existing reward models and performs well across a wide range of tests.

The release of this study undoubtedly opens up new possibilities for the development and application of future large language models. Researchers hope that this strategy of utilizing internal reward mechanisms can reduce development costs, improve efficiency, and promote the broader application of artificial intelligence.