Qwen, Alibaba Group's large-model team, has announced a new series of preference models called WorldPM, including WorldPM-72B and fine-tuned derivatives such as WorldPM-72B-HelpSteer2, WorldPM-72B-RLHFLow, and WorldPM-72B-UltraFeedback. The announcement has drawn significant attention from the global AI developer community and is being described as a major breakthrough in the field of preference modeling.


WorldPM: A New Exploration in Preference Modeling

WorldPM (World Preference Modeling) is Qwen's latest work in preference modeling. According to the official introduction, the model was trained on more than 15 million preference data points and validates that preference modeling follows scaling laws similar to those of language modeling. This finding suggests that as data and model scale grow, preference models can learn a unified preference representation, significantly improving performance in supervised learning.
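For readers unfamiliar with the term, a scaling law of this kind is usually written as a power-law relation between test loss and training scale. The form below is illustrative only; the symbols, constants, and exponents are not figures from the WorldPM report:

```latex
% Illustrative power-law form: D = number of preference pairs, N = parameter count;
% D_c, N_c, \alpha_D, \alpha_N are fitted constants (not values from the WorldPM report).
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

Read this way, the claim is that preference-model test loss keeps falling predictably as preference data D and model size N increase, just as it does for next-token prediction in language models.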

The WorldPM-72B series is built at a 72-billion-parameter scale and is designed specifically for evaluating and optimizing the outputs of other models. According to the official statement, fine-tuning from WorldPM yields significantly better results than training a preference model from scratch, particularly in scenarios that require capturing human preferences. This makes it well suited for reinforcement learning from human feedback and supervised fine-tuning, offering developers an efficient path to model optimization.
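To make the idea of "fine-tuning from a preference model" concrete, the sketch below shows one step of the standard pairwise (Bradley-Terry) objective that reward-model fine-tuning typically uses. The tiny linear scorer is a deliberate stand-in so the snippet runs anywhere; in practice the scorer would be initialized from a WorldPM checkpoint, and none of the names or hyperparameters below come from Qwen's published recipe.

```python
# Minimal illustration of the pairwise (Bradley-Terry) preference objective:
# push the score of the preferred response above the score of the rejected one.
# The linear "scorer" is a stand-in for a real preference model's scalar head.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scorer = torch.nn.Linear(16, 1)            # stand-in for a preference model's reward head
optimizer = torch.optim.AdamW(scorer.parameters(), lr=1e-4)

# Dummy features for (chosen, rejected) response pairs; real inputs would be
# prompt/response text encoded by the model backbone.
chosen_feats = torch.randn(4, 16)
rejected_feats = torch.randn(4, 16)

reward_chosen = scorer(chosen_feats)        # shape: [batch, 1]
reward_rejected = scorer(rejected_feats)

# Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"pairwise preference loss: {loss.item():.4f}")
```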

Open Source Strategy: Empowering Global Developers

Qwen continues to uphold its open-source approach: all WorldPM models are released under the Apache 2.0 license and are now available on Hugging Face for free download and use by developers worldwide. This openness not only lowers the technical barrier but also further consolidates Qwen's position in the global open-source AI ecosystem. Developer feedback on X describes the WorldPM release as "a new milestone in the open-source model ecosystem," sparking lively discussion, particularly around supervised learning and preference evaluation.

It is worth noting that WorldPM is not a general-purpose conversational model; rather, it focuses on providing preference scoring and optimization guidance for other models. For instance, developers can use WorldPM-72B to score candidate responses from a generative model (see the sketch below), thereby improving performance on specific tasks. This specialized positioning gives it a critical role in the AI development pipeline.
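As an illustration of that scoring workflow, the sketch below rates two candidate answers to the same prompt and keeps the higher-scoring one. It assumes the checkpoint can be loaded as a transformers sequence-classification (reward) model and that its tokenizer ships a chat template; the exact loading and scoring interface may differ, so the model card linked at the end of this article is the authoritative reference.

```python
# Hypothetical sketch: scoring candidate responses with a WorldPM checkpoint.
# Assumes the model exposes a single-logit reward head via transformers;
# consult the official model card for the exact interface.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Qwen/WorldPM-72B"  # or a fine-tuned variant such as WorldPM-72B-HelpSteer2
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain why the sky is blue."
candidates = [
    "Shorter wavelengths scatter more strongly in the atmosphere (Rayleigh scattering).",
    "The sky is blue because the ocean reflects onto it.",
]

scores = []
for answer in candidates:
    # Format the prompt/response pair with the tokenizer's chat template before scoring.
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}, {"role": "assistant", "content": answer}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        scores.append(model(**inputs).logits[0].item())

best = candidates[scores.index(max(scores))]
print(scores, best)
```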

Technical Highlights: Balancing Scale and Efficiency

During development, the Qwen team collected diverse preference data from public forums, covering multiple user communities to ensure adaptability across different cultural and linguistic backgrounds. With 15 million training samples and model architectures ranging from 1.5B to 72B parameters, WorldPM demonstrates strong generalization. The official technical report notes that WorldPM performs well in adversarial evaluations, with test loss declining as a power law, indicating that the model can effectively identify answers containing deliberate errors as well as irrelevant or incomplete responses.

In addition, WorldPM's handling of style bias is noteworthy. As model scale grows, WorldPM gradually becomes style-neutral, overcoming bias issues common in subjective evaluations. This makes it particularly strong in objective domains, with clear advantages in tasks that require precise reasoning, such as coding and mathematics.

Hugging Face: https://huggingface.co/Qwen/WorldPM-72B