Tencent recently released WeChat-YATT (Yet Another Transformer Trainer), a large model training library built on Megatron-Core and SGLang/vLLM and internally code-named gCore. The library focuses on reinforcement learning and multi-modal model training, aiming to give developers a scalable, simple, efficient, and reliable large model training solution.
Through customized parallel computing strategies, WeChat-YATT handles complex scenarios such as large-scale models, long sequence inputs, and large datasets, resolving key pain points in several practical business scenarios within WeChat and markedly improving training efficiency. The tool offers researchers and developers a flexible, scalable technical solution and is expected to drive innovation in the fields of multi-modal and reinforcement learning.
WeChat-YATT focuses on solving two core technical bottlenecks encountered in the distributed training of large models.
The first is the scalability bottleneck in multi-modal scenarios. As the volume of multi-modal data such as images and videos keeps growing, traditional architectures that route all data management through a single controller tend to become communication and memory bottlenecks, limiting system throughput and sometimes causing training runs to fail unexpectedly. WeChat-YATT addresses this with a parallel-controller management mechanism that distributes the load across controllers, significantly improving scalability and stability and better handling multi-modal, data-heavy workloads; the sketch below illustrates the idea.
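The following is a minimal sketch of the parallel-controller idea, not WeChat-YATT's actual API: instead of funneling every multimodal sample through one controller process, each controller owns only a shard of the data, so per-process memory and communication stay bounded. All names here (`controller_worker`, the sample strings) are hypothetical.

```python
# Sketch of parallel controllers: N processes, each managing a data shard.
import multiprocessing as mp

def controller_worker(shard_id: int, num_shards: int, samples: list, out: mp.Queue):
    """Each controller manages only its own shard of the dataset."""
    shard = samples[shard_id::num_shards]                 # stride-partition the data
    batches = [shard[i:i + 2] for i in range(0, len(shard), 2)]
    out.put((shard_id, len(batches)))                     # report work done

if __name__ == "__main__":
    # Hypothetical multimodal samples; real ones would be image/video tensors.
    samples = [f"sample_{i}" for i in range(32)]
    num_shards = 4
    out = mp.Queue()
    procs = [mp.Process(target=controller_worker, args=(s, num_shards, samples, out))
             for s in range(num_shards)]
    for p in procs:
        p.start()
    results = [out.get() for _ in range(num_shards)]      # drain results before join
    for p in procs:
        p.join()
    for shard_id, n in sorted(results):
        print(f"controller {shard_id} dispatched {n} batches")
```

With a single controller, the full 32-sample set (and in practice, its image and video payloads) would sit in one process; here each controller holds a quarter of it.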
The second is the efficiency gap under dynamic sampling and generative reward calculation. In training workflows that require frequent dynamic sampling or generative reward computation, repeated model switching and "long-tail" tasks incur substantial extra overhead, leaving GPUs underutilized and dragging down overall training efficiency. WeChat-YATT mitigates the cost of model switching and the impact of long-tail tasks through partial coexistence strategies and asynchronous interaction mechanisms, sustaining high throughput and resource utilization during training and better supporting the rapid iteration of large-scale RLHF tasks.
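To see why long-tail tasks hurt, consider this toy back-of-the-envelope illustration with assumed (not measured) rollout times: under synchronous scheduling, the whole batch waits for the slowest sample before reward scoring can begin.

```python
# Toy illustration of the long-tail problem with assumed numbers.
rollout_times = [3, 4, 3, 5, 4, 20]   # per-sample generation time in seconds

# Synchronous: wall time is gated on the slowest sample in the batch.
sync_wall_time = max(rollout_times)                          # 20 s
busy = sum(rollout_times)                                    # 39 GPU-seconds of work
utilization = busy / (len(rollout_times) * sync_wall_time)
print(f"wall time {sync_wall_time}s, utilization {utilization:.1%}")  # ~32.5%

# An asynchronous hand-off lets finished rollouts flow straight into reward
# scoring, so only the tail sample itself is delayed, not the entire batch.
```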
Depending on business requirements, WeChat-YATT supports two resource placement modes, full coexistence and partial coexistence, to maximize cluster resource utilization.
In the full coexistence mode, a serial scheduling mechanism executes Actor Rollouts, GenRM (Generative Reward Model), and Train in sequence. After finishing its task, each role releases its computing resources, and the system then loads the model required by the next stage. This strategy suits most conventional training scenarios. Notably, during each phase the active component has exclusive use of all GPU resources, which greatly reduces idle-resource "bubble" time and significantly improves overall training throughput; a minimal sketch of this schedule follows.
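The sketch below uses hypothetical names (`Phase`, `train_step`), not WeChat-YATT's real API, to show the shape of the full-coexistence schedule: each phase loads its model, uses the whole GPU pool exclusively, then offloads before the next phase starts.

```python
# Sketch of the full-coexistence serial schedule.
class Phase:
    def __init__(self, name: str):
        self.name = name
    def load(self):
        print(f"[{self.name}] loading weights onto all GPUs")
    def run(self, data):
        print(f"[{self.name}] running with exclusive GPU access")
        return data
    def offload(self):
        print(f"[{self.name}] releasing GPU memory")

def train_step(batch):
    # Actor Rollouts -> GenRM scoring -> Train, strictly in sequence.
    for phase in (Phase("actor_rollout"), Phase("genrm"), Phase("train")):
        phase.load()
        batch = phase.run(batch)
        phase.offload()   # free all GPUs so the next phase can occupy them

train_step(batch=["prompt_0", "prompt_1"])
```

The serial hand-off trades model load/offload time for the guarantee that no phase ever competes for GPUs with another.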
In the partial coexistence mode, Actor Rollouts and GenRM are deployed independently and interact through asynchronous mechanisms. During the Actor training phase, the Actor occupies all GPU resources; once training finishes, it releases them, and the Actor Rollouts and GenRM components run together during the generation phase, with the system dynamically evaluating load to allocate and balance resources between them. Once the Rollouts are generated, these components release their resources and the Actor is reloaded onto the GPUs for the next training cycle. Partial coexistence is particularly suitable for complex tasks with frequent interaction and dynamic sampling between Rollouts and GenRM, as in the sketch below.
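Here is a minimal asyncio sketch of that asynchronous interaction, with an assumed structure rather than the library's implementation: rollout workers and a separately deployed GenRM exchange work through a queue, so finished rollouts are scored immediately instead of waiting for the whole batch, which hides long-tail latency.

```python
# Sketch of asynchronous rollout-to-GenRM hand-off via a shared queue.
import asyncio
import random

async def rollout_worker(wid: int, queue: asyncio.Queue):
    for i in range(3):
        await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated generation
        await queue.put((wid, f"rollout_{wid}_{i}"))     # hand off as soon as done
    await queue.put((wid, None))                         # signal this worker is done

async def genrm_scorer(queue: asyncio.Queue, num_workers: int):
    done = 0
    while done < num_workers:
        wid, rollout = await queue.get()
        if rollout is None:
            done += 1
        else:
            print(f"GenRM scored {rollout} from worker {wid}")

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    workers = [rollout_worker(w, queue) for w in range(2)]
    await asyncio.gather(genrm_scorer(queue, num_workers=2), *workers)

asyncio.run(main())
```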
WeChat-YATT also has several technical strengths. On memory utilization, the parallel controller architecture reduces per-node memory consumption, making the system better suited to large model training in multi-modal scenarios and improving scalability and stability. For GenRM support, different resource placement strategies are implemented for generative reward model scenarios, letting users choose the optimal training setup for their specific workload.
The intelligent checkpoint strategy is another highlight. WeChat-YATT supports asynchronous checkpoint saving and, in line with WeChat's business characteristics, automatically saves checkpoints at appropriate points in the schedule, further safeguarding training safety and availability. The system also balances load across data parallel groups during training, reducing idle time and significantly improving overall throughput. A minimal sketch of asynchronous checkpointing follows.
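This sketch shows the general pattern behind asynchronous checkpointing, not the library's implementation: snapshot the state on the training thread, then write it to disk in the background so the next training step is not blocked on I/O. The function name `save_async` and the state layout are hypothetical.

```python
# Sketch of asynchronous checkpoint saving via a background writer thread.
import copy
import pickle
import threading

def save_async(state: dict, path: str) -> threading.Thread:
    snapshot = copy.deepcopy(state)          # CPU-side copy; real training would
                                             # first move GPU tensors to host memory
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)         # slow disk I/O happens off the hot path
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t                                 # training continues immediately

state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
writer = save_async(state, "ckpt_step100.pkl")
# ... the next training step runs here while the checkpoint is written ...
writer.join()                                # only needed before shutdown
```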
The release of this training library marks important progress in Tencent's large model technology infrastructure and offers the industry an effective solution for complex multi-modal training scenarios.