Recently, the Tencent Robotics X Lab, in collaboration with the Tencent Hunyuan team, officially launched HY-Embodied-0.5, a foundation model designed specifically for embodied intelligence. The initiative targets a well-known industry gap: general vision-language models (VLMs) lack fine-grained 3D spatial perception and physical-interaction capability, which makes them hard to deploy in the physical world. It marks a substantial extension of the large model's cognitive chain into robot control.
This model series is not a simple fine-tune of a general-purpose base model but a full rebuild, from architecture through training paradigm. The release comprises two main models: MoT-2B (4B total parameters, 2B activated), aimed at real-time response on edge devices, and MoE-A32B (407B total parameters, 32B activated), aimed at maximum reasoning performance.
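The gap between "total" and "activated" parameters comes from sparse mixture-of-experts routing: every token passes through only a few of the many expert sub-networks. The toy layer below illustrates the idea with NumPy; all sizes (`d_model`, `n_experts`, `top_k`) are made up for illustration, and the real HY-Embodied-0.5 routing scheme is not public.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts FFN layer. There are n_experts expert MLPs in
# total, but each token is routed to only top_k of them, so the number
# of parameters *activated* per token is far below the total.
# (Illustrative sketch only; not Tencent's implementation.)
d_model, d_ff, n_experts, top_k = 64, 256, 16, 2

experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route a single token vector x through its top_k experts."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]           # indices of top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                       # softmax over chosen experts
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)  # ReLU MLP expert
    return out

total_params = n_experts * 2 * d_model * d_ff   # all experts
active_params = top_k * 2 * d_model * d_ff      # experts actually used per token
print(total_params, active_params)              # 524288 vs 65536

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)                                  # (64,)
```

This is why a model can carry 407B parameters yet cost roughly as much per token as a dense 32B model: capacity scales with the expert count, while per-token compute scales only with `top_k`.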
Technically, the team has pioneered a Mixture-of-Transformers (MoT) architecture in which the visual and language modalities do not share parameters, combined with the native-resolution visual encoder HY-ViT 2.0 and a visual latent-token mechanism, effectively avoiding catastrophic forgetting in small models during multimodal training. On the training side, the model draws on over 100 million high-quality embodied-specific samples, combined with multiple post-training strategies such as rejection-sampling fine-tuning, reinforcement learning, and online distillation, to drive the model's chain of thought to evolve autonomously.
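The "non-shared parameters between modalities" idea can be sketched as follows: attention still mixes information across the whole token sequence, but each token's feed-forward weights are selected by its modality, so training signal from one modality perturbs the other modality's weights less. This is a minimal single-layer NumPy sketch of that routing pattern, not Tencent's actual MoT code; all dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 32, 64

# One FFN weight pair per modality (parameters are NOT shared across
# modalities), plus a single shared self-attention over all tokens.
ffn = {
    m: (rng.standard_normal((d_model, d_ff)) * 0.02,
        rng.standard_normal((d_ff, d_model)) * 0.02)
    for m in ("vision", "text")
}
wq, wk, wv = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))

def mot_layer(tokens, modalities):
    """tokens: (seq, d_model); modalities: per-token 'vision'/'text' tags."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d_model)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)       # shared global attention
    h = tokens + attn @ v
    out = np.empty_like(h)
    for i, m in enumerate(modalities):             # modality-specific FFN
        w1, w2 = ffn[m]
        out[i] = h[i] + np.maximum(h[i] @ w1, 0.0) @ w2
    return out

seq = rng.standard_normal((5, d_model))
mods = ["vision", "vision", "vision", "text", "text"]
print(mot_layer(seq, mods).shape)                  # (5, 32)
```

Keeping modality-specific weight banks is one plausible reason the small MoT-2B model can absorb visual training without overwriting its language ability.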
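Rejection-sampling fine-tuning, one of the post-training strategies named above, follows a simple loop: sample several candidate outputs per prompt, keep only those a verifier accepts, and fine-tune on the survivors. The miniature below shows the selection step only; the sampler, verifier, and prompts are all hypothetical stand-ins, not Tencent's pipeline.

```python
import random

random.seed(0)

def sample_candidates(prompt, n=8):
    # Stand-in for model sampling: noisy guesses at "what is a + b?".
    a, b = prompt
    return [a + b + random.choice((-1, 0, 0, 0, 1)) for _ in range(n)]

def verifier(prompt, answer):
    # Ground-truth check; in practice this could be a reward model,
    # a simulator, or task-success detection.
    a, b = prompt
    return answer == a + b

prompts = [(2, 3), (10, 7), (1, 1)]
sft_set = []
for p in prompts:
    accepted = [c for c in sample_candidates(p) if verifier(p, c)]
    sft_set.extend((p, c) for c in accepted)   # keep only verified samples

# Every retained (prompt, answer) pair is verified-correct and can be
# used as supervised fine-tuning data for the next training round.
print(len(sft_set))
```

Iterating this loop lets the model's reasoning traces improve without hand-written demonstrations, which matches the article's description of the chain of thought "evolving autonomously."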
Performance validation shows that MoT-2B achieved the best results in 16 of 22 authoritative evaluations covering perception, reasoning, and planning, surpassing competitors of similar parameter scale such as Qwen3-VL-4B and RoboBrain2.5; the flagship MoE-A32B matches international benchmarks such as Gemini 3.0 Pro in overall performance.
In practical testing, robots equipped with this base model outperformed mainstream baseline models on tasks such as packing and stacking. The advance provides a high-performance foundation for embodied intelligence to move from virtual simulation to physical operation.


