Recently, ByteDance's Seed team officially launched GR-3, a new vision-language-action (VLA) model. GR-3 demonstrates breakthrough capabilities in robotic manipulation: it not only understands abstract language instructions but also handles deformable objects with precision, and it can quickly adapt to new tasks and recognize novel objects. The achievement is regarded as an important step toward building a general-purpose "brain" for robots.

Traditional robotic manipulation models usually rely on large amounts of robot trajectory data for training, which makes transferring to new tasks costly and inefficient. GR-3, by contrast, can be fine-tuned efficiently with only a small amount of human data. Its core breakthrough is a Mixture-of-Transformers (MoT) network structure that integrates a vision-language module and an action-generation module into a single 4-billion-parameter end-to-end model. The action-generation module produces actions with a Diffusion Transformer (DiT) combined with flow matching and adopts RMSNorm normalization, which significantly improves dynamic instruction following. This structure lets GR-3 plan continuous actions directly from camera images and language instructions; for example, after hearing "clean the table," it can automatically complete the whole sequence of "pack up the leftovers → clear the dishes → throw away the garbage."
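To make this architecture concrete, below is a minimal PyTorch sketch (not GR-3's actual code) of a flow-matching DiT action head with RMSNorm that cross-attends to vision-language tokens. The module names, widths, depth, and the linear-interpolation flow-matching objective are illustrative assumptions based on the description above.

```python
# Minimal sketch of a flow-matching DiT action head conditioned on vision-language tokens.
# All names, sizes, and the objective are illustrative assumptions, not GR-3's implementation.
# nn.RMSNorm requires PyTorch >= 2.4.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiTBlock(nn.Module):
    """Transformer block with RMSNorm that cross-attends to vision-language tokens."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.RMSNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]
        return x + self.mlp(self.norm3(x))

class FlowMatchingActionHead(nn.Module):
    """Predicts the velocity field that transports noise into an action chunk."""
    def __init__(self, action_dim: int, horizon: int, dim: int = 512, n_heads: int = 8, depth: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(action_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(DiTBlock(dim, n_heads) for _ in range(depth))
        self.out_proj = nn.Linear(dim, action_dim)

    def forward(self, noisy_actions, t, cond_tokens):
        # noisy_actions: (B, horizon, action_dim); t: (B, 1); cond_tokens: (B, L, dim)
        x = self.in_proj(noisy_actions) + self.time_mlp(t).unsqueeze(1)
        for blk in self.blocks:
            x = blk(x, cond_tokens)
        return self.out_proj(x)  # predicted velocity

def flow_matching_loss(head, actions, cond_tokens):
    """Rectified-flow style objective: regress the velocity (actions - noise) at a random time t."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1)
    x_t = (1 - t).unsqueeze(-1) * noise + t.unsqueeze(-1) * actions  # linear interpolation path
    pred = head(x_t, t, cond_tokens)
    return F.mse_loss(pred, actions - noise)

# Toy usage: 256 conditioning tokens of width 512, action chunks of 16 steps x 14 joints.
head = FlowMatchingActionHead(action_dim=14, horizon=16)
loss = flow_matching_loss(head, torch.randn(2, 16, 14), torch.randn(2, 256, 512))
```

In this sketch the vision-language features are assumed to already be projected to the DiT width; in practice the conditioning interface between the two modules would be defined by the MoT design itself.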


In terms of training data, GR-3 moves beyond a single data source and gains a significant performance boost from a three-part training mixture: first, high-quality real-robot data collected via teleoperation, which secures basic manipulation skills; second, human trajectory data collected with user-authorized VR devices, which nearly doubles the efficiency of learning new tasks (about 450 trajectories per hour versus 250 per hour for traditional teleoperation); and third, publicly available image-text data, which helps the model understand abstract concepts such as "big," "small," and "left/right" and recognize features of unseen objects. This diverse data-fusion strategy lets GR-3 achieve a 17.8% higher success rate on object-grasping tasks than baseline models, and with only about 10 human trajectories it can raise the success rate on novel objects to over 80%.
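As a rough illustration of how such a three-source mixture might be co-trained, here is a Python sketch of a weighted batch sampler. The dataset names and sampling ratios are hypothetical placeholders, not the mixture actually used for GR-3.

```python
# Illustrative co-training sampler over three data sources with fixed sampling weights.
# Source names and ratios are assumptions for illustration only.
import random
from typing import Iterator

def mixed_batches(sources: dict, weights: dict, batch_size: int, seed: int = 0) -> Iterator[list]:
    """Yield batches whose samples are drawn from each source in proportion to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    while True:
        batch = []
        for _ in range(batch_size):
            name = rng.choices(names, weights=probs, k=1)[0]
            batch.append((name, rng.choice(sources[name])))
        yield batch

# Example: teleoperated robot trajectories, VR human trajectories, and web image-text pairs.
datasets = {
    "robot_teleop": ["traj_r0", "traj_r1"],
    "vr_human": ["traj_h0", "traj_h1"],
    "image_text": ["pair_0", "pair_1"],
}
ratios = {"robot_teleop": 0.5, "vr_human": 0.2, "image_text": 0.3}  # hypothetical ratios
first_batch = next(mixed_batches(datasets, ratios, batch_size=8))
```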

To verify the model's performance, the team ran systematic tests on three tasks: general pick-and-place, long-horizon table cleaning, and deformable clothing manipulation. On general pick-and-place, GR-3 achieved an instruction-following rate of 98.1% and a success rate of 96.3% in trained scenarios; in new environments (such as a bedroom desk or a supermarket counter) its performance showed almost no decline, and it could accurately handle instructions involving spatial relationships, such as "put the cola next to the Sprite onto the plate." On long-horizon table cleaning, GR-3 autonomously completed multi-step operations with an average completion rate above 95%, strictly followed step-by-step instructions, and correctly refrained from acting when given invalid instructions. In the clothing test, GR-3 reached an 86.7% completion rate on hanging clothes, and it still completed the task reliably even with unfamiliar garment styles or messily arranged items.

Hardware co-design is another highlight of GR-3. The team built a general-purpose dual-arm mobile robot, ByteMini, as its platform: 22 degrees of freedom across the whole body, a distinctive spherical wrist design, and a whole-body motion control (WBC) system together enable precise manipulation and smooth trajectories in confined spaces. For example, when grasping a paper cup it can automatically modulate its force to avoid crushing the cup, and its arms can rotate as flexibly as a human wrist. A multi-camera layout (two wrist cameras for fine detail and one head camera for a global view) gives it full perception of its surroundings.
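For illustration only, the observation such a setup might expose to the model could look like the following sketch; the field names, image resolutions, and interface are assumptions, not ByteMini's actual API.

```python
# Hypothetical observation layout for a dual-arm mobile robot with two wrist cameras
# and one head camera; names and shapes are illustrative assumptions.
import numpy as np

def make_observation() -> dict:
    return {
        "head_rgb": np.zeros((480, 640, 3), dtype=np.uint8),        # global view of the scene
        "left_wrist_rgb": np.zeros((480, 640, 3), dtype=np.uint8),  # close-up for fine manipulation
        "right_wrist_rgb": np.zeros((480, 640, 3), dtype=np.uint8),
        "joint_positions": np.zeros(22, dtype=np.float32),          # 22 whole-body degrees of freedom
        "instruction": "clean the table",
    }
```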

Although GR-3 already surpasses leading publicly testable VLA models such as π0, the team plans to further improve its generalization by scaling up the model and the training data (for example, more vision-language data covering diverse objects and more robot data on complex tasks). They also intend to introduce reinforcement learning (RL) to overcome the limitations of imitation learning, so that the robot can adjust its strategy on its own when something unexpected happens, such as an object slipping, making it more robust to disturbances.

ByteDance's Seed team stated that GR-3 was developed to address three major bottlenecks of traditional robotics: not understanding abstract instructions, not adapting to environmental changes, and not performing long-horizon tasks well. Going forward, the team will continue to explore the deep integration of large models and robotics, bringing a general-purpose robot "brain" into daily life as an intelligent assistant that helps humans handle various tasks. The work not only offers a new paradigm for robot learning but also brings the vision of an "all-in-one robotic assistant" closer to reality.

ArXiv: https://arxiv.org/abs/2507.15493

Project Homepage: https://seed.bytedance.com/GR3