On January 30, following its earlier "three consecutive launches" of a spatial perception model, an embodied large model, and a world model, Ant Lingbo Technology announced the open-source release of its embodied world model LingBot-VA. LingBot-VA introduces a novel autoregressive video-action world-modeling framework that deeply integrates large-scale video generation models with robot control. The model generates the next world state while directly simulating and outputting the corresponding action sequence, enabling robots to "simulate, then act" the way humans do.

In real-robot evaluations, LingBot-VA demonstrated strong adaptability to complex physical interactions. Across six challenging tasks in three categories - long-horizon tasks (making breakfast, picking up screws), high-precision tasks (inserting test tubes, opening packages), and manipulation of deformable and articulated objects (folding clothes, folding pants) - it needed only 30-50 real-robot demonstrations to adapt, and its task success rate was on average 20% higher than the strong industry baseline Pi0.5.

(Figure caption: In real-robot evaluations, LingBot-VA outperformed the industry baseline Pi0.5 on multiple difficult manipulation tasks)

In simulation evaluations, LingBot-VA became the first model to exceed a 90% success rate on RoboTwin 2.0, a high-difficulty dual-arm collaborative manipulation benchmark, and reached an average success rate of 98.5% on the lifelong learning benchmark LIBERO, setting new industry records on both.

(Figure caption: LingBot-VA surpasses the current SOTA on the LIBERO and RoboTwin 2.0 simulation benchmarks)

According to the team, LingBot-VA adopts a Mixture-of-Transformers (MoT) architecture to fuse video processing and action control across modalities. Through a closed-loop simulation mechanism, the model incorporates real-time feedback from the physical world at every generation step, keeping the generated frames and actions consistent with physical reality and enabling robots to complete complex, difficult tasks.
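
As an illustration only (not the released implementation), the minimal PyTorch sketch below shows what an MoT-style video-action block can look like: video and action tokens share one self-attention layer but are routed to modality-specific feed-forward experts, and a toy closed-loop rollout feeds the real observation, rather than the imagined frame, back into the model at each step. All module names, dimensions, and the rollout loop are assumptions.

```python
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block: shared self-attention,
    per-modality feed-forward experts (video vs. action tokens)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            "video": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "action": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, tokens, modality):  # modality: 0 = video token, 1 = action token
        # Shared attention lets video and action tokens exchange information.
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + attn_out
        # Route each token to the feed-forward expert of its own modality.
        h = self.norm2(tokens)
        out = torch.zeros_like(h)
        for idx, name in enumerate(["video", "action"]):
            sel = modality == idx
            out[sel] = self.experts[name](h[sel])
        return tokens + out


# Toy closed-loop rollout: at each step the model produces next-frame tokens
# and an action chunk; after the chunk is executed, the *observed* frame
# (not the imagined one) is encoded and fed back, keeping generation grounded.
if __name__ == "__main__":
    dim, n_video, n_action = 256, 16, 8
    block = MoTBlock(dim)
    modality = torch.cat([torch.zeros(n_video), torch.ones(n_action)]).long().unsqueeze(0)

    obs_tokens = torch.randn(1, n_video, dim)           # encoded current camera frame
    for step in range(3):
        action_queries = torch.zeros(1, n_action, dim)  # learned queries in practice
        tokens = block(torch.cat([obs_tokens, action_queries], dim=1), modality)
        pred_frame, pred_actions = tokens[:, :n_video], tokens[:, n_video:]
        # ...decode and execute pred_actions on the robot, then re-encode the
        # real camera image as the next input (random stand-in below):
        obs_tokens = torch.randn(1, n_video, dim)
```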

To overcome the computational bottleneck of running a large video world model on robot edge devices, LingBot-VA uses an asynchronous inference pipeline that parallelizes action prediction and motor execution; it also introduces a persistence mechanism built on a memory cache, together with a noisy-history augmentation strategy, so that stable, precise action commands can be produced with fewer generation steps at inference time (see the sketch below). Together, these optimizations give LingBot-VA both the deep understanding of a large model and the low-latency response required for real-robot control.
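
A hypothetical sketch of such an asynchronous pipeline is shown below: a background thread predicts the next action chunk while the main loop is still driving the motors with the current chunk, and a persistent cache object carried across calls stands in for the memory cache. `predict_chunk` and `execute_on_robot` are placeholder functions, not APIs from the release; note that the next chunk is necessarily predicted from an observation captured before the current chunk finishes executing, which is the latency-hiding trade-off such pipelines make.

```python
import queue
import threading
import time


def predict_chunk(observation, cache):
    """Stand-in for the world model: given the latest observation and a
    persistent cache (e.g. cached memory/context), return an action chunk."""
    time.sleep(0.05)                            # pretend inference latency
    chunk = [f"action[{observation}][{i}]" for i in range(4)]
    return chunk, cache                         # cache would be updated in reality


def execute_on_robot(chunk):
    """Stand-in for streaming the chunk to the motor controllers."""
    for _action in chunk:
        time.sleep(0.02)                        # pretend per-action execution time
    return f"obs_after({chunk[-1]})"            # new camera observation


def inference_worker(obs_q, chunk_q):
    cache = None                                # persists across prediction calls
    while True:
        obs = obs_q.get()
        if obs is None:                         # shutdown signal
            return
        chunk, cache = predict_chunk(obs, cache)
        chunk_q.put(chunk)


if __name__ == "__main__":
    obs_q, chunk_q = queue.Queue(), queue.Queue()
    threading.Thread(target=inference_worker, args=(obs_q, chunk_q), daemon=True).start()

    obs = "obs_0"
    obs_q.put(obs)
    chunk = chunk_q.get()                       # first chunk: nothing to overlap with yet
    for step in range(5):
        obs_q.put(obs)                          # start predicting the next chunk now...
        obs = execute_on_robot(chunk)           # ...while the current chunk is executed
        chunk = chunk_q.get()                   # next chunk is (almost) ready
    obs_q.put(None)                             # stop the worker
```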

Ant Lingbo stated that, following its earlier open-source releases of LingBot-World (simulation environment), LingBot-VLA (intelligent base), and LingBot-Depth (spatial perception), LingBot-VA charts a new path in which the world model empowers embodied manipulation. Ant Group will continue to work through the InclusionAI community to advance open source and industry collaboration, building foundational capabilities for embodied intelligence and accelerating an open-source, deeply integrated AGI ecosystem that serves real industrial scenarios.

Currently, the model weights and inference code of LingBot-VA are fully open-sourced.