Today, Meituan's LongCat team officially released its new video generation model, LongCat-Video. With its ability to faithfully reconstruct how the real world behaves over time, the model marks a significant advance in Meituan's exploration of world models. A world model is a core engine for the next generation of artificial intelligence, helping AI better understand, predict, and reconstruct the dynamics of the real world.

LongCat-Video is built on a Diffusion Transformer (DiT) architecture and unifies text-to-video, image-to-video, and video continuation in a single model. It distinguishes these tasks simply by the number of conditioning frames it receives, which lets one backbone generate well under different input conditions. In text-to-video generation, LongCat-Video outputs high-definition 720p video at 30 fps, with semantic understanding and visual quality that lead the open-source field. In image-to-video generation, it strictly preserves the attributes and style of the reference image while producing natural, smooth motion.
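To make the task-routing idea concrete, here is a minimal Python sketch of how a single backbone might select among the three tasks by conditioning-frame count. The helper name, tensor shapes, and exact rule are illustrative assumptions, not the released API.

```python
from typing import Optional, Tuple

import torch

def route_task(cond_frames: Optional[torch.Tensor]) -> Tuple[str, int]:
    """Pick the generation task from the conditioning-frame count.

    Hypothetical helper: the announcement says one DiT backbone covers
    all three tasks, distinguished by how many conditioning frames are
    supplied; the exact rule below is an assumption.
    """
    if cond_frames is None or cond_frames.shape[0] == 0:
        return "text_to_video", 0           # text prompt only
    if cond_frames.shape[0] == 1:
        return "image_to_video", 1          # a single reference image
    return "video_continuation", cond_frames.shape[0]  # a leading clip

print(route_task(None))                          # ('text_to_video', 0)
print(route_task(torch.randn(1, 3, 480, 832)))   # ('image_to_video', 1)
print(route_task(torch.randn(16, 3, 480, 832)))  # ('video_continuation', 16)
```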

The most striking feature of LongCat-Video is its long video generation capability. Because the model is pre-trained on video continuation tasks, it can stably produce coherent videos of up to 5 minutes while avoiding common failure modes such as color drift, quality degradation, and motion discontinuities. This breakthrough not only raises the quality of generated video but also lays a solid technical foundation for deeply interactive scenarios such as autonomous driving and embodied intelligence.
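One plausible way continuation pre-training translates into minutes-long output is chunk-wise autoregressive sampling: generate a first clip from text, then repeatedly continue from the tail of what already exists. The sketch below assumes a hypothetical `model.sample` interface and illustrative frame counts (5 minutes at 30 fps is 9,000 frames); it is not LongCat-Video's actual schedule.

```python
import torch

def generate_long_video(model, prompt: str,
                        chunk_frames: int = 90,
                        context_frames: int = 15,
                        total_frames: int = 9000) -> torch.Tensor:
    """Chunk-wise continuation sketch for minutes-long videos."""
    # First chunk is plain text-to-video: no conditioning frames.
    video = model.sample(prompt, cond=None, num_frames=chunk_frames)
    while video.shape[0] < total_frames:
        # Condition each new chunk on the tail of the video so far, so
        # appearance and motion stay continuous across chunk borders
        # (the drift issues mentioned above arise when this fails).
        context = video[-context_frames:]
        chunk = model.sample(prompt, cond=context, num_frames=chunk_frames)
        video = torch.cat([video, chunk], dim=0)
    return video[:total_frames]
```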

For efficient inference, LongCat-Video adopts a two-stage coarse-to-fine generation strategy, combined with block-sparse attention (BSA) and model distillation. Together these optimizations yield a 10.1x speedup in inference while preserving generation quality, even on long videos.
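The BSA idea can be sketched in a few lines of PyTorch: score coarse block summaries first, then attend only within the top-scoring key/value blocks. The block size, the pooling used for summaries, and the keep ratio below are illustrative assumptions; a production kernel would skip masked blocks outright rather than build a dense mask.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block: int = 64,
                           keep_ratio: float = 0.125) -> torch.Tensor:
    """Toy block-sparse attention over (B, H, N, D) tensors."""
    B, H, N, D = q.shape
    nb = N // block                                   # number of blocks
    # 1) Coarse scores between mean-pooled query and key blocks.
    q_blk = q.view(B, H, nb, block, D).mean(dim=3)    # (B, H, nb, D)
    k_blk = k.view(B, H, nb, block, D).mean(dim=3)
    coarse = q_blk @ k_blk.transpose(-1, -2)          # (B, H, nb, nb)
    # 2) Keep only the top-k key blocks for each query block.
    topk = max(1, int(keep_ratio * nb))
    keep = coarse.topk(topk, dim=-1).indices          # (B, H, nb, topk)
    mask = torch.zeros(B, H, nb, nb, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, keep, torch.ones_like(keep, dtype=torch.bool))
    # 3) Expand the block mask to token resolution and attend densely
    #    (for clarity only; a real kernel never materializes this mask).
    mask = mask.repeat_interleave(block, 2).repeat_interleave(block, 3)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 8, 1024, 64)
out = block_sparse_attention(q, k, v)   # same shape as q: (1, 8, 1024, 64)
```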

In both internal and public benchmark testing, LongCat-Video performs strongly across dimensions such as text alignment, visual quality, and motion quality, reaching state-of-the-art (SOTA) levels among current open-source models. The team says the release will greatly simplify long-video creation, letting creators go from one second of inspiration to a five-minute finished product.

To let more people experience the technology, Meituan has released LongCat-Video's resources on GitHub and Hugging Face. The project not only gives individual creators a powerful tool but also injects new vitality into the entire video creation industry.
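For readers who want to fetch the released files, the standard huggingface_hub download pattern applies; the repository id below is inferred from the announcement, so verify it against the official pages.

```python
from huggingface_hub import snapshot_download

# Assumed repo id; confirm on Meituan's GitHub / Hugging Face pages.
local_dir = snapshot_download(repo_id="meituan-longcat/LongCat-Video")
print("LongCat-Video files downloaded to:", local_dir)
```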