The University of Science and Technology of China (USTC) and ByteDance are about to jointly launch an end-to-end long video generation model. The model directly generates high-quality videos that are **minutes long, at 480p resolution and 24 frames per second (fps)**, and it supports multi-shot scene transitions. The release is being described as a key breakthrough for Chinese video generation technology in the global generative AI race.
The core innovation is the underlying algorithm, MoGA (Mixture-of-Groups Attention), a new attention mechanism designed to address the context growth and computational cost of long video generation. With MoGA's structural optimizations, the model can process a context of up to 580K tokens at significantly lower computational cost, which makes long-duration, multi-scene video generation feasible.
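To give a sense of why a 580K-token context is expensive, the sketch below counts query-key score entries for dense self-attention and compares that with a simplified grouped scheme. It assumes, as the "mixture-of-groups-attention" name in the project URL suggests, that tokens are split into groups and attend only within their own group. The equal-sized partition, the function names, and the group count of 8 are illustrative assumptions, not details confirmed by the article or the authors' actual design.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Query-key score entries for full self-attention: N * N."""
    return num_tokens * num_tokens


def grouped_attention_pairs(num_tokens: int, num_groups: int) -> int:
    """Score entries when tokens are split into near-equal groups and
    each token attends only within its own group (hypothetical setup)."""
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    base, extra = divmod(num_tokens, num_groups)
    # `extra` groups hold base + 1 tokens; the remaining groups hold base.
    return extra * (base + 1) ** 2 + (num_groups - extra) * base ** 2


N = 580_000        # context length reported in the article
NUM_GROUPS = 8     # hypothetical group count, chosen only for illustration

dense = dense_attention_pairs(N)
grouped = grouped_attention_pairs(N, NUM_GROUPS)

print(f"dense:   {dense:,} score entries")
print(f"grouped: {grouped:,} score entries")
print(f"reduction factor: {dense / grouped:.1f}x")
```

Under these assumptions the number of score entries drops roughly in proportion to the number of groups. The real savings depend on how the model assigns tokens to groups and on any additional attention paths it keeps.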
According to the research team, traditional video generation models are constrained by memory and compute and can typically produce only a few seconds of GIF-like animation or short clips. With MoGA, the model can generate a "mini short film" in a single pass, with multiple scene transitions and coherent visual storytelling, which greatly broadens the range of applications for generative video models.
MoGA is also highly modular and compatible: it can be integrated directly with existing acceleration libraries (such as FlashAttention, xFormers, and DeepSpeed) to speed up training and inference. The technology is therefore significant not only as a research advance but also for its industrial potential, with possible applications in film and television production, advertisement generation, game cutscenes, and digital human content.
While companies such as OpenAI, Pika, and Runway continue to advance short video generation, the model from the University of Science and Technology of China and ByteDance is considered the first system in China that can genuinely generate minute-long videos. Its strengths in algorithm design, efficiency, and scalability may help put China at the global forefront of video generation.
Project page: https://jiawn-creator.github.io/mixture-of-groups-attention/