ByteDance's Intelligent Creation team, together with Tsinghua University, has released an open-source framework called HuMo that aims to advance the field of Human-Centric Video Generation (HCVG). HuMo can process multiple input modalities at once, drawing on text, images, and audio together to generate high-quality videos.

The name "Human-Modal" of HuMo accurately reflects its focus on humans and their activities. The success of this framework lies in its construction of a high-quality dataset and the innovative use of a progressive training method. This training approach enables HuMo to outperform existing specialized methods in various subtasks, generating videos with resolutions up to 480P and 720P, with a maximum length of 97 frames, outputting controllable character videos at 25 frames per second.


The framework's core advantages lie in its innovative data processing pipeline, flexible inference strategies, and progressive multimodal training approach. Together, these techniques improve both the quality of the generated videos and the processing speed, making HuMo perform strongly in practical applications.
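The sketch below illustrates the general idea behind a progressive multimodal curriculum: conditioning modalities are introduced in stages rather than all at once. The stage names, split, and all code identifiers are assumptions for illustration, not the authors' exact training schedule.

```python
# Toy sketch of progressive multimodal training (an assumed two-stage split,
# not HuMo's documented schedule): condition on text + image first, then
# introduce audio in a later stage.

STAGES = [
    ("stage_1_text_image", ("text", "image")),
    ("stage_2_add_audio",  ("text", "image", "audio")),
]

class ToyDataset:
    """Stand-in dataset that returns only the requested condition keys."""
    def sample(self, conditions):
        full = {"text": "prompt", "image": "ref.png", "audio": "clip.wav"}
        return {key: full[key] for key in conditions}

def train(update_fn, dataset, stages=STAGES, steps_per_stage=3):
    for name, conditions in stages:
        for _ in range(steps_per_stage):
            batch = dataset.sample(conditions)
            update_fn(batch)  # one optimizer step under the active conditions
        print(f"finished {name} with conditions {conditions}")

# Dummy update function stands in for a real training step.
train(lambda batch: None, ToyDataset())
```

The appeal of this kind of curriculum is that the model can first learn a simpler conditioning task before the full multimodal objective is imposed, which is consistent with the article's claim that progressive training is key to HuMo's results.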

For developers and researchers, HuMo is not just a new tool but a flexible solution that adapts to different scenarios. Because the project is open source, more people can take part in researching and applying the technology, exploring new possibilities for multimodal video generation.

Paper address: https://arxiv.org/pdf/2509.08519