Researchers at Alibaba Group have introduced VACE, a versatile artificial intelligence model that unifies a wide range of video generation and editing tasks within a single framework.
At the heart of VACE is an enhanced diffusion Transformer architecture. The model's key innovation is a novel input format, the "Video Conditional Unit" (VCU), which distills diverse input modalities, such as text prompts, reference images or video sequences, and spatial masks, into a single unified representation. A dedicated mechanism coordinates these heterogeneous inputs and prevents them from conflicting.
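To make the idea concrete, here is a minimal sketch of what a VCU-style container might look like. The class name, field layout, tensor shapes, and the fallback defaults in normalized() are all illustrative assumptions based on the article's description of a unified input, not VACE's actual API.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class VideoConditionalUnit:
    """Hypothetical container bundling the three VCU modalities.

    Shapes are illustrative: F frames, C channels, H x W resolution.
    """
    text_prompt: str                          # textual instruction
    frames: Optional[torch.Tensor] = None     # (F, C, H, W) reference/source video
    masks: Optional[torch.Tensor] = None      # (F, 1, H, W) spatial masks in [0, 1]

    def normalized(self) -> "VideoConditionalUnit":
        # Assumption: tasks that supply no frames/masks fall back to neutral
        # defaults, so every task presents one interface to the backbone.
        F, C, H, W = 16, 3, 480, 832          # assumed default generation size
        frames = self.frames if self.frames is not None else torch.zeros(F, C, H, W)
        masks = self.masks if self.masks is not None else torch.ones(F, 1, H, W)
        return VideoConditionalUnit(self.text_prompt, frames, masks)
```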
Concept Decoupling for Fine-Grained Control
VACE employs a "concept decoupling" technique that segments each frame into editable and fixed regions, allowing precise control over what is modified and what is preserved. Visual information is split into "active" and "inactive" areas via masking, embedded in a shared feature space, and combined with the text input. To ensure inter-frame consistency, these features are mapped into a latent space that matches the diffusion Transformer's structure; a temporal embedding layer encodes frame ordering, while attention mechanisms link features across modalities and time steps.
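The active/inactive split can be pictured as elementwise masking before tokenization. The sketch below is a simplified illustration under assumed tensor shapes; the function name and the downstream encoding described in the comments are assumptions, not the released implementation.

```python
import torch

def decouple_concepts(frames: torch.Tensor, masks: torch.Tensor):
    """Split video pixels into editable ('active') and preserved ('inactive') streams.

    frames: (F, C, H, W) pixel values; masks: (F, 1, H, W) with 1 = editable.
    """
    active = frames * masks            # regions the model may regenerate
    inactive = frames * (1.0 - masks)  # regions that must stay untouched
    return active, inactive

# Presumably, both streams are then encoded into the shared latent space,
# flattened into token sequences, and combined with text embeddings so that
# attention can relate modified content to its fixed context across frames.
```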
VACE supports four core tasks: text-to-video generation, reference-based video synthesis, video-to-video editing, and mask-based object editing. Its applications are diverse, including person removal, animated character generation, object replacement, and background extension.
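How four distinct tasks can share one interface follows from the VCU defaults described above. The mapping below is an illustrative summary under that assumption: the function, task names, and the specific frame/mask policies are hypothetical.

```python
import torch

def vcu_inputs(task: str, source=None, mask=None, F=16, C=3, H=480, W=832):
    """Return assumed (frames, masks) conditioning for each core task.

    Convention (an assumption): zero frames mean 'nothing given',
    all-ones masks mean 'regenerate everywhere'.
    """
    zeros = torch.zeros(F, C, H, W)
    ones = torch.ones(F, 1, H, W)
    if task == "text_to_video":
        return zeros, ones    # generate entirely from the text prompt
    if task == "reference_to_video":
        return source, ones   # reference frames guide appearance
    if task == "video_to_video":
        return source, ones   # edit the whole source clip
    if task == "masked_editing":
        return source, mask   # edit only inside the user-provided mask
    raise ValueError(f"unknown task: {task}")
```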
Model Training and Evaluation
The research team began with drawing- and doodle-based inputs to support text-to-video generation, then gradually incorporated reference images and moved on to more advanced editing tasks. Training data was sourced from internet videos and enriched through automated filtering, segmentation, and depth and pose annotations. To evaluate VACE's performance, the researchers built a benchmark of 480 cases covering 12 video editing tasks. Experimental results show that VACE outperforms dedicated open-source models in both quantitative metrics and user studies, though it still trails commercial models such as Vidu and Kling in reference-to-video generation.
Alibaba researchers see VACE as a significant step toward a universal, multi-modal video model, with future development focused on larger datasets and greater computing power. Parts of the model's code will be open-sourced on GitHub. Together with Alibaba's recently released large language models (such as the Qwen series), VACE forms part of the company's broader AI strategy. Other Chinese tech giants, including ByteDance, are also actively developing video AI technologies, with some results surpassing those of Western counterparts.