A quiet but significant breakthrough has emerged in AI video generation. Kuaishou's KlingAI recently launched a new digital human model, Avatar2.0. With just a single photo of a person and a music audio clip, users can generate a 5-minute singing video. Digital humans are no longer stiff "lip-sync" puppets but "performers" who naturally raise their eyebrows, smile with their eyes, and move their bodies to the rhythm. The upgraded model is now live on the Kling platform, marking a leap in AI content creation from "static" portraits to "dynamic storytelling."


 Core Innovation: Intelligent Leap from Audio to Emotional Performance

The core of Avatar2.0 is its multimodal director module (MLLM Director), which uses multimodal large language models (MLLMs) to convert three user inputs (image, audio, and text prompt) into a coherent storyline. The system first extracts the speech content and emotional trajectory from the audio, for example injecting "excitement" during upbeat melodies or locking onto the drum beat during rap sections. It then identifies facial features and scene elements from the single photo and incorporates user text such as "slow zoom up" or "arms moving rhythmically." Finally, by injecting these text conditions across the attention layers of the video diffusion model, it generates a globally consistent "blueprint video" with smooth rhythm and a unified style throughout.
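To make that data flow concrete, here is a minimal Python sketch of how such a director module might bundle the three inputs into a single plan. Every class and function name here (BlueprintPlan, analyze_audio, parse_image, plan_blueprint) is a hypothetical stand-in that only illustrates the described flow; it is not Kling's actual implementation or API.

```python
from dataclasses import dataclass


@dataclass
class BlueprintPlan:
    """Global storyline that the video diffusion model is conditioned on."""
    transcript: str      # spoken/sung content extracted from the audio
    emotion_track: list  # per-segment emotion labels, e.g. ["calm", "excited"]
    subject: dict        # identity and scene attributes parsed from the photo
    directions: str      # user text such as "slow zoom up"


def analyze_audio(audio_path):
    # Hypothetical stand-in: a real system would run speech recognition plus
    # prosody/emotion analysis over the audio here.
    return "la la la ...", ["calm", "excited", "excited"]


def parse_image(image_path):
    # Hypothetical stand-in: a real system would run face and scene understanding.
    return {"identity": "singer_01", "scene": "studio"}


def plan_blueprint(image_path, audio_path, prompt):
    transcript, emotions = analyze_audio(audio_path)
    subject = parse_image(image_path)
    # Fuse the three modalities into one coherent plan; per the article, this
    # plan is injected across the diffusion model's attention layers to keep
    # rhythm and style consistent over the whole clip.
    return BlueprintPlan(transcript, emotions, subject, prompt)


if __name__ == "__main__":
    plan = plan_blueprint("photo.jpg", "song.mp3", "slow zoom up, arms moving rhythmically")
    print(plan)
```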

Compared with previous versions, Avatar2.0 makes a qualitative leap in expression control: smiles, anger, confusion, and moments of emphasis appear naturally, avoiding the "facial paralysis" of early AI characters. Motion design is also more flexible, going beyond talking-head lip-sync to full-body performance, including shoulder shrugs and gestural emphasis timed precisely to the music. On a benchmark of 375 test cases, each a reference image, audio, and text prompt triplet, the model achieves over 90% response accuracy in complex singing scenarios, and it supports real people, AI-generated images, and even animal or cartoon characters.
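The reported figure is simply the fraction of triplet test cases judged correct. Below is a toy scoring loop for such a benchmark; only the 375-case count and the over-90% figure come from the article, while the case format, file names, and the dummy judge are assumptions for illustration.

```python
def score_benchmark(cases, judge):
    """cases: list of (image, audio, prompt) triplets; judge: callable that
    returns True when the generated video matches the audio and prompt."""
    passed = sum(1 for image, audio, prompt in cases if judge(image, audio, prompt))
    return passed / len(cases)


# Dummy data and a dummy judge purely for demonstration:
# mark 340 of 375 cases as correct, i.e. roughly 90.7% accuracy.
cases = [(f"img_{i}.png", f"clip_{i}.wav", "sing with emotion") for i in range(375)]
verdicts = {f"img_{i}.png": i < 340 for i in range(375)}
accuracy = score_benchmark(cases, judge=lambda img, aud, prompt: verdicts[img])
print(f"response accuracy: {accuracy:.1%}")
```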

 Technical Support: High-quality Data and Two-stage Generation Framework

To achieve stable output of minute-scale videos, the Kuaishou Kling team built a rigorous training pipeline. They collected thousands of hours of video from speech, dialogue, and singing corpora, screened it with expert models along dimensions such as mouth clarity, audio-visual synchronization, and aesthetic quality, and, after manual review, retained a high-quality dataset of hundreds of hours. The generation framework uses a two-stage design: the first stage plans the global semantics through the blueprint video; the second stage takes the first and last frames of each sub-segment as conditions and generates the sub-segment videos in parallel, preserving identity consistency and dynamic coherence.
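A minimal sketch of that two-stage idea, under the stated assumptions: a coarse global pass yields the blueprint, and each sub-segment is then regenerated in parallel with its boundary frames pinned to the blueprint. All function names are hypothetical, and the "frames" are placeholder strings rather than real diffusion outputs.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_blueprint(plan, num_frames):
    # Hypothetical stand-in for the stage-one pass: a coarse, globally planned video.
    return [f"blueprint_frame_{i}" for i in range(num_frames)]


def generate_segment(first_frame, last_frame, plan):
    # Hypothetical stand-in for the stage-two pass: a detailed sub-clip whose
    # endpoints are pinned to the blueprint frames, which keeps identity and
    # motion consistent across segment boundaries.
    return [first_frame, f"...detailed frames following {plan!r}...", last_frame]


def two_stage_generate(plan, num_frames=240, segment_len=48):
    blueprint = generate_blueprint(plan, num_frames)
    # Split the blueprint into sub-segments and keep each segment's boundary frames.
    bounds = [(blueprint[i], blueprint[min(i + segment_len, num_frames) - 1])
              for i in range(0, num_frames, segment_len)]
    # Given their boundary frames, sub-segments are independent of one another,
    # so they can be generated in parallel.
    with ThreadPoolExecutor() as pool:
        segments = list(pool.map(lambda b: generate_segment(*b, plan), bounds))
    return [frame for seg in segments for frame in seg]


if __name__ == "__main__":
    video = two_stage_generate(plan="blueprint plan from the director module")
    print(len(video), "placeholder frames")
```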

Additionally, Avatar2.0 supports frame rates of up to 48fps and 1080p HD output, with animation smoothness well above the industry average. Users can try the basic features for free on the Kling platform (https://app.klingai.com/cn/ai-human/image/new), while advanced long-form videos require a subscription plan. Platform data shows that the number of generated videos grew 300% on launch day, with user feedback centering on "emotional authenticity" and "ease of operation."

 Application Prospects: Reshaping Short Video and Marketing Ecosystem

The model's rollout will deeply affect fields such as short video, e-commerce advertising, and educational content. Podcast creators can turn pure audio into a visual performance, instantly boosting its appeal on YouTube or Douyin; e-commerce sellers need only upload product photos and an audio explanation to generate multilingual demonstration videos at roughly one tenth the cost of traditional shooting. Music enthusiasts can experiment with "virtual concerts": feed in a melody generated by Suno AI, and Avatar2.0 can have the digital human "sing" an emotionally engaging MV, with support even for multi-person interactive scenes.

In the global AI wave, KlingAI Avatar2.0 is not just a technical iteration but a catalyst for the democratization of creation. It lets ordinary users "direct" professional-grade videos without barriers, foreshadowing a shift in content production from "labor-intensive" to "AI-powered." Experts caution, however, that the convenience brings copyright and ethical challenges, such as the compliant use of celebrities' likenesses.