At a time when AI video generation is booming, videos with "pictures but no sound" or mismatched audio have remained the last barrier to immersion. To address this pain point, Alibaba's Tongyi Lab recently introduced PrismAudio, a new video-to-audio framework. The work has been accepted at ICLR 2026, a top AI conference, and its core aim is to automatically match videos with precise ambient sound effects.

Think First, Then Speak: Voice Acting with a "Chain of Thought"

Traditional video-to-audio models usually generate sounds "on intuition," which often leads to awkward results: a horse steps on the ground but a bird call plays, or the sound lags half a beat behind the visuals. PrismAudio's breakthrough is that it "takes notes first, then speaks."

  • Decomposed Chain of Thought: Before generating any audio, the model analyzes the video content: What is in the scene? When should each sound start? Should the timbre be crisp or deep? Is the sound source on the left or the right?

  • Four "Teachers" Scoring: To ensure quality, the team introduced reinforcement learning in which four "virtual teachers" score each output along four dimensions: semantic consistency, temporal synchronization, aesthetic quality, and spatial accuracy. This multi-dimensional feedback resolves the long-standing tendency of earlier models to optimize one aspect while neglecting the others.
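
The four-dimensional scoring above can be sketched as a group-relative reward in the spirit of GRPO-style training: a group of candidate clips is scored per dimension, the scores are combined, and each candidate's advantage is its reward normalized within the group. All names, weights, and numbers below are illustrative assumptions, not PrismAudio's actual implementation:

```python
# Hypothetical sketch of multi-dimensional reward scoring in a GRPO-style
# setup. Four "teachers" score a group of 4 candidate audio clips; the
# scores are combined into one reward per clip, then group-normalized.
from statistics import mean, pstdev

# Per-dimension scores for 4 candidate clips (made-up numbers).
scores = {
    "semantic":  [0.8, 0.6, 0.9, 0.5],   # does the sound match the scene?
    "temporal":  [0.7, 0.9, 0.6, 0.8],   # does it start/stop on time?
    "aesthetic": [0.9, 0.7, 0.8, 0.6],   # is it pleasant to listen to?
    "spatial":   [0.6, 0.8, 0.7, 0.9],   # is the source placed correctly?
}
weights = {"semantic": 0.3, "temporal": 0.3, "aesthetic": 0.2, "spatial": 0.2}

# Combine the four dimensions into one scalar reward per candidate.
rewards = [sum(weights[d] * scores[d][i] for d in weights) for i in range(4)]

# Group-relative advantage: normalize rewards within the group, so each
# candidate is compared against its own batch rather than a learned critic.
mu, sigma = mean(rewards), pstdev(rewards)
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
```

Because no single dimension decides the reward, a clip that nails timing but misplaces the sound source cannot dominate the group, which is the point of scoring along all four axes at once.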

Lightweight and Efficient: 0.63 Seconds for 9 Seconds of Audio

PrismAudio is not only accurate but also extremely fast. Thanks to its in-house Fast-GRPO training algorithm, the model achieves a significant performance leap while remaining highly efficient:

  • Small Size, Big Power: The model has only 518 million parameters, far below the tens of billions typical of similar models.

  • Ultra-Fast Response: Generating 9 seconds of high-quality audio takes only 0.63 seconds, effectively "instant delivery."
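
As a quick sanity check on those figures, the reported speed corresponds to a real-time factor (seconds of audio produced per second of compute) of roughly 14x:

```python
# Back-of-the-envelope check of the reported speed: 9 s of audio generated
# in 0.63 s of compute means the model runs about 14x faster than playback.
audio_seconds = 9.0
compute_seconds = 0.63
rtf = audio_seconds / compute_seconds
print(f"real-time factor: {rtf:.1f}x")  # real-time factor: 14.3x
```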

Industry Insight: The Era of Authentic Environmental Sound Effects

The emergence of PrismAudio not only gives film post-production and short-video creation a powerful automation tool, but also offers new ideas for multi-objective generation tasks. When AI can accurately balance the texture and spatial placement of sound, video creation will truly achieve "what you see is what you hear."

Paper link: arXiv:2511.18833

Project page: https://prismaudio-project.github.io/