Google DeepMind has introduced a video-to-audio technology called V2A. This technology leverages video pixels and text prompts to generate rich audio tracks, creating soundtracks for silent videos and achieving synchronized audio-visual generation.
Product Entry: https://top.aibase.com/tool/deepmind-v2a
Users can steer the audio output with "positive prompts" that describe the sounds they want or "negative prompts" that describe the sounds to avoid. According to DeepMind, the team experimented with both autoregressive and diffusion approaches and found that the diffusion-based approach gave the most realistic results for synchronizing audio with video. During training, the system uses AI-generated annotations to help the model learn the relationship between specific audio events and visual scenes.
Operating Principle:
The V2A system first encodes the video input into a compressed representation. Then, a diffusion model iteratively refines audio from random noise. This process is guided by visual input and natural language prompts to generate synchronized, realistic audio that closely matches the prompts. Finally, the audio output is decoded into audio waveforms and combined with the video data.
The V2A system diagram shows how video pixels and audio prompts are used to generate audio waveforms synchronized with the underlying video. Initially, V2A encodes the video and audio prompt inputs and runs them iteratively through a diffusion model. It then generates compressed audio and decodes it into audio waveforms.
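The encode, refine, and decode stages described above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: DeepMind has not published V2A's code, so every function name here (encode_video, encode_prompt, denoise_step, decode_audio, generate_audio) is hypothetical, and the "denoiser" is a hand-written stand-in for a learned network.

```python
import math
import random

def encode_video(frames):
    """Toy encoder: compress each frame (a list of pixel values) to its mean."""
    return [sum(frame) / len(frame) for frame in frames]

def encode_prompt(prompt, dim):
    """Toy text encoder: a deterministic, normalized vector built from character codes."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += (ord(ch) % 32) / 32.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def denoise_step(latent, video_code, prompt_code, strength=0.2):
    """Stand-in for a learned denoiser (assumption): nudge the latent
    toward a target derived from the video and prompt conditioning."""
    target = [v + p for v, p in zip(video_code, prompt_code)]
    return [x + strength * (t - x) for x, t in zip(latent, target)]

def decode_audio(latent, samples_per_step=4):
    """Toy decoder: expand each latent value into a short run of waveform samples in [-1, 1]."""
    wave = []
    for value in latent:
        wave.extend(math.tanh(value) for _ in range(samples_per_step))
    return wave

def generate_audio(frames, prompt, steps=50, seed=0):
    rng = random.Random(seed)
    video_code = encode_video(frames)                    # 1. compress the video
    prompt_code = encode_prompt(prompt, len(video_code))
    latent = [rng.gauss(0.0, 1.0) for _ in video_code]   # 2. start from random noise
    for _ in range(steps):                               # 3. iterative refinement
        latent = denoise_step(latent, video_code, prompt_code)
    return decode_audio(latent)                          # 4. decode to a waveform
```

Calling generate_audio(frames, "wolf howling") on a short list of frames returns a list of samples whose values depend on both the frames and the prompt, which mirrors the conditioning described in the article without claiming to reproduce the real model.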
To produce higher-quality audio and improve the model's ability to generate specific sounds, additional information is added during training, including AI-generated annotations with detailed sound descriptions and transcripts of spoken dialogue.
By training on video, audio, and these additional annotations, the technology learns to associate specific audio events with various visual scenes while responding to the information provided in the annotations or transcripts.
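One way to picture training on video, audio, and annotations together is the standard noise-prediction objective used by many diffusion models. The sketch below assumes that objective; the article does not say which loss V2A uses, and the TrainingExample fields, add_noise, and denoising_loss are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class TrainingExample:
    video_code: list    # compressed video features (hypothetical representation)
    audio_latent: list  # compressed target audio (hypothetical representation)
    annotation: str     # AI-generated sound description
    transcript: str     # transcript of spoken dialogue, may be empty

def add_noise(audio_latent, noise, noise_level):
    """Blend the clean audio latent with noise; noise_level lies in (0, 1)."""
    keep = (1.0 - noise_level ** 2) ** 0.5
    return [keep * a + noise_level * n for a, n in zip(audio_latent, noise)]

def denoising_loss(predict_noise, example, noise_level, rng):
    """Mean squared error between the noise that was added and the model's guess.
    predict_noise receives the whole example, so it can condition on the
    video features, the annotation, and the transcript."""
    noise = [rng.gauss(0.0, 1.0) for _ in example.audio_latent]
    noisy = add_noise(example.audio_latent, noise, noise_level)
    predicted = predict_noise(noisy, example, noise_level)
    return sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(noise)
```

In this framing, a model that uses the video and annotation well can predict the added noise more accurately, which lowers the loss and ties particular sounds to particular visual scenes.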
V2A Features:
Audio Generation: V2A automatically generates synchronized audio tracks based on video footage and user-provided text descriptions, including dramatic soundtracks, realistic sound effects, or dialogue that matches the video's characters and tone.
Synchronized Audio: V2A aims to keep the generated audio synchronized with the video content while sounding realistic. DeepMind reports experimenting with autoregressive and diffusion approaches, with the diffusion-based approach giving the most realistic results.
Diverse Audio Tracks: Users can generate an unlimited number of audio tracks, experimenting with different sound combinations to find the perfect fit for their video content.
Prompt Control: Users can guide audio track generation by defining "positive prompts" or "negative prompts," increasing control over the output and steering it away from unwanted sounds.
Training with Annotations: During training, the system uses AI-generated annotations to help the model understand the relationship between specific audio events and visual scenes.
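The positive and negative prompts in the feature list above resemble a technique widely used in diffusion models, classifier-free guidance, where the negative prompt takes the place of the unconditional prediction. Whether V2A works this way is not stated, so the sketch below is an assumption, and guided_prediction and guidance_scale are hypothetical names.

```python
def guided_prediction(pred_positive, pred_negative, guidance_scale=3.0):
    """Combine two denoiser predictions for the same noisy input.

    pred_positive: prediction conditioned on the positive prompt
    pred_negative: prediction conditioned on the negative prompt
    The result starts at the negative prediction and moves along the
    direction (positive - negative), scaled by guidance_scale, so larger
    scales push harder toward wanted sounds and away from unwanted ones.
    """
    return [n + guidance_scale * (p - n)
            for p, n in zip(pred_positive, pred_negative)]

# With a scale of 1.0 the result equals the positive prediction;
# larger scales extrapolate beyond it.
print(guided_prediction([1.0, 0.0], [0.0, 0.0], guidance_scale=3.0))  # [3.0, 0.0]
```

The design choice here is that a negative prompt does not need a separate mechanism: it simply changes the reference point that each refinement step moves away from.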
To improve audio quality, the research team added further information during training, such as AI-generated annotations with detailed sound descriptions and transcripts of spoken dialogue. This richer training signal helps the technology interpret video content and produce audio that matches the visual scene.
Challenges remain, particularly lip synchronization in videos that involve speech. V2A attempts to generate speech from an input transcript and synchronize it with the characters' lip movements. However, the video generation model that produced the footage may not be conditioned on that transcript, so the mouth movements in the video do not necessarily match the words being spoken, which often results in unnatural lip sync.
Before being made available to the public, the V2A technology will undergo rigorous safety assessments and testing. Below are some dubbing examples generated by V2A:
1. Audio Prompt: Wolf howling at the moon
2. Audio Prompt: Movie, thriller, horror, music, tension, atmosphere, footsteps on concrete
3. Audio Prompt: Drummer on a concert stage surrounded by flickering lights and a cheering crowd
4. Audio Prompt: Cute little dinosaur chirping, jungle atmosphere, egg cracking
Note: The videos in this article are from official Google examples.