Meta has launched SAM Audio, which it describes as the world's first unified multimodal audio separation model and a major breakthrough in audio processing. The model lets users "hear with their eyes," extracting any target sound from mixed video or audio in a single action: click on the guitarist in a video to instantly isolate a clean guitar track; type "dog bark" to filter dog noises out of an entire podcast; or select a time segment to precisely remove an interfering sound. Meta claims it is the first system to translate the ways humans naturally single out sounds (seeing, speaking, pointing, and selecting) into an AI interface.

At the core of SAM Audio is the Perception Encoder Audio-Visual (PE-AV), which Meta calls the model's "ear." This engine builds on the open-source Meta Perception Encoder, a computer vision model released this April, and is the first to integrate advanced visual understanding with audio signals, enabling cross-modal sound localization and separation.
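Meta's announcement does not detail PE-AV's internals. Purely as an illustration of the general idea behind cross-modal alignment (every class name, dimension, and architecture choice below is an assumption, not Meta's design), visual and audio features can be projected into a shared embedding space so that image regions and audio frames are matched by similarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Illustrative sketch only: projects visual and audio features into a
    shared space so a clicked region can be matched to the sound it emits.
    PE-AV's actual architecture is not described in the announcement."""

    def __init__(self, vis_dim=1024, aud_dim=512, embed_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)  # visual features -> shared space
        self.aud_proj = nn.Linear(aud_dim, embed_dim)  # audio features  -> shared space

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (regions, vis_dim), aud_feats: (time_steps, aud_dim)
        v = F.normalize(self.vis_proj(vis_feats), dim=-1)
        a = F.normalize(self.aud_proj(aud_feats), dim=-1)
        # Cosine similarity between every image region and every audio frame;
        # a high score suggests "this region is making this sound."
        return v @ a.T  # (regions, time_steps)

# Toy usage: 16 candidate regions, 100 audio frames.
aligner = CrossModalAligner()
scores = aligner(torch.randn(16, 1024), torch.randn(100, 512))
print(scores.shape)  # torch.Size([16, 100])
```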

Specifically, SAM Audio supports three intuitive interaction methods, which can be used individually or in combination (a hypothetical usage sketch follows the list):

- Text prompts: Enter semantic descriptions such as "vocal singing" or "car horn," and the system automatically extracts the corresponding sound source;

- Visual prompts: Click on the sound source in the video (such as a person speaking or hands drumming), and the system separates that source's audio;

- Time segment prompts (an industry first): Mark the time interval in which the target sound appears (e.g., "3 minutes 12 seconds to 3 minutes 18 seconds"), and the model automatically processes similar sounds across the entire recording; Meta compares this feature to the "braindance" technology in Cyberpunk 2077.
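Meta's announcement does not include a public API, so the following is a purely hypothetical sketch of how these three prompt types might be expressed in code; every name here (TextPrompt, VisualPrompt, TimeSpanPrompt, separate) is invented for illustration and does not come from Meta's release:

```python
from dataclasses import dataclass

@dataclass
class TextPrompt:
    description: str   # e.g. "vocal singing", "car horn"

@dataclass
class VisualPrompt:
    frame_index: int   # frame the user clicked in
    point: tuple       # (x, y) pixel coordinates of the click

@dataclass
class TimeSpanPrompt:
    start_s: float     # segment where the target sound occurs
    end_s: float

def separate(media_path, prompts):
    """Stand-in for a real separation call: a real implementation would
    return the isolated target track plus the residual (everything else)."""
    raise NotImplementedError("illustrative sketch only")

# The three prompt types can be combined, per Meta's description:
prompts = [
    TextPrompt("guitar"),
    VisualPrompt(frame_index=240, point=(512, 300)),
    TimeSpanPrompt(start_s=192.0, end_s=198.0),  # 3:12 to 3:18
]
# target, residual = separate("concert.mp4", prompts)
```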

To help standardize evaluation in the field, Meta has also open-sourced two key tools:

- SAM Audio-Bench: The first audio separation evaluation benchmark based on real-world scenarios;

- SAM Audio Judge: The world's first automatic evaluation model specifically for audio separation quality, capable of quantitatively assessing the purity and completeness of the separation results.
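The announcement does not say which metrics SAM Audio Judge reports. For context on what "quantitatively assessing" separation quality can mean, below is the classical scale-invariant signal-to-distortion ratio (SI-SDR), a standard reference-based measure of separation quality; a learned judge presumably aims to produce this kind of score without needing a clean reference track:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better). This is a classical
    reference-based metric, shown for context; SAM Audio Judge's actual
    scoring method is not described in the announcement."""
    # Project the estimate onto the reference to remove scale differences.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # part of the estimate matching the reference
    noise = estimate - target    # everything else: distortion and leakage
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

# Toy check: a clean copy scores very high, a noisy copy scores lower.
t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 440 * t)                        # 440 Hz reference tone
print(si_sdr(ref.copy(), ref))                           # essentially perfect
print(si_sdr(ref + 0.1 * np.random.randn(16000), ref))   # degraded
```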

The newly released PE-AV is not only the engine underlying SAM Audio; it will also power other Meta AI products, including subtitle generation, video understanding, and intelligent editing systems. Because it is open source, developers can build their own "synesthetic" AI applications, from automatic noise reduction in meeting recordings to immersive AR audio interactions and assistive hearing devices for accessibility.

Amid today's explosive growth in video content, the release of SAM Audio marks the start of a new chapter in audio processing: sound that is interactive, editable, and understandable. In the past we could only receive sound passively; now Meta offers the superpower of "selective listening," and this may be just the first step in multimodal AI reshaping our sensory experience.