Recently, the Speech Team at Qwen Lab has achieved a milestone in spatial audio generation by introducing the OmniAudio technology, which can directly generate First-order Ambisonics (FOA) audio from 360° videos, opening up new possibilities for virtual reality and immersive entertainment.

Spatial audio simulates a real auditory environment and is central to immersive experiences. Existing video-to-audio generation techniques, however, mostly produce non-spatial audio, which cannot deliver the 3D sound localization that immersion requires, and they typically operate on fixed-perspective videos, ignoring the rich visual context and spatial information available in 360° panoramic footage. With 360° cameras becoming increasingly common and virtual reality technology advancing, generating spatial audio that matches panoramic video has become a pressing problem.

To tackle these challenges, Qwen Lab proposed the 360V2SA (360-degree Video to Spatial Audio) task. Its target output is First-order Ambisonics (FOA), a standard 3D spatial audio format represented by four channels (W, X, Y, Z) that captures sound directionality, enabling realistic 3D audio reproduction and accurate sound localization even as the listener's head rotates.
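To make the FOA representation concrete, below is a minimal, illustrative sketch (not taken from the paper) of encoding a mono source at a given azimuth and elevation into the four W/X/Y/Z channels. The W scaling shown follows the FuMa convention; other toolchains use SN3D normalization, so treat the exact constants as an assumption.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into four FOA channels (W, X, Y, Z).

    W carries the omnidirectional pressure; X/Y/Z carry the front-back,
    left-right and up-down directional components. Angles are in radians.
    FuMa-style W scaling is assumed here; conventions vary.
    """
    w = mono * (1.0 / np.sqrt(2.0))                  # omnidirectional component
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = mono * np.sin(elevation)                     # up-down
    return np.stack([w, x, y, z], axis=0)            # shape: (4, num_samples)

# Example: a 1 kHz tone placed 90 degrees to the left, at ear height.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
foa = encode_foa(tone, azimuth=np.pi / 2, elevation=0.0)
print(foa.shape)  # (4, 16000)
```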


Data is the cornerstone of machine learning models, but paired datasets of 360° video and spatial audio are scarce. To address this, the research team carefully constructed the Sphere360 dataset, containing over 103,000 real-world video clips that span 288 audio event types and total 288 hours, each pairing 360° visual content with FOA audio. During construction, the team applied rigorous screening and cleaning criteria, using multiple algorithms to ensure high-quality audio-visual alignment.
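For intuition about what a paired sample looks like, here is a hypothetical loader sketch. The field names and file layout are assumptions for illustration only; the actual Sphere360 release defines its own structure (see the GitHub repository linked below).

```python
import torchaudio
from torch.utils.data import Dataset

class Sphere360Pairs(Dataset):
    """Hypothetical loader for paired 360-degree video / FOA audio clips."""

    def __init__(self, manifest):
        # `manifest` is assumed to be a list of dicts with paths to an
        # equirectangular 360° video file and a 4-channel FOA WAV file.
        self.manifest = manifest

    def __len__(self):
        return len(self.manifest)

    def __getitem__(self, idx):
        item = self.manifest[idx]
        # FOA audio: shape (4, num_samples); channel order W, X, Y, Z assumed.
        foa, sample_rate = torchaudio.load(item["foa_wav"])
        return {
            "video_path": item["video_360"],  # equirectangular 360° frames
            "foa": foa,
            "sample_rate": sample_rate,
        }
```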

OmniAudio is trained in two stages. The first stage is self-supervised coarse-to-fine flow-matching pretraining. To exploit large-scale non-spatial audio resources, the team converts stereo audio into a "pseudo-FOA" format and feeds it into a four-channel VAE encoder to obtain latent representations. Random time-window masking is then applied with a certain probability, and the masked latent sequence together with the complete sequence serves as conditioning for the flow-matching model, so that the model learns general audio features and coarse temporal structure in a self-supervised way. The pretraining is then refined on real FOA audio only, keeping the same masked flow-matching framework, to strengthen the model's representation of sound-source directions and its reconstruction of high-fidelity spatial detail. The second stage is supervised fine-tuning with a dual-branch video representation: the pretrained model is combined with a dual-branch video encoder and fine-tuned to "carve" FOA latent trajectories that match the visual cues out of noise, producing four-channel spatial audio that is tightly aligned with the 360° video and has a precise sense of direction.
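As a rough illustration of the pretraining recipe described above, the sketch below shows (a) one plausible stereo-to-pseudo-FOA conversion, (b) random time-window masking of latent sequences, and (c) a rectified-flow-style conditional flow-matching loss. The conversion formula, masking hyperparameters, and model interface are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stereo_to_pseudo_foa(left, right):
    """Assumed stereo -> pseudo-FOA conversion (not the paper's exact formula).

    The mid signal approximates the omnidirectional W channel and the side
    signal the left-right Y channel; X and Z stay empty because a stereo
    recording carries no front-back or height information.
    """
    w = 0.5 * (left + right)
    y = 0.5 * (left - right)
    x = torch.zeros_like(w)
    z = torch.zeros_like(w)
    return torch.stack([w, x, y, z], dim=0)        # (4, num_samples)

def mask_time_windows(latents, mask_prob=0.5, window=16):
    """Randomly zero out contiguous time windows of a latent sequence.

    `latents` has shape (batch, channels, T); window length and probability
    are illustrative hyperparameters.
    """
    masked = latents.clone()
    for start in range(0, latents.shape[-1], window):
        if torch.rand(()) < mask_prob:
            masked[..., start:start + window] = 0.0
    return masked

def flow_matching_loss(model, latents, masked_latents):
    """One conditional flow-matching training step (rectified-flow style).

    The model is assumed to take the noisy sample, the time step, and the
    masked conditioning sequence, and to predict a velocity field.
    """
    noise = torch.randn_like(latents)
    t = torch.rand(latents.size(0), 1, 1)          # per-example time in [0, 1]
    x_t = (1.0 - t) * noise + t * latents          # straight-line interpolation
    target_velocity = latents - noise              # rectified-flow target
    pred_velocity = model(x_t, t.squeeze(), cond=masked_latents)
    return F.mse_loss(pred_velocity, target_velocity)
```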

In the experimental setup, the research team evaluated the fine-tuned model on the Sphere360-Bench and YT360-Test test sets, using both objective and subjective metrics to measure the quality of the generated audio. OmniAudio significantly outperformed all baselines on both test sets: on YT360-Test it achieved substantially lower FD, KL, and ΔAngular scores, and it likewise performed strongly on Sphere360-Bench. In human evaluations, OmniAudio scored much higher than the best baseline on spatial audio quality and audio-visual alignment, demonstrating its superiority in clarity, spatial impression, and synchronization with the visuals. Ablation experiments further verified the contributions of the pretraining strategy, the dual-branch design, and model scale to the performance gains.
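For readers unfamiliar with the direction-error metric, the sketch below shows one way a ΔAngular-style score can be computed: estimate a dominant source direction from each FOA clip via the time-averaged first-order intensity vector, then measure the angle between the generated and reference directions. This is a common DOA estimate for FOA and is given here as an assumed stand-in; the paper's exact definition may differ.

```python
import numpy as np

def foa_direction(foa):
    """Estimate a dominant source direction from an FOA clip.

    `foa` has shape (4, num_samples) with channels W, X, Y, Z. The
    time-averaged intensity vector (W*X, W*Y, W*Z) points toward the
    active source.
    """
    w, x, y, z = foa
    intensity = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    return intensity / (np.linalg.norm(intensity) + 1e-8)

def delta_angular(generated_foa, reference_foa):
    """Angle in degrees between generated and reference source directions."""
    d_gen = foa_direction(generated_foa)
    d_ref = foa_direction(reference_foa)
    cos_angle = np.clip(np.dot(d_gen, d_ref), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```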

Project Homepage

https://omniaudio-360v2sa.github.io/

Code and Data Open Source Repository

https://github.com/liuhuadai/OmniAudio

Paper Link

https://arxiv.org/abs/2504.14906