The AI Voice Revolution Has Arrived! Tencent's Secret Technology Makes Machines Become Top Storytellers in a Flash, One Sentence Generates Hollywood-Level Sound Effects

AIbase基地

Published in AI News · 7 min read · Aug 29, 2025

Another shock in the tech world! The AudioStory technology recently released by Tencent ARC Lab has completely revolutionized our understanding of AI audio generation. This is no longer just about "making a cat meow" or "playing the sound of raindrops"; it is about teaching machines the art of storytelling.

When you casually say, "Mystery chase: footsteps splashing in water, thunder roaring, car skidding, and a door slamming shut," AudioStory can instantly create a cinematic-level audio feast for you. This ability was previously unimaginable, as traditional AI models were like musicians who could only play a single instrument, unable to handle the complex arrangement of an entire symphony.

AudioStory was built to conquer precisely this seemingly impossible task. The research team at Tencent ARC Lab, including top scientists such as Yuxin Guo, Teng Wang, and Yuying Ge, cleverly integrated large language models with text-to-audio systems, creating a super brain specialized in long-form narrative audio generation.

The core weapon of this system is the "divide and conquer" strategy. When faced with a complex story description, AudioStory first uses a multimodal large language model as its "rational brain," breaking the entire narrative down into a series of ordered audio events. For example, the chase scene would be broken down into: footsteps splashing to create a tense atmosphere, thunder roaring to add pressure, a car skidding to bring the crisis to its climax, and a door slamming shut to mark the end of the chase. Each event comes with detailed time, emotion, and scene instructions.
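To make the decomposition concrete, here is a minimal sketch of what such an ordered event plan could look like as data. It is only an illustration: the `AudioEvent` fields, the `schedule` helper, and all durations and labels are assumptions made up for this example, not AudioStory's actual format, and the plan is written by hand rather than produced by a language model.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    caption: str        # what should be heard
    start_s: float      # when the event begins, in seconds
    duration_s: float   # how long it lasts, in seconds
    emotion: str        # mood instruction for the generator
    scene: str          # where the event takes place


def schedule(steps):
    """Place events back to back, in order, and return them as a timeline."""
    timeline, cursor = [], 0.0
    for caption, duration_s, emotion, scene in steps:
        timeline.append(AudioEvent(caption, cursor, duration_s, emotion, scene))
        cursor += duration_s
    return timeline


# Hand-written plan for the chase example; durations and labels are invented.
chase = schedule([
    ("footsteps splashing in water", 4.0, "tense", "rainy street"),
    ("thunder roaring", 3.0, "oppressive", "rainy street"),
    ("car skidding", 5.0, "climactic", "rainy street"),
    ("door slamming shut", 2.0, "final", "doorway"),
])

for event in chase:
    print(f"{event.start_s:5.1f}s  {event.caption} ({event.emotion})")
```

Each entry could then be handed to an audio generator one at a time, which is the point of "divide and conquer": the generator only ever has to produce one short, well-described event.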

Even more astonishing is AudioStory's "decoupled connection mechanism." Traditional models are like two people speaking different languages trying to communicate, with only a clumsy translator in between. AudioStory, however, designs a precise "bilingual bridge": semantic tokens convey the macro meaning of the story, while residual tokens specifically capture subtle audio textures. When rain needs to show a change from fine to intense, or when thunder needs to gradually approach from afar, these subtle layers can be perfectly reproduced.
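The split between "macro meaning" and "subtle texture" can be pictured with a toy numeric analogy. This is not AudioStory's architecture: the functions `split_features` and `merge_features` and the intensity values are hypothetical, and real semantic and residual tokens are learned vectors rather than rounded numbers. The sketch only shows why a second stream is useful: it carries exactly the detail the coarse stream drops.

```python
import math


def split_features(target):
    """Split each value into a coarse part and the detail the coarse part misses."""
    semantic = [float(math.floor(x)) for x in target]
    residual = [x - s for x, s in zip(target, semantic)]
    return semantic, residual


def merge_features(semantic, residual):
    """Recombine the two streams into one conditioning signal."""
    return [s + r for s, r in zip(semantic, residual)]


# Invented values standing in for rain that grows from fine to intense.
rain_intensity = [0.25, 1.75, 2.5]

semantic, residual = split_features(rain_intensity)
print("coarse stream:", semantic)
print("detail stream:", residual)
print("recombined:   ", merge_features(semantic, residual))
```

With the coarse stream alone, the gradual build-up of the rain is flattened into whole steps; adding the detail stream restores the original curve.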

The training process is also ingeniously designed, using a three-stage progressive strategy. The first stage lets the model master basic single-audio generation, the second stage develops its ability to understand and generate audio jointly, and the third stage is the ultimate challenge: unified processing of long-form narrative audio. This step-by-step approach ensures that the model maintains high audio quality while demonstrating strong narrative skills when facing complex tasks.
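The three-stage schedule can be sketched as a simple ordered configuration. Everything here is an assumption for illustration: the `STAGES` list, the task names, and the `run_curriculum` and `describe` helpers are hypothetical, and no real training happens. The sketch only captures the ordering described above, where each stage builds on the one before it.

```python
# Hypothetical curriculum: stage numbers and goals follow the article,
# the task lists are illustrative guesses.
STAGES = [
    {"stage": 1, "goal": "single audio generation",
     "tasks": ["generation"]},
    {"stage": 2, "goal": "joint understanding and generation",
     "tasks": ["understanding", "generation"]},
    {"stage": 3, "goal": "long-form narrative audio",
     "tasks": ["understanding", "generation", "long_form_narrative"]},
]


def run_curriculum(train_stage):
    """Call train_stage once per stage, in order, and collect the results."""
    return [train_stage(cfg) for cfg in STAGES]


def describe(cfg):
    """Stand-in for a real training step: just report what would be trained."""
    return f"stage {cfg['stage']}: {cfg['goal']} ({', '.join(cfg['tasks'])})"


log = run_curriculum(describe)
for line in log:
    print(line)
```

In a real setup, `describe` would be replaced by a function that trains the model on the listed tasks and returns a checkpoint for the next stage to start from.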

Test results are equally impressive. The research team specially built the AudioStory-10K benchmark dataset, containing ten thousand meticulously annotated narrative audio samples, ranging from real natural sounds to cartoon animation sound effects. Faced with this "ultimate exam," AudioStory demonstrated overwhelming strength: its instruction-following capability is 17.85% higher than that of competitors, it leads across the board in audio quality and duration matching, and most importantly, it performs strongly on the consistency and coherence metrics.

The application prospects are also exciting. The video dubbing feature allows AI to instantly become a professional film score composer. Just upload a silent video and describe the desired sound effect style, and AudioStory can automatically analyze the video content and generate background tracks that are completely synchronized and stylistically consistent. The audio continuation feature is even more imaginative. Given a coach's voice during a basketball training session, it can intelligently infer the subsequent scenes and automatically add reasonable audio continuations such as player footsteps and basketball bouncing sounds.

The significance of AudioStory goes beyond the technical breakthrough itself. It paves the way for application fields such as AI audiobooks, smart podcasts, and immersive game sound effects, allowing machines to truly possess the artistic literacy of a "storyteller." When AI can transform text, images, or even short audio clips into emotionally rich audio epics, just like an experienced voice director, we are witnessing a major leap forward in artificial intelligence towards a more humanized and artistic direction.

The birth of this technology marks the beginning of a new era in the field of text-to-audio. From simple sound imitation to complex narrative weaving, AudioStory proves through its strength the infinite potential of AI in creative expression.

Paper link: https://arxiv.org/pdf/2508.20088

This article is from AIbase Daily
