The Tencent ARC team recently released AudioStory, a model that uses large language models (LLMs) to generate long-form narrative audio. Existing text-to-audio systems handle short clips well; AudioStory targets the two problems that remain for long narratives: temporal coherence and compositional reasoning.

The core of AudioStory is a unified framework for understanding and generation. The model handles tasks such as video dubbing, audio continuation, and long narrative audio synthesis. By coupling an LLM with an audio generation system, it produces structured, temporally coherent audio narratives. It also follows instructions and reasons over them: it breaks a complex narrative query into subtasks arranged in chronological order, while keeping scene transitions continuous and the emotional tone consistent.
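To make the idea of chronologically ordered subtasks concrete, here is a minimal sketch in Python. It is not taken from the AudioStory codebase: the `AudioEvent` class, the `plan_timeline` function, the captions, and the durations are all hypothetical. It assumes the narrative has already been split into sub-events and only shows how those events could be laid out back to back on a single timeline.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    """One sub-task of a narrative (hypothetical structure, not AudioStory's)."""
    caption: str          # text description of the sound event
    duration_s: float     # requested length in seconds
    start_s: float = 0.0  # filled in by plan_timeline


def plan_timeline(events):
    """Place events back to back in the given order; return the total length."""
    cursor = 0.0
    for event in events:
        if event.duration_s <= 0:
            raise ValueError(f"duration must be positive: {event.caption!r}")
        event.start_s = cursor
        cursor += event.duration_s
    return cursor


# Illustrative sub-events for a cartoon chase scene (made-up values).
story = [
    AudioEvent("a cat tiptoes across a wooden floor", 4.0),
    AudioEvent("a mouse squeaks and scurries away", 3.0),
    AudioEvent("pots crash as the chase reaches the kitchen", 5.0),
]
total_s = plan_timeline(story)

for event in story:
    print(f"{event.start_s:5.1f}s  {event.caption}")
```

In a real system each event would then be handed to an audio generator, with extra conditioning to keep transitions and tone consistent across events; that part is outside the scope of this sketch.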

AudioStory has two notable features. The first is a decoupled bridging mechanism, which splits the collaboration between the LLM and the audio generator into two specialized components. The second is end-to-end training, which unifies instruction understanding and audio generation and improves the synergy between components.

The research team has also built a benchmark dataset, AudioStory-10K, covering diverse domains such as animated soundscapes and natural sound narratives. In the team's experiments, AudioStory outperforms previous text-to-audio models on both single-audio generation and narrative audio generation, in instruction following as well as audio quality.

The team has released the model's inference code along with a series of demonstration videos, including a dubbing example for the classic animation "Tom and Jerry" and examples of long audio generated from text, which together illustrate the range of applications the model supports.

Project: https://github.com/TencentARC/AudioStory

Key Points:  

🎧 **AudioStory is a long-form narrative audio generation model developed by Tencent ARC, combining large language models and audio generation technology.**  

📊 **The model follows instructions closely and generates coherent audio narratives, which improves the user experience.**  

🛠️ **The team has released inference code and demonstrated multiple application cases, including video dubbing and long audio generation.**