Microsoft recently released VibeVoice, an open-source text-to-speech (TTS) model that has drawn significant attention in the AI voice technology field. The model targets three areas in particular: long-form speech generation, multi-speaker dialogue, and Chinese speech synthesis. Below, AIbase analyzes the highlights and potential of VibeVoice.
Supports up to 90 minutes of continuous speech generation
VibeVoice can generate up to 90 minutes of continuous speech in a single pass. This suits scenarios that need long audio output, such as podcasts, audiobooks, and educational content. Compared with the tighter duration limits of traditional TTS models, this long-form capability gives content creators more flexibility and creative room.
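To make the 90-minute figure concrete, the following is a rough sketch of how a creator might check whether a script is likely to fit in one pass. It does not call VibeVoice. The speaking rate of 150 words per minute and the function names are assumptions chosen for illustration, and whitespace-based word counting only works for languages such as English (it is not suitable for Chinese text).

```python
# Rough planning helper; it does not use or depend on VibeVoice itself.
ARTICLE_LIMIT_MINUTES = 90       # single-pass limit stated in the article
ASSUMED_WORDS_PER_MINUTE = 150   # assumption: a typical English speaking rate


def estimated_minutes(text: str, words_per_minute: float = ASSUMED_WORDS_PER_MINUTE) -> float:
    """Estimate spoken duration from a whitespace-delimited word count."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return len(text.split()) / words_per_minute


def fits_single_pass(
    text: str,
    limit_minutes: float = ARTICLE_LIMIT_MINUTES,
    words_per_minute: float = ASSUMED_WORDS_PER_MINUTE,
) -> bool:
    """Return True if the estimated duration is within the limit."""
    return estimated_minutes(text, words_per_minute) <= limit_minutes
```

Under these assumptions, a 13,500-word script sits exactly at the 90-minute mark, so anything longer would need to be split into parts.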
Multi-speaker dialogue with up to four voices
Unlike previous TTS models, which were limited to one or two speakers, VibeVoice can generate conversations involving up to four speakers. This is useful for simulating multi-person podcasts, meeting recordings, or virtual character interactions. Because the model is optimized for voice consistency and natural turn-taking, the resulting dialogue sounds smooth and comes close to real human recordings.
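The four-speaker limit is easy to check before sending a script to any TTS model. The sketch below parses a dialogue script and rejects it if it names more than four speakers. The "Name: utterance" line format and both function names are hypothetical, used here only for illustration; the article does not describe VibeVoice's actual input format.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the article

# Hypothetical script format: one "Name: utterance" turn per line.
LINE_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+)$")


def parse_script(script: str) -> list[tuple[str, str]]:
    """Split a script into (speaker, utterance) turns, skipping blank lines."""
    turns = []
    for line in script.splitlines():
        if not line.strip():
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Line has no speaker label: {line!r}")
        turns.append((match.group(1), match.group(2)))
    return turns


def check_speaker_limit(turns: list[tuple[str, str]], max_speakers: int = MAX_SPEAKERS) -> list[str]:
    """Return distinct speakers in order of first appearance, or raise if over the limit."""
    speakers: list[str] = []
    for name, _ in turns:
        if name not in speakers:
            speakers.append(name)
    if len(speakers) > max_speakers:
        raise ValueError(f"{len(speakers)} speakers found; at most {max_speakers} are supported")
    return speakers
```

Validating up front like this keeps a five-person script from failing partway through a long generation run.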
Strong Chinese speech quality, supporting localized applications
VibeVoice also performs well for the Chinese market. It supports Chinese speech synthesis and handles tone, pronunciation accuracy, and naturalness at a high level. This gives VibeVoice broad potential in Chinese podcasts, education and training, and intelligent customer service, and offers developers a high-quality localized voice solution.
Supports background music for a more immersive podcast experience
Another highlight of VibeVoice is its ability to generate podcast audio with background music. This lets content creators produce speech accompanied by background sound, giving the result a more immersive and professional feel. Whether the backdrop is a relaxing melody or a tense atmospheric effect, the combined audio can offer listeners a richer experience.
Open source empowers developers, with broad application prospects
As an open-source model, VibeVoice was officially released on GitHub on August 26, 2025, and developers are free to obtain it and build on it. By open-sourcing the model, Microsoft lowers the barrier to using high-quality TTS technology and brings new energy to the global AI developer community. Both individual creators and enterprise users can use VibeVoice to build voice applications quickly.
Address: https://huggingface.co/microsoft/VibeVoice-1.5B