Microsoft has recently open-sourced VibeVoice, a family of voice AI models covering automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly drawn attention in the developer community for its long-audio processing, natural multi-speaker dialogue generation, and low-latency real-time synthesis, and it has already accumulated approximately 27K stars on GitHub.
As an open-source research framework, VibeVoice is released under the MIT license, supports local deployment, and requires no cloud subscription fees. It aims to promote collaboration and innovation in the field of speech synthesis. The family consists of three core models, each with its own focus, which together address pain points in traditional voice AI such as long-sequence processing, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: Structured Speech-to-Text for Audio Up to 60 Minutes Long
VibeVoice-ASR-7B is a unified speech-to-text model that can process audio files up to 60 minutes long in a single pass and directly output a structured transcription. The output records "who is speaking" (speaker identification), "when it was spoken" (timestamps), and "what was said" (the transcribed content). The model also accepts custom hotwords, which improves recognition accuracy for proper nouns and technical terms. It supports over 50 languages and is suited to complex scenarios such as long meeting recordings and podcast transcription.
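To make the "who, when, what" structure concrete, here is a minimal sketch of how a downstream tool might hold and render such a transcript. The `Segment` class, its field names, and the sample data are hypothetical and chosen for illustration; they are not the model's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed turn: who spoke, when, and what (hypothetical schema)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, enough for recordings up to 60 minutes."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Turn segments into a readable, time-ordered transcript."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker 2", 12.5, 20.0, "Thanks, let's start with the roadmap."),
    Segment("Speaker 1", 0.0, 12.5, "Welcome everyone to the weekly sync."),
]
print(render(segments))
print(talk_time(segments))
```

Keeping speaker, time range, and text together in one record is what makes follow-up tasks such as per-speaker summaries or subtitle export straightforward.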
Community developers have already created practical tools based on this model, such as a voice input method called Vibing, which supports macOS and Windows platforms. User feedback shows that it performs well in terms of speed and accuracy, significantly improving daily voice input efficiency.
VibeVoice-TTS-1.5B: Expressive Speech Generation for Up to 90 Minutes with Multiple Speakers
VibeVoice-TTS-1.5B is the core text-to-speech model. It can produce up to 90 minutes of continuous audio in a single generation and supports up to four distinct speakers for natural dialogue simulation. The generated speech is expressive and fluent, with realistic pauses, emphasis, and emotional shifts, making it well suited to podcasts, long-form narration, audiobooks, and multi-character dialogues.
Compared with many traditional TTS models, which support only one or two speakers, VibeVoice-TTS makes significant progress in long-form generation and multi-speaker consistency. Its underlying design uses continuous speech tokenizers (acoustic and semantic) that operate at a low frame rate of 7.5 Hz, which significantly improves computational efficiency on long sequences.
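A quick back-of-the-envelope calculation shows why the frame rate matters for long audio. The sketch below counts how many frames are needed to cover 90 minutes at 7.5 Hz. The 50 Hz comparison rate is an assumption picked purely for contrast; it does not come from the project or describe any specific model.

```python
def frames_needed(duration_minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover the given duration."""
    return round(duration_minutes * 60 * frame_rate_hz)


VIBEVOICE_RATE_HZ = 7.5     # frame rate stated in the article
COMPARISON_RATE_HZ = 50.0   # assumed rate for comparison only, not from the article

long_form = frames_needed(90, VIBEVOICE_RATE_HZ)
comparison = frames_needed(90, COMPARISON_RATE_HZ)

print(long_form)               # 40500
print(comparison)              # 270000
print(comparison / long_form)  # about 6.67
```

Under these assumptions, a 90-minute generation spans roughly 40,500 frames rather than 270,000, so the sequence the model has to handle is several times shorter.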
VibeVoice-Realtime-0.5B: Real-Time TTS with Approximately 300 Milliseconds Latency
VibeVoice-Realtime-0.5B targets real-time scenarios. It accepts streaming text input and delivers its first audio output after about 300 milliseconds, while still being able to generate audio up to 10 minutes long. This makes it particularly suitable for interactive applications that need immediate responses, such as real-time voice assistants or live-stream dubbing.
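The latency figure quoted here is the time until the first audio chunk arrives, not the time to finish the whole utterance. The sketch below shows one way to measure that in a streaming pipeline. `fake_streaming_tts` is a hypothetical stand-in that simply echoes bytes; it is not the VibeVoice API, and a real integration would replace it with the model's own streaming interface.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer: one audio chunk per text chunk."""
    for chunk in text_chunks:
        # Placeholder "audio": real code would yield PCM samples from the model.
        yield chunk.encode("utf-8")


def time_to_first_audio(
    audio_stream: Iterator[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (seconds until the first audio chunk arrived, that chunk)."""
    started = clock()
    first_chunk = next(audio_stream)  # blocks until the first chunk is ready
    return clock() - started, first_chunk


stream = fake_streaming_tts(["Hello, ", "how can I help?"])
latency, first = time_to_first_audio(stream)
print(f"first audio after {latency * 1000:.1f} ms, {len(first)} bytes")

# The rest of the stream can be consumed (for example, played back) as it arrives.
remaining = list(stream)
```

Passing the clock in as a parameter keeps the measurement testable; with a real synthesizer, the default `time.perf_counter` would report the actual first-chunk delay.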
In addition, the project has introduced experimental speaker support, including multilingual voices and several English style variations, giving developers more customization options.
AIbase Review: By open-sourcing VibeVoice, Microsoft not only lowers the entry barrier for high-performance voice AI but also provides a complete solution for local deployment. The project was temporarily taken down over potential misuse risks and was relaunched after audio watermarks and audible disclaimers were embedded as safeguards, reflecting the principles of responsible AI development. Developers can currently obtain the model weights from the GitHub repository and Hugging Face and try them out quickly on platforms such as Colab.
With continued contributions from the open-source community (such as optimizations for Apple Silicon), VibeVoice is expected to see faster adoption in areas such as content creation, accessibility tools, and voice interaction. Interested developers can visit Microsoft's official project page to explore further.
Project Address: https://github.com/microsoft/VibeVoice

