Microsoft Research recently open-sourced its latest audio model, VibeVoice-1.5B. The model makes several notable advances in speech synthesis: the generated speech sounds more natural, runs longer, and is of higher quality.
VibeVoice-1.5B can synthesize up to 90 minutes of continuous speech in a single pass, which is rare among earlier speech synthesis models. Most previous models were limited to under 60 minutes and often showed voice drift and semantic disconnection beyond the 30-minute mark. The model also supports up to four speakers, a significant improvement in multi-speaker synthesis, whereas previous open-source models supported at most two. In addition, VibeVoice compresses 24 kHz raw audio by a factor of 3,200, greatly improving compression efficiency while maintaining high-fidelity speech quality.
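To make the compression figure concrete, here is a small back-of-the-envelope sketch. It assumes the 3,200x figure describes the reduction in the number of time steps (audio samples in, compressed frames out); the article does not spell this out, and the function names are illustrative only.

```python
# Figures taken from the article.
SAMPLE_RATE_HZ = 24_000      # 24 kHz raw audio
COMPRESSION_FACTOR = 3_200   # 3,200x compression

def frames_per_second(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      factor: int = COMPRESSION_FACTOR) -> float:
    """Compressed frames per second of audio.

    Assumption: the compression factor is a ratio of time steps
    (raw samples per compressed frame), not of bytes.
    """
    return sample_rate_hz / factor

def frames_for_duration(minutes: float) -> int:
    """Compressed frames needed to cover a clip of the given length."""
    return round(minutes * 60 * frames_per_second())

if __name__ == "__main__":
    print(frames_per_second())      # 7.5
    print(frames_for_duration(90))  # 40500
```

Under that assumption, a full 90-minute session is represented by roughly 40,500 frames rather than about 130 million raw samples, which helps explain why such long sequences become tractable.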
At the core of VibeVoice is a dual-tokenizer architecture. Unlike traditional TTS models that rely on a single tokenizer to extract features, VibeVoice pairs an acoustic tokenizer with a semantic tokenizer to address mismatches between voice and meaning. The acoustic tokenizer focuses on preserving voice characteristics while achieving extreme compression. The semantic tokenizer extracts features consistent with the meaning of the text, so that the emotional tone of the synthesized speech matches the content.
For training, VibeVoice adopts a curriculum learning strategy: the input sequence length is increased gradually, which avoids the training failures that can occur when a model is fed ultra-long sequences from the start. Throughout training, the parameters of the acoustic and semantic tokenizers are kept frozen, which keeps the feature extraction modules stable and shortens the training cycle.
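The length curriculum described above can be sketched as a simple step-based schedule. This is a minimal illustration, not the VibeVoice training code: the function name, the stage lengths, and the even split of steps across stages are all assumptions chosen for the example.

```python
from typing import Sequence

def curriculum_max_length(step: int, total_steps: int,
                          stages: Sequence[int]) -> int:
    """Return the maximum sequence length allowed at a training step.

    Training is split evenly into len(stages) phases, and each phase
    uses the next (longer) length. Hypothetical helper for illustration.
    """
    if total_steps <= 0 or not stages:
        raise ValueError("total_steps must be positive and stages non-empty")
    # Clamp so out-of-range steps map to the first or last stage.
    step = min(max(step, 0), total_steps - 1)
    stage_index = step * len(stages) // total_steps
    return stages[stage_index]

if __name__ == "__main__":
    # Illustrative stage lengths (in tokens); not taken from the article.
    stages = [4_096, 16_384, 65_536]
    for step in (0, 99, 100, 299):
        print(step, curriculum_max_length(step, 300, stages))
```

With 300 steps and three stages, steps 0 to 99 use the shortest length, steps 100 to 199 the middle one, and the rest the longest. In a PyTorch setup, keeping the tokenizers frozen would typically mean calling `requires_grad_(False)` on those modules so that only the remaining parameters are updated.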
The open-sourcing of VibeVoice-1.5B brings new technical advances to speech synthesis and lays the groundwork for future releases of models with more parameters. For researchers and developers working on audio processing and speech synthesis, it is a development worth following.
Open source address: https://huggingface.co/microsoft/VibeVoice-1.5B
Online demo: https://aka.ms/VibeVoice-Demo
Key points:
🔊 VibeVoice-1.5B can synthesize up to 90 minutes of speech in a single pass and supports up to four speakers.
💾 The model compresses 24 kHz audio by a factor of 3,200 while maintaining high-fidelity speech quality.
🤖 It uses a dual-tokenizer architecture (acoustic plus semantic) to address mismatches between voice and meaning.