Recently, the Index team at Bilibili (B站) announced the full open-source release of its in-house text-to-speech (TTS) system, IndexTTS-2.0. The system offers controllable emotion and adjustable duration, marking an important step toward the practical application of zero-shot TTS technology.


Duration control and emotional expressiveness have long been difficult problems in speech synthesis. IndexTTS-2.0 addresses them with two core innovations. The first is a time-encoding mechanism, applied for the first time in an autoregressive TTS architecture, which greatly improves the accuracy of speech-duration control and allows precise control over speech rhythm while keeping the generated speech stable and natural. The second is disentangled modeling of voice and emotion: because voice identity and emotional style are modeled separately, users can control emotion through any of several methods, including a single reference audio, a separate emotional reference audio, an emotion vector, or a text description. This flexibility significantly improves the expressiveness of synthesized speech and covers a wide range of emotional-expression needs.

According to the official examples, IndexTTS-2.0 can be widely applied to AI dubbing, audiobooks, animated comics, video translation, voice dialogue, and podcast production, expanding the boundaries of speech synthesis technology. For international content distribution in particular, IndexTTS-2.0 provides important technical support, enabling cross-language videos to deliver a nearly "difference-free" localized experience. Whether Chinese users are watching foreign content or overseas users are watching Chinese videos, they can enjoy a natural, immersive listening experience that preserves the original speaker's voice and emotion. This breakthrough lowers the barrier for high-quality content to spread across languages and lays a solid foundation for deploying AIGC technology globally.

The project's paper, complete code, model weights, and online demo page have all been released. The IndexTTS team says it will continue to optimize model performance and work with the developer community to build a speech-technology ecosystem for multilingual communication and global cultural exchange.

Online demo address:

https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo

Key points:

🌟 B站's IndexTTS-2.0 system is fully open source, with controllable emotion and adjustable duration.

🕒 The system introduces a time-encoding mechanism and disentangled modeling, improving the naturalness and expressiveness of speech synthesis.

🌍 The system supports international content distribution, offering a better localized experience for cross-language videos.