Xiaohongshu's ZhiChuang Audio Technology Team recently launched FireRedTTS-2, its next-generation dialogue synthesis model and another significant step forward in dialogue generation technology. The model targets common pain points in existing dialogue synthesis systems, such as limited flexibility, frequent pronunciation errors, unstable speaker switching, and unnatural prosody.


FireRedTTS-2 upgrades its core modules, in particular the discrete speech encoder and the text-to-speech model, to improve synthesis quality across the board. In multiple objective and subjective evaluations, FireRedTTS-2 delivers industry-leading performance, offering a stronger solution for multi-speaker dialogue synthesis. The technical report has been published on arXiv, and the model can be tried through the official demo and the code repository linked below.

A standout feature of FireRedTTS-2 is the naturalness of its output. The model accurately captures details such as stress, emotion, and pauses, producing natural, fluent audio. Compared with closed-source dialogue generation models, FireRedTTS-2 not only generates high-quality podcast audio but also supports voice cloning: given just one sentence of sample speech from each speaker, it imitates their voice and speaking habits to generate an entire dialogue automatically (see the sketch below). This capability makes it highly competitive among open-source dialogue generation models.
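To make the one-shot cloning workflow concrete, here is a minimal usage sketch. The package path, the `FireRedTTS2` class, and the `generate_dialogue` method are illustrative assumptions for this article, not the repository's confirmed interface; consult the GitHub code link for the actual API.

```python
# Illustrative sketch only: the package, class, and method names below are
# assumptions for demonstration, not the repository's confirmed API.
import torchaudio

from fireredtts2 import FireRedTTS2  # hypothetical import path

# Load a pretrained dialogue model (checkpoint directory is a placeholder).
model = FireRedTTS2(pretrained_dir="./pretrained_models/FireRedTTS2")

# One short reference clip per speaker is enough for voice cloning.
prompt_wavs = {
    "S1": "speaker1_prompt.wav",  # a single sentence from speaker 1
    "S2": "speaker2_prompt.wav",  # a single sentence from speaker 2
}

# Dialogue script with speaker tags; the model imitates each cloned voice.
script = [
    ("S1", "Welcome back to the show. Today we're talking about speech synthesis."),
    ("S2", "Thanks for having me. Let's start with why prosody is so hard to get right."),
]

# Generate the full multi-speaker dialogue as a single waveform and save it.
audio, sample_rate = model.generate_dialogue(script=script, prompts=prompt_wavs)
torchaudio.save("podcast_demo.wav", audio, sample_rate)
```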

FireRedTTS-2 is trained on multiple languages, including Chinese, English, Japanese, Korean, and French. It uses a low-frame-rate discrete speech encoder to improve synthesis speed and stability, and its dual-Transformer architecture makes the synthesized speech more natural and coherent (a conceptual sketch follows). In addition, FireRedTTS-2 needs only a small amount of data for voice customization, so it adapts quickly to different application scenarios.
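As a rough mental model of the dual-Transformer idea (not the paper's exact architecture), a backbone transformer models the low-frame-rate token sequence over time, while a small per-frame transformer expands each frame into multiple codebook layers. The PyTorch module below is a minimal sketch under those assumptions; all dimensions, layer counts, and names are illustrative.

```python
# Conceptual sketch of a dual-transformer speech-token decoder: a backbone
# transformer runs across frames, and a small per-frame transformer expands
# each frame state into several codebook layers. Illustrative only; a real
# autoregressive model would also need causal masking and sampling logic.
import torch
import torch.nn as nn


class DualTransformerSketch(nn.Module):
    def __init__(self, vocab=8192, n_codebooks=8, d_model=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, d_model)
        # Backbone: models the (low-frame-rate) sequence of frames.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Per-frame decoder: expands one backbone state into all codebooks.
        self.frame_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.codebook_query = nn.Parameter(torch.randn(n_codebooks, d_model))
        self.head = nn.Linear(d_model, vocab)

    def forward(self, token_ids):
        # token_ids: (batch, frames) of interleaved text/speech tokens.
        h = self.backbone(self.token_emb(token_ids))  # (B, T, D)
        B, T, D = h.shape
        # Condition one set of codebook queries on each frame state.
        q = self.codebook_query.unsqueeze(0).expand(B * T, -1, -1)
        frame_h = h.reshape(B * T, 1, D)
        out = self.frame_decoder(torch.cat([frame_h, q], dim=1))
        logits = self.head(out)[:, 1:]  # drop the frame-state slot
        # (B, T, n_codebooks, vocab): per-codebook token logits per frame.
        return logits.reshape(B, T, -1, logits.size(-1))
```

The design intuition is that the expensive backbone only has to run once per low-frame-rate frame, while the cheap per-frame decoder handles the remaining codebook detail, which is one way a low token rate can translate into faster, more stable synthesis.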

The release of FireRedTTS-2 provides an industrial-grade solution for AI podcasts and other dialogue synthesis applications, and it opens up new possibilities for innovation both inside and outside the industry. Going forward, the team plans to keep optimizing the model, expand the number of supported speakers and languages, and explore controllable sound-effect insertion to meet growing market demand.

  • Code link: https://github.com/FireRedTeam/FireRedTTS2 

Key Points:

🎤 FireRedTTS-2 is the next-generation dialogue synthesis model from Xiaohongshu's ZhiChuang Audio Technology Team, aiming to improve synthesis quality and naturalness.  

🗣️ The model supports voice cloning, generating natural multi-speaker dialogues from only a short speech sample per speaker.  

🌐 Supports multiple languages and uses a low-frame-rate discrete speech encoder, improving synthesis speed and stability across a range of application scenarios.