MOSS-TTSD (Text to Spoken Dialogue), developed by the Tsinghua University Speech and Language Laboratory (Tencent AI Lab) in collaboration with Shanghai Chuangzhi College, Fudan University, and Musi Intelligent, has been officially open-sourced, marking a major step forward for AI speech synthesis in dialogue scenarios.
The spoken dialogue generation model is built on Qwen3-1.7B-base and further trained on roughly one million hours of single-speaker speech data and 400,000 hours of dialogue speech data. Using discrete speech sequence modeling, it generates highly expressive spoken dialogue in both Chinese and English, making it especially well suited to long-form content creation such as AI podcasts, audiobooks, and film and television dubbing.
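To make the discrete-sequence pipeline concrete, the sketch below shows the typical flow for this class of speech language model: a dialogue script goes in as text, a causal language model autoregressively emits speech tokens, and a tokenizer decoder (here, the XY-Tokenizer shipped with the repo) turns those tokens back into audio. This is a minimal sketch under assumptions: whether the checkpoint loads directly through the standard transformers API, and the exact speaker-tag prompt format, are illustrative guesses; the officially supported inference path is in the GitHub repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fnlp/MOSS-TTSD-v0.5"  # Hugging Face repo cited in this article

# Assumption: the released checkpoint is loadable via the generic
# transformers entry points; the repo's own inference script is the
# authoritative way to run the model.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Hypothetical two-speaker script; the real input format is defined by
# the repo's inference code and may differ.
script = "[S1] Welcome back to the show. [S2] Thanks, glad to be here."
inputs = tok(script, return_tensors="pt")

# The model autoregressively predicts discrete speech tokens.
speech_tokens = model.generate(**inputs, max_new_tokens=512)
# Turning speech tokens into a waveform requires the XY-Tokenizer
# decoder from the GitHub repo (not shown here).
```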
The core innovation of MOSS-TTSD is its XY-Tokenizer, which adopts a two-stage multi-task learning approach. Using eight RVQ codebooks, it compresses the speech signal to a bitrate of 1 kbps while preserving both semantic and acoustic information, keeping the generated speech natural and fluent. The model supports ultra-long speech generation of up to 960 seconds, avoiding the unnatural transitions that segment stitching causes in traditional TTS models. MOSS-TTSD also offers zero-shot voice cloning, reproducing both speakers in a dialogue from either a full two-speaker recording or separate single-speaker clips, and supports vocal event control, such as laughter, for added expressiveness.
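The 1 kbps figure follows directly from the tokenizer's design parameters. A back-of-the-envelope check, assuming a 12.5 Hz token frame rate and 1024-entry codebooks (both are assumptions added here to make the arithmetic concrete; only the eight-codebook RVQ depth is stated in this article):

```python
import math

num_codebooks = 8      # RVQ depth, stated in the article
codebook_size = 1024   # entries per codebook -> 10 bits per code (assumed)
frame_rate_hz = 12.5   # token frames per second of audio (assumed)

bits_per_code = math.log2(codebook_size)               # 10 bits
bitrate_bps = num_codebooks * bits_per_code * frame_rate_hz
print(f"{bitrate_bps:.0f} bps")  # 1000 bps, i.e. the 1 kbps in the article

# Under the same assumptions, 960 s of audio is a tractable sequence:
max_seconds = 960
codes = int(max_seconds * frame_rate_hz * num_codebooks)
print(f"{codes} codes")  # 12,000 frames x 8 codebooks = 96,000 codes
```

The low bitrate is what makes ultra-long generation feasible: the flattened token sequence for a 16-minute dialogue stays within a language model's practical context budget.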
Compared with other voice models on the market, MOSS-TTSD clearly outperforms the open-source MoonCast on objective Chinese-language metrics, with strong prosody and naturalness, though it still trails ByteDance's Doubao voice model slightly in tone and rhythm. Even so, being open-source and free for commercial use gives MOSS-TTSD strong application potential. Model weights, inference code, and API interfaces are fully open-sourced via GitHub (https://github.com/OpenMOSS/MOSS-TTSD) and HuggingFace (https://huggingface.co/fnlp/MOSS-TTSD-v0.5), and official documentation and an online demo give developers convenient access.
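Since the weights are public, fetching the checkpoint takes a few lines with the huggingface_hub library (a real API; the repo ID is the one cited above):

```python
from huggingface_hub import snapshot_download

# Download the released MOSS-TTSD checkpoint to the local HF cache.
local_dir = snapshot_download(repo_id="fnlp/MOSS-TTSD-v0.5")
print(local_dir)  # path to the cached model weights
```

From there, the inference scripts in the GitHub repository are the supported way to run generation.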
The release of MOSS-TTSD brings new momentum to AI speech interaction, especially in scenarios such as long-form interviews, podcast production, and film and television dubbing, where its stability and expressiveness can accelerate intelligent content creation. Going forward, the team plans to further optimize the model, improving speaker-switching accuracy and emotional expression in multi-speaker scenarios.
Project address: https://github.com/OpenMOSS/MOSS-TTSD