MOSS-Speech Open Source: China's First Speech-to-Speech Large Model, Bypassing Text Intermediate
The MOSS team from Fudan University released MOSS-Speech, which realizes end-to-end speech dialogue for the first time. The model is now available and open-sourced on Hugging Face. It adopts a 'layer splitting' architecture, freezing the original text model and adding new layers for speech understanding, semantic alignment, and vocoder. It can complete speech Q&A, emotional imitation, and laughter generation in one step, without the traditional three-step process. Evaluation results show that the word error rate has been reduced to 4.1% in the ZeroSpeech2025 task, and the emotion recognition accuracy reached 91.2%.