Against the backdrop of rapid progress in speech synthesis technology, ModelBest and the Human-Computer Speech Interaction Laboratory at Tsinghua University's Shenzhen International Graduate School (THUHCSI) recently jointly released a new speech generation model, VoxCPM. With only 0.5B parameters, the model aims to deliver high-quality, natural-sounding speech synthesis.
The release of VoxCPM marks another milestone in high-fidelity speech generation. The model achieves industry-leading results on key metrics such as naturalness, speaker similarity, and prosodic expressiveness. Through zero-shot voice cloning, VoxCPM can reproduce a speaker's voice from only a short reference recording, enabling personalized speech synthesis. This opens up new application scenarios for speech generation, particularly personalized voice assistants and character voice acting.
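For developers, cloning starts from a short reference clip plus its transcript. Below is a minimal sketch using the project's voxcpm Python package; the parameter names follow the public README at the time of writing and may differ between versions, and the file paths are placeholders.

```python
# Minimal voice-cloning sketch with the voxcpm package (pip install voxcpm).
# Parameter names are taken from the project README and may change.
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="Hello, this is my personalized voice assistant.",
    prompt_wav_path="reference_speaker.wav",   # a few seconds of the target voice
    prompt_text="Transcript of the reference recording.",
)
sf.write("cloned_voice.wav", wav, 16000)  # 16 kHz output assumed
```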
VoxCPM is open-sourced on GitHub, Hugging Face, and ModelScope, and an online playground lets developers explore its capabilities directly. On the authoritative Seed-TTS-EVAL speech synthesis benchmark, the model achieves a remarkably low word error rate and high speaker similarity. It is also efficient at inference: on a single NVIDIA RTX 4090 GPU, VoxCPM reaches a real-time factor (RTF) of approximately 0.17, fast enough for high-quality real-time interaction.
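As a reminder, RTF is the ratio of synthesis time to the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A simple sketch of how one might measure it, reusing the model object from the snippet above (the 16 kHz sample rate is an assumption):

```python
# Illustrative RTF measurement: wall-clock synthesis time divided by the
# duration of the generated audio. `model` is the VoxCPM instance loaded above.
import time

start = time.perf_counter()
wav = model.generate(text="A sentence used to benchmark synthesis speed.")
elapsed = time.perf_counter() - start

audio_seconds = len(wav) / 16000   # samples / assumed 16 kHz sample rate
rtf = elapsed / audio_seconds
print(f"RTF: {rtf:.2f}")           # ~0.17 reported on a single RTX 4090
```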
VoxCPM stands out not only in raw performance but also in audio quality and emotional expression. The model selects appropriate voice, intonation, and prosody based on the text content, producing speech that approaches human delivery. Whether it is a weather report, a stirring speech, or a dialect-speaking broadcaster, VoxCPM renders it faithfully, offering an immersive listening experience.
In addition, VoxCPM's technical architecture is built on a diffusion-autoregressive speech generation paradigm, combining hierarchical language modeling with local diffusion over continuous representations, which significantly improves the expressiveness and naturalness of the generated speech. The core architecture consists of multiple cooperating modules that realize an efficient end-to-end "semantics-to-acoustics" generation process.
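To make that description concrete, here is a purely illustrative PyTorch sketch of how an autoregressive backbone and a local diffusion head can be coupled; it is not the released implementation, and every module name here is invented:

```python
# Conceptual sketch (NOT VoxCPM's actual code) of a diffusion-autoregressive
# pipeline: an autoregressive backbone produces a per-step semantic state,
# and a local diffusion head iteratively refines a continuous acoustic latent.
import torch
import torch.nn as nn

class DiffusionARSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.lm_backbone = nn.GRU(dim, dim, batch_first=True)   # stand-in for the LM
        self.diffusion_head = nn.Sequential(                    # stand-in denoiser
            nn.Linear(dim * 2, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, text_emb: torch.Tensor, steps: int = 10) -> torch.Tensor:
        # 1) Autoregressive pass: semantic planning over the text sequence.
        hidden, _ = self.lm_backbone(text_emb)
        # 2) Local diffusion: denoise a continuous latent step by step,
        #    conditioned on the backbone's hidden states.
        latent = torch.randn_like(hidden)
        for _ in range(steps):
            latent = latent + self.diffusion_head(torch.cat([latent, hidden], dim=-1))
        return latent  # continuous acoustic features, decoded to a waveform downstream

latents = DiffusionARSketch()(torch.randn(1, 20, 512))  # (batch, frames, dim)
```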
🔗 GitHub:
https://github.com/OpenBMB/VoxCPM/
🔗 Hugging Face:
https://huggingface.co/openbmb/VoxCPM-0.5B
🔗 ModelScope:
https://modelscope.cn/models/OpenBMB/VoxCPM-0.5B
🔗 Playground (Online Demo):
https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo
🔗 Audio Samples:
https://openbmb.github.io/VoxCPM-demopage