In the fast-moving field of artificial intelligence, Chinese large models are advancing at a striking pace. Earlier this year, DeepSeek-R1 outperformed OpenAI's models at an ultra-low cost, prompting a reassessment of the dominance of foreign large models. Now MiniMax has delivered more big news: its new-generation text-to-speech (TTS) model, "Speech-02," has topped the prestigious Artificial Analysis speech leaderboard, beating industry giants such as OpenAI and ElevenLabs.
Speech-02's strong performance shows up in several key metrics, including Word Error Rate (WER) and Speaker Similarity (SIM), where it sets new state-of-the-art (SOTA) results. This has stunned overseas netizens, who have praised MiniMax as a game-changer in the audio field. More surprisingly, Speech-02 costs only one-fourth as much as ElevenLabs' competing products, which makes it highly cost-effective.
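To make the two metrics concrete, here is a minimal sketch of how they are commonly computed. It assumes the usual definitions: WER is the word-level edit distance between a reference transcript and a recognizer's transcript of the synthesized audio, divided by the number of reference words, and SIM is the cosine similarity between speaker embeddings of the reference and synthesized audio. The function names and the toy inputs are illustrative assumptions; the article does not describe the actual evaluation pipeline, and real evaluations would obtain the transcript and embeddings from trained speech recognition and speaker verification models.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def speaker_similarity(emb_a, emb_b) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs (assumptions for illustration only).
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    print(speaker_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

A lower WER means the synthesized speech is more intelligible to the recognizer, while a SIM closer to 1 means the cloned voice is closer to the target speaker.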
So why has Speech-02 achieved such impressive results? Two key technical innovations lie behind it. First, Speech-02 achieves true zero-shot voice cloning: given only a segment of reference audio, with no accompanying text, the model can quickly generate speech that closely matches the target voice. This saves considerable time and resources, since earlier synthesis methods usually required a large amount of sample data.
Second, MiniMax adopted a new Flow-VAE architecture, which strengthens the model's representational capacity during speech generation and thereby improves the quality and similarity of the synthesized audio. By introducing a learnable speaker encoder, Speech-02 can focus on each speaker's distinctive pronunciation characteristics and precisely reproduce tone, intonation, and rhythm, avoiding the stiffness of traditional synthetic speech.
In addition, MiniMax introduced the T2V (text-to-voice) framework, which combines open-ended natural language descriptions with structured label information to make speech synthesis more flexible and controllable. Users can therefore not only supply reference audio but also generate the voice they want from a simple description, which greatly broadens the system's versatility.
The success of Speech-02 once again confirms the strength of Chinese large models in speech synthesis and shows the world how quickly China is rising in artificial intelligence technology.
Technical Documentation: https://minimax-ai.github.io/tts_tech_report/