MiniMax Audio's Speech-02 series of voice models has taken first place on both the Artificial Analysis Speech Arena and the Hugging Face TTS Arena, two widely followed leaderboards, outperforming top international competitors such as ElevenLabs and OpenAI. The models impress with their high fidelity and multi-language support, setting a new benchmark for AI voice technology. AIbase looks at the technical highlights of Speech-02 and its impact on the industry.

Top Rankings on Both Charts: Objective and Subjective Excellence

The Speech-02 series includes two models, Speech-02-HD and Speech-02-Turbo, optimized for high-fidelity and real-time applications respectively. By Elo score on the Artificial Analysis Speech Arena, Speech-02-HD tops the global list on voice quality, while Speech-02-Turbo ranks third. Blind test results from the Hugging Face TTS Arena also show Speech-02 ahead of the latest models from ElevenLabs and OpenAI in subjective user feedback, earning broad praise from the community.
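For readers unfamiliar with Elo-style leaderboards, the sketch below shows how a single blind pairwise vote moves two ratings under the standard Elo formula. It is an illustration only: the source does not describe either arena's exact rating method, and the function names, the K-factor of 32, and the starting ratings are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind pairwise vote.

    k is the update step size; 32 is a common default, assumed here.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - exp_a)
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; a listener prefers model A.
new_a, new_b = update_elo(1000.0, 1000.0, a_wins=True)
print(new_a, new_b)
```

A win against an equally rated opponent moves each rating by half of k, while a win against a much weaker opponent moves them only slightly, which is why a high Elo score reflects consistent wins against strong competitors.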

AIbase notes that voice is a modality with both objective and subjective attributes, so it has to be assessed with quantifiable metrics as well as blind tests. Speech-02 achieves industry-leading results on objective metrics such as Word Error Rate (WER) and speaker similarity; in subjective listening tests, it reaches 99% human likeness with no rhythm flaws, delivering a smooth and natural listening experience. This dual advantage makes it stand out in scenarios such as podcasts, audiobooks, and real-time interaction.
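As background on the objective side, WER is conventionally computed as the word-level edit distance between a reference transcript and the transcript recognized from the synthesized audio, divided by the number of reference words. The sketch below implements that general definition; it is not MiniMax's evaluation pipeline, and the function name and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better: a WER of 0 means the recognized transcript matches the input text word for word.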

Technological Breakthroughs: Zero-Shot Cloning and Multi-Language Support

The core innovation of Speech-02 lies in its zero-shot voice cloning and multi-language coverage. AIbase has learned that the model can complete high-precision voice cloning from just 10 seconds of audio, making the cloned voice indistinguishable from the original. Users can generate emotionally expressive speech through simple text prompts, with support for emotions such as happiness, sadness, and anger, which makes the speech considerably more engaging.

In addition, Speech-02 supports over 30 languages, including Chinese, English, Japanese, Korean, and Arabic, covering the major global languages with native-level pronunciation. Its dynamic pause control lets users insert pauses of 0.01 to 99.99 seconds with the <#x#> tag, where x is the pause length in seconds, making the speech rhythm more natural and suiting complex scenarios such as audiobooks and AI dubbing. AIbase testing shows that Speech-02-HD stays stable and keeps output quality high when generating speech from long texts of up to 200,000 characters.
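The pause tag described above can be generated programmatically when preparing long scripts. The following is a minimal sketch that only builds the input text; it calls no MiniMax API. The helper names are hypothetical, and the two-decimal number format is an assumption based on the stated 0.01 to 99.99 second range.

```python
MIN_PAUSE_SECONDS = 0.01
MAX_PAUSE_SECONDS = 99.99


def pause_tag(seconds: float) -> str:
    """Build a <#x#> pause tag, rejecting values outside the stated range."""
    if not MIN_PAUSE_SECONDS <= seconds <= MAX_PAUSE_SECONDS:
        raise ValueError(
            f"pause must be between {MIN_PAUSE_SECONDS} and "
            f"{MAX_PAUSE_SECONDS} seconds, got {seconds}"
        )
    # Two decimal places is an assumed format, not a documented requirement.
    return f"<#{seconds:.2f}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments with the same pause between each pair."""
    return pause_tag(seconds).join(segments)


script = join_with_pauses(["Chapter one.", "It was a quiet morning."], 1.5)
print(script)
```

Validating the range before sending text for synthesis catches out-of-range pauses early rather than relying on the service to reject or clamp them.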

Architectural Innovation: Flow-VAE and Learnable Encoder

According to MiniMax's technical report, Speech-02 adopts an autoregressive Transformer architecture combined with a learnable speaker encoder and Flow-VAE technology. The speaker encoder extracts timbre features from reference audio without requiring a transcript, which is what enables zero-shot cloning; Flow-VAE improves overall synthesis quality and keeps timbre and expressiveness consistent. AIbase believes that this architecture not only improves voice realism but also sets multiple records in objective evaluations across 32 languages, solidifying the model's leading position in the industry.

The low latency of Speech-02 is also notable. Speech-02-Turbo streams audio output almost instantly in real-time applications, generating thousands of characters per second, which suits virtual assistants and real-time translation. Speech-02-HD, by contrast, targets high-fidelity scenarios such as professional dubbing and audiobook production. Together the two models cover a wide range of needs.

Industry Impact: Reshaping the AI Voice Application Ecosystem

The release of Speech-02 marks the entry of AI voice technology into a new era of high fidelity and low cost. AIbase observes that its top rankings on Artificial Analysis and Hugging Face have sparked widespread discussion, with community developers testing it in podcasts, educational content, and AI assistants. Compared with ElevenLabs' higher pricing (approximately $100 per million characters), Speech-02-HD and Speech-02-Turbo are priced at $50 and $30 per million characters respectively, giving small and medium-sized enterprises and independent developers a more affordable option.
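To put the per-million-character prices in concrete terms, the sketch below estimates the cost of a given amount of text at each quoted rate. The prices are the figures cited in this article (the ElevenLabs figure is approximate); the function name and the 200,000-character example are illustrative assumptions, and actual billing rules may differ.

```python
# Prices in USD per million characters, as quoted in this article.
PRICE_PER_MILLION_USD = {
    "Speech-02-HD": 50.0,
    "Speech-02-Turbo": 30.0,
    "ElevenLabs (approx.)": 100.0,
}


def synthesis_cost(num_chars: int, price_per_million: float) -> float:
    """Estimated cost in USD for synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * price_per_million / 1_000_000


# Example: a 200,000-character audiobook manuscript.
for name, price in PRICE_PER_MILLION_USD.items():
    print(f"{name}: ${synthesis_cost(200_000, price):.2f}")
```

At these rates, a 200,000-character text works out to about $10 with Speech-02-HD and $6 with Speech-02-Turbo, against roughly $20 at the quoted ElevenLabs price.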

In addition, MiniMax provides API access to Speech-02 through the fal.ai and Replicate platforms, so developers can integrate it into existing workflows with little effort. AIbase predicts that Speech-02's low barrier to entry and high performance will help popularize AI voice globally, with particular potential in multi-language education, cross-border e-commerce, and immersive entertainment.

Global Breakthrough for Chinese AI

As a specialized media outlet for the AI field, AIbase highly commends MiniMax Speech-02's double first-place ranking. Its zero-shot cloning, multi-language support, and low latency not only surpass OpenAI and ElevenLabs but also demonstrate China's global competitiveness in voice technology. AIbase also notes the potential for ecosystem synergy between Speech-02 and other Chinese models such as Qwen3, which may further accelerate the internationalization of China's AI technology.