Alibaba's Qwen team has released Qwen3-TTS, its new-generation text-to-speech model, which is now freely available to developers worldwide through the Qwen API. The model offers 49 character voice options and supports 10 major languages and 10 Chinese dialects. According to the official announcement, its average word error rate (WER) on the MiniMax TTS multilingual test set is lower than that of MiniMax and ElevenLabs, and its naturalness approaches that of human speech.


49 Voice Options Ready to Use  

- Character Library: Each voice comes with gender, age, region, and character settings, for example "Coquettish and funny Moutu", "Strict Teacher Mo Teacher", and "Wisdom Elder Cang Mingzi", and voices can be switched with one click  

- Scenario Adaptation: Podcasts, audiobooks, game NPCs, and smart customer service can switch voices in seconds without additional training

10 Languages and 10 Dialects, Leading WER Across Languages  

- Major Languages: Covering 10 languages including Chinese, English, German, Italian, and French  

- Dialect List: 10 dialects, including Mandarin, Cantonese, and Sichuan dialect, each retaining its authentic accent and intonation  

- Objective Metrics: The average WER on the MiniMax TTS multilingual test set is lower than that of ElevenLabs, which the announcement presents as a synthesis accuracy improvement of about 12%
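
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference text and a transcript, divided by the number of reference words. For TTS, the transcript typically comes from running a speech recognizer on the synthesized audio. The sketch below shows only that standard calculation. The function name `word_error_rate` is illustrative, whitespace tokenization is an assumption (character-level scoring is common for Chinese), and it is not the evaluation script used in the announcement.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One word dropped out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means the synthesized speech was transcribed back closer to the input text.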


Rhythm and Speed: Text-Driven, Naturalness Close to Real People  

- Adaptive Speed: Automatically adjusts speed and pauses based on the text's emotion  

- Rhythm Model: Predicts stress and intonation at the syllable level, reaching a mean opinion score (MOS) of 4.6, close to the 4.8 of human speech  

- Real-Time Streaming: First-packet latency under 300 ms, suitable for live dubbing and conversational scenarios
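
First-packet latency is the time between starting a synthesis request and receiving the first audio chunk. The following is a minimal sketch of how a developer might measure it for any streaming source. `fake_tts_stream` is a hypothetical stand-in that simulates a delay and yields silent bytes; it does not call Qwen3-TTS, and the chunk size is arbitrary.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_packet_latency(
    start_stream: Callable[[], Iterable[bytes]],
) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrives, the first chunk)."""
    started = time.perf_counter()
    stream = iter(start_stream())
    first_chunk = next(stream, None)
    if first_chunk is None:
        raise ValueError("stream ended before producing any audio")
    return time.perf_counter() - started, first_chunk


def fake_tts_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)          # simulated time before the first chunk
    for _ in range(chunks):
        yield b"\x00" * 320      # placeholder audio bytes


latency, chunk = first_packet_latency(fake_tts_stream)
print(f"first packet after {latency * 1000:.0f} ms, {len(chunk)} bytes")
```

In a real measurement, `start_stream` would open the streaming request, so the result also includes network and connection setup time.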

Free Access & Business-Friendly  

- API Pricing: Currently free, with no call limit  

- Licensing Terms: Commercial use is permitted by default, with no additional licensing fees  

- Integration Example: Integration takes a single HTTPS request, and voice playback can be implemented in about 10 lines of code
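
To give a sense of what a single-request integration could look like, here is a rough sketch that only builds the HTTPS request and does not send it. Everything specific in it is an assumption for illustration: the URL is a placeholder, the JSON fields `text` and `voice`, the voice name, and the Bearer-token header are hypothetical, and the real endpoint, parameters, and authentication should be taken from the official Qwen API documentation.

```python
import json
import urllib.request

# Placeholder URL, not the real Qwen endpoint.
API_URL = "https://example.invalid/v1/tts"


def build_tts_request(
    text: str, voice: str, api_key: str, url: str = API_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for speech synthesis."""
    # Hypothetical payload shape; field names are assumptions.
    payload = {"text": text, "voice": voice}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_tts_request("Hello, world", "example-voice", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Sending the request would then be one more call, such as `urllib.request.urlopen(request)`, with the response body written to an audio file or passed to a player.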

Next Step: Dialect Cloning + Edge Deployment 

Alibaba revealed that it plans to launch a "Dialect Voice Cloning" feature in Q1 2025, which will reproduce a regional accent from a 5-second audio clip. In Q2, it plans to release an edge box version that supports offline deployment on local networks, targeting scenarios such as smart scenic spots and in-car voice systems.

Editor's Note  

Now that text-to-speech technology has reached the stage where "voice is a character," Qwen3-TTS differentiates itself with 49 character settings, 10 dialects, and a free API: voices can be switched instantly without training, and its WER metrics compete directly with paid international engines. For applications that rely heavily on voice and style, such as podcasts, games, and customer service, this effectively brings the cost of "voice actors + post-production" close to zero.