Google has officially launched Gemini-TTS, a new text-to-speech model in the Gemini 3.1 series. The company's positioning is direct and confident: "The most expressive text-to-speech solution to date."
The model's core advance is that it gives developers real control over how speech sounds. Earlier TTS products often produced monotonous voices, with flat intonation, rigid rhythm, and shallow emotion. Gemini-TTS instead lets developers steer the emotion, rhythm, and style of the voice directly through prompts. Whether a narration calls for a deep, solemn tone or a conversation needs to feel relaxed and natural, pauses and emotional shifts can be precisely controlled by describing them in plain language. The result is a listening experience that is noticeably more natural and nuanced than earlier products of this kind.
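As a rough illustration of what "describing it in language" could look like in practice, the sketch below assembles a delivery description and the text to be spoken into a single prompt string. The helper name `build_tts_prompt`, its parameters, and the prompt layout are all hypothetical and chosen only for illustration; the prompt format Gemini-TTS actually expects is not specified here, and no API call is made.

```python
def build_tts_prompt(text: str, tone: str, pace: str = "", pauses: str = "") -> str:
    """Combine a natural-language delivery description with the text to speak.

    Hypothetical helper: the layout below is an assumption for illustration,
    not a documented Gemini-TTS prompt format.
    """
    directions = [f"Speak in a {tone} tone"]
    if pace:
        directions.append(f"at a {pace} pace")
    if pauses:
        directions.append(f"pausing {pauses}")
    # Join the delivery hints into one instruction, then append the script itself.
    return ", ".join(directions) + ":\n" + text


# A solemn narration and a casual conversational line, as in the examples above.
narration = build_tts_prompt(
    "The city slept beneath a heavy sky.",
    tone="deep and solemn",
    pace="slow",
    pauses="briefly after each sentence",
)
chat = build_tts_prompt("Hey, good to see you again!", tone="relaxed and natural")

print(narration)
print(chat)
```

Keeping the delivery description separate from the script makes it easy to reuse the same text with different tones, which is the kind of control the announcement describes.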

On the multilingual side, Gemini-TTS covers approximately 70 languages, including widely spoken ones such as Mandarin Chinese, English, Spanish, and Japanese. The model also detects the language of the input text automatically, so developers do not need to annotate it manually; it generates speech directly in the corresponding language. For enterprises serving global users, this means a single API can handle voice requirements for multilingual content, which directly benefits audiobooks, podcasts, customer service bots, and educational applications.
Google also emphasized how Gemini-TTS works together with the other audio models in the same series. In real-time conversation, voice translation, and multimodal interaction scenarios, the system can make fine-grained adjustments to voice output through text prompts and audio tags while keeping latency low. This brings AI speech closer to real human communication in practical settings such as phone calls, meetings, and navigation.



