Google has released Gemini 2.5, marking significant progress in AI audio dialogue and generation. Gemini 2.5 is a multimodal AI system that natively understands and generates text, images, audio, video, and code, giving users richer ways to interact with AI.
Gemini 2.5's real-time audio dialogue makes human-machine communication more natural. Its low latency keeps conversation fluid, and users can steer the delivery as they go, for example by asking for a particular accent or tone, or even asking the model to whisper.
Real-Time Audio Dialogue
Human conversations are rich and nuanced, conveying meaning not only through spoken words but also through tone, accent, and non-verbal sounds like laughter. Gemini 2.5 aims to support efficient, real-time communication through audio. Its audio dialogue function includes the following features:
- Natural Conversation: Delivers high-quality voice interaction with appropriate expressiveness and rhythm at very low latency, so the conversation flows smoothly.
- Style Control: Users can adjust the tone, accent, and emotional expression of the model's speech via natural language prompts, and can even ask it to whisper.
- Tool Integration: During conversations, Gemini 2.5 can call tools and functions to retrieve information from sources like Google Search in real time, making the conversation more useful.
- Dialogue Context Awareness: The system can identify and ignore background noise and irrelevant conversations, so that it responds only at appropriate moments.
- Audio and Video Understanding: Supports real-time audio and video streams, enabling discussions about video content or screen-shared information with users.
- Multi-Language Support: Supports over 24 languages and can switch between them within the same conversation.
- Emotional Dialogue: Responds to the user's tone of voice, recognizing that the same words spoken in different ways can carry different emotions.
- Advanced Thinking Dialogue: Enhances conversational coherence and intelligence by leveraging reasoning capabilities, particularly excelling in complex problem-solving scenarios.
Controllable Text-to-Speech Technology
Gemini 2.5 also advances text-to-speech (TTS) technology, letting users generate natural-sounding speech with fine-grained control over the audio. Users can generate content ranging from short phrases to long narratives and control its style, tone, emotion, and delivery, all through natural language prompts. Its capabilities include:
- Dynamic Performance: Reads text expressively, suitable for poetry recitation, news broadcasting, and storytelling, and can perform with specific emotions and accents on request.
- Speed and Pronunciation Control: Users can control speech speed and ensure accurate pronunciation of specific words.
- Multi-speaker Dialogue Generation: Can generate two-person dialogue audio based on textual input, making the content more engaging.
- Multi-language Audio Generation: Easily generates multi-language audio content, supporting over 24 languages.
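Because style and multi-speaker output are both driven by plain text, the input to such a system can be thought of as a style instruction followed by a labeled transcript. The following is a minimal sketch of assembling that text, not Google's documented prompt format: the `Turn` class, the `build_tts_prompt` function, and the `Speaker: line` layout are all hypothetical names and conventions chosen for illustration, and no API call is made.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # label for the speaker, e.g. "Host" (hypothetical convention)
    text: str     # the line this speaker says


def build_tts_prompt(style: str, turns: list[Turn]) -> str:
    """Combine a natural-language style instruction with a dialogue transcript."""
    if not turns:
        raise ValueError("at least one turn is required")
    # Collect distinct speaker labels while preserving first-seen order.
    speakers = list(dict.fromkeys(t.speaker for t in turns))
    if len(speakers) > 2:
        # The article describes two-person dialogue, so this sketch enforces that.
        raise ValueError("this sketch assumes at most two speakers")
    lines = [f"{t.speaker}: {t.text.strip()}" for t in turns]
    return "\n".join([style.strip(), *lines])


prompt = build_tts_prompt(
    "Read this as a relaxed podcast intro; the Guest whispers the last line.",
    [
        Turn("Host", "Welcome back to the show."),
        Turn("Guest", "Glad to be here."),
        Turn("Guest", "I have a secret to share."),
    ],
)
print(prompt)
```

The resulting string would then be passed as the text input of a TTS request; how speaker labels map to voices depends on the actual API and is outside this sketch.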
During the development of Gemini 2.5, Google conducted a comprehensive assessment of potential risks and implemented corresponding mitigation strategies. All audio outputs are watermarked with SynthID, so that AI-generated audio remains transparent and identifiable.
Gemini 2.5 provides developers with rich native audio functionality, allowing them to build more interactive applications through the Gemini API in Google AI Studio or Vertex AI. Developers can try native audio dialogue with Gemini 2.5 Flash Preview in the Stream tab of Google AI Studio, or use controllable text-to-speech generation to drive audio innovation in applications such as announcements, stories, podcasts, and video games.
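For developers handling generated speech, a common first step is wrapping the raw audio bytes in a playable container. The sketch below uses only Python's standard `wave` module and assumes the audio arrives as 16-bit mono PCM at 24 kHz. Those parameters are an assumption here and should be checked against the API documentation. `pcm_to_wav_bytes` and `fake_pcm` are hypothetical helper names, and a synthetic tone stands in for real model output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumed sample rate in Hz
SAMPLE_WIDTH = 2      # assumed 16-bit samples (2 bytes each)
CHANNELS = 1          # assumed mono audio


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buf.getvalue()


def fake_pcm(seconds: float = 0.5, freq: float = 440.0) -> bytes:
    """Stand-in for model output: a sine tone as little-endian 16-bit PCM."""
    n = int(SAMPLE_RATE * seconds)
    return b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n)
    )


wav_bytes = pcm_to_wav_bytes(fake_pcm())

# Read the header back to confirm the container matches the assumed format.
with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
    print(wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
```

In a real application, the bytes passed to `pcm_to_wav_bytes` would come from the audio part of the API response rather than from `fake_pcm`.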