According to AIbase, Google announced a major update this week to Gemini 2.5 Flash Native Audio, its native audio model, aiming to move AI interaction beyond simple "text-to-speech" toward genuinely human-like real-time communication.
The core of this update is its "native" processing capability. Unlike traditional pipelines, which must convert speech to text before processing it, the model works on the audio signal directly, perceiving tone, emotion, and pauses to enable more natural, fluid conversations.
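
For developers, this native path is exposed through the Live API. The sketch below is a rough illustration only, using the google-genai Python SDK: the model identifier is an assumption (the announcement does not name one), and `pcm_bytes` stands in for real microphone input.

```python
# Minimal sketch: streaming raw audio into a native-audio Gemini model
# over the google-genai SDK's Live API. The model name is illustrative
# (an assumption), and pcm_bytes is a placeholder for real 16 kHz,
# 16-bit mono PCM microphone data.
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

MODEL = "gemini-2.5-flash-native-audio-preview"  # illustrative name

async def main() -> None:
    config = types.LiveConnectConfig(response_modalities=["AUDIO"])
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        pcm_bytes = b"..."  # placeholder for captured PCM audio
        # Audio goes in directly -- no intermediate speech-to-text step.
        await session.send_realtime_input(
            audio=types.Blob(data=pcm_bytes, mime_type="audio/pcm;rate=16000")
        )
        async for message in session.receive():
            if message.data:  # model replies with raw audio chunks
                ...  # play or buffer the output audio

asyncio.run(main())
```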

Google's data shows the new version's compliance with developer instructions rising from 84% to 90%, reflecting higher accuracy on multi-step workflows. On the audio benchmark ComplexFuncBench, its function-calling accuracy reached 71.5%, surpassing OpenAI's gpt-realtime (66.5%) and showing strong competitiveness in live voice agents.
The technology is now fully integrated into Google AI Studio, Vertex AI, Gemini Live, and Search Live. Developers can try the upgraded model through the Gemini API, drawing on its stronger consistency and multi-turn conversational memory to build more reliable, emotionally aware AI assistants.
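
Those function-calling numbers matter most for voice agents that invoke tools mid-conversation. The sketch below is a hedged illustration of that flow with the same SDK's Live API: the `get_weather` tool, its schema, and the model name are hypothetical examples, not details from Google's announcement.

```python
# Sketch of tool use in a live voice session, assuming the google-genai
# SDK. The get_weather tool is a hypothetical example; the declaration
# and tool-response flow follow the SDK's documented pattern.
import asyncio
from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-2.5-flash-native-audio-preview"  # illustrative name

weather_tool = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="get_weather",  # hypothetical example function
        description="Look up the current weather for a city.",
        parameters=types.Schema(
            type="OBJECT",
            properties={"city": types.Schema(type="STRING")},
            required=["city"],
        ),
    )
])

async def main() -> None:
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        tools=[weather_tool],
    )
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[
                types.Part(text="What's the weather in Paris right now?")
            ])
        )
        async for message in session.receive():
            if message.tool_call:  # model asks us to run a declared tool
                responses = [
                    types.FunctionResponse(
                        id=call.id,
                        name=call.name,
                        response={"result": "18 C, light rain"},  # stubbed
                    )
                    for call in message.tool_call.function_calls
                ]
                await session.send_tool_response(function_responses=responses)

asyncio.run(main())
```

The key point of the flow is that the tool call arrives as structured data over the same live session that carries audio, so an agent can execute the function and have the spoken answer continue without breaking the conversation.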