Microsoft has officially announced that its latest speech-to-speech (S2S) model, GPT-realtime, has been officially released on the Azure AI Foundry platform. This new model integrates Microsoft's multiple improvements in speech technology into a unified product, with core advantages focusing on natural language processing, excellent audio quality, and more accurate command following capabilities.
Developers can now access GPT-realtime through a new Real-time API. The model is designed to provide more natural and expressive speech output and a higher quality audio experience. As part of this release, Microsoft also introduced two new voice options—Marin and Cedar—intended to offer realistic and clear speech synthesis for users.
In the announcement, Microsoft highlighted several key improvements in the new model, including enhanced function calling capabilities, higher accuracy in command execution, and innovative image input support. This new feature allows users to add images to voice conversations and discuss them, enabling multimodal interaction without relying on video streams.
In addition to technical upgrades, Microsoft also adjusted its pricing model. Compared to the previous gpt-4o-realtime preview version, the official version of gpt-realtime has reduced its price by 20%, with costs calculated based on the usage of per million tokens (token).
This release marks Microsoft's commitment to expanding its real-time AI capabilities for developers and enterprises. By combining expressive speech synthesis, high-quality audio, and multimodal input, GPT-realtime is expected to provide strong technical support for a wide range of applications, from advanced customer support systems to innovative assistive tools.