OpenAI held a technical live stream at 1 AM and officially launched its new speech model - GPT-Realtime. This multimodal model is designed for speech AI Agents, aiming to generate more natural and smooth speech, capable of imitating the rich and diverse tones, emotions, and speech rates of humans. GPT-Realtime has a wide range of application scenarios, covering areas such as customer service, education, finance, and healthcare, providing strong support for creating intelligent voice assistants.

image.png

GPT-Realtime introduces two unique speech styles - Marin and Cedar, and comprehensively upgrades the original eight speech styles. Unlike traditional speech models, GPT-Realtime can not only generate speech but also has intelligence, reasoning, and understanding capabilities. For example, the model can accurately capture non-verbal signals such as laughter and switch languages flexibly in conversations to adapt to different scenario needs.

In terms of evaluation, GPT-Realtime has significantly improved the accuracy of letter and number sequence detection in multiple language environments, with an accuracy rate of up to 82.8% in reasoning ability assessments, making it a leader among current intelligent speech models. The improvement in instruction following capability is also a major highlight of this model. Developers can customize instructions to enhance the model's response effectiveness. In the MultiChallenge audio benchmark test, GPT-Realtime's instruction following accuracy increased from 20.6% to 30.5%.

Aside from speech generation capabilities, GPT-Realtime also supports image input. Developers can combine images with audio or text in conversations, allowing the model to engage in dialogue based on what the user sees, providing a more personalized interactive experience. Additionally, the new features of Realtime API allow developers to easily connect to remote MCP servers, simplifying the integration process and improving development efficiency.

In terms of security and privacy, Realtime API is equipped with multiple layers of protection measures, monitoring conversation content in real-time to prevent abuse. At the same time, developers can add custom security protection as needed to ensure the safety of the usage environment.

From the day of release, all developers can use the new Realtime API and GPT-Realtime model. The price of audio input tokens has been reduced by 20%. Additionally, developers can flexibly set smart token limits to reduce the cost of long conversations.

Key Points:

🌟 GPT-Realtime is OpenAI's latest multimodal speech model, suitable for areas such as customer service and education.

📈 The model has significant improvements in reasoning ability and instruction following accuracy, providing stronger support for developers.

🔒 Realtime API is equipped with security protection measures, ensuring the safety and privacy of user interactions.