OpenAI has officially launched its Realtime API for production use, a significant step for the company in voice interaction technology. The API is aimed primarily at companies and developers building voice assistants for practical applications such as customer support, education, and personal productivity. At its core is the new GPT-Realtime model, which processes and generates speech directly rather than converting speech to text and back again, resulting in faster and more natural conversations.

Key Features and Significant Performance Improvements

The new GPT-Realtime model brings several improvements. It can pick up non-verbal cues such as laughter, switch smoothly between languages within the same sentence, and adjust its tone according to instructions such as "speak in a friendly French accent" or "speak quickly and professionally." The model also introduces two new voices, Cedar and Marin, and improves the existing voices.

In benchmark tests, GPT-Realtime scored 82.8% on Big Bench Audio (up from 65.6% for the previous model), 30.5% on MultiChallenge (up from 20.6%), and 66.5% on ComplexFuncBench (up from 49.7%). These figures point to significant progress in reasoning, following complex instructions, and calling functions.
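To put those scores in perspective, the short Python sketch below computes the gain on each benchmark, both in percentage points and relative to the comparison figure given in parentheses. The numbers are taken from the paragraph above; the function and variable names are illustrative only.

```python
# Scores in percent, as quoted above: (new score, comparison score).
SCORES = {
    "Big Bench Audio": (82.8, 65.6),
    "MultiChallenge": (30.5, 20.6),
    "ComplexFuncBench": (66.5, 49.7),
}


def gains(new: float, old: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = absolute / old * 100
    return round(absolute, 1), round(relative, 1)


for name, (new, old) in SCORES.items():
    points, relative = gains(new, old)
    print(f"{name}: +{points} points ({relative}% relative gain)")
```

The relative gain is largest on MultiChallenge, where the absolute score remains the lowest of the three.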


Better Integration and Lower Prices

The new API simplifies tool integration, and the model selects and uses the correct tools and parameters more reliably. Developers can now connect phone systems via SIP (Session Initiation Protocol) and external tools via remote MCP servers, and can save different configurations as reusable prompts.

Image input is also now available. Users can send screenshots or photos during a conversation, and the model can refer to what the image shows, for example by reading text in it or answering questions about it. Developers control which content the model can see.

For cost control, the new API lets developers set token limits and trim long sessions. The price of GPT-Realtime has also been reduced by 20%: it now costs $32 per million audio input tokens, $64 per million audio output tokens, and $0.40 per million cached input tokens.
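As a rough illustration of what these prices mean in practice, the Python sketch below estimates the cost of a session from its token counts. The three rates are the ones quoted above; the function name and the token counts in the example are hypothetical, and actual billing may include other token types not covered here.

```python
from decimal import Decimal

# Prices in USD per one million tokens, as quoted in the article.
PRICE_PER_MILLION = {
    "audio_input": Decimal("32"),
    "cached_input": Decimal("0.40"),
    "audio_output": Decimal("64"),
}


def estimate_cost(audio_input: int, cached_input: int, audio_output: int) -> Decimal:
    """Estimate the USD cost of a session from its token counts."""
    counts = {
        "audio_input": audio_input,
        "cached_input": cached_input,
        "audio_output": audio_output,
    }
    total = sum(PRICE_PER_MILLION[kind] * count for kind, count in counts.items())
    # Convert from "per million tokens" and round to a hundredth of a cent.
    return (total / 1_000_000).quantize(Decimal("0.0001"))


# Hypothetical session: 50,000 fresh input tokens, 200,000 cached input
# tokens, and 30,000 output tokens.
print(estimate_cost(50_000, 200_000, 30_000))
```

In this made-up session the cached input tokens outnumber the fresh ones four to one yet contribute only $0.08 of the total, which shows how much the cached rate matters in long conversations.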

Safety and Privacy: Protective Measures and User Choices

OpenAI says the API can detect and terminate conversations that violate its policies, but notes that developers should add their own safeguards as well. On data privacy, OpenAI offers an option for EU customers to store data within the EU, and has established dedicated privacy rules for enterprise users to support data security and compliance.