OpenAI has officially launched its latest speech model, GPT-Realtime. The multimodal speech-agent model has drawn industry attention for its strong reasoning, image-input support, and improved instruction following. According to the latest information from AIbase, GPT-Realtime not only advances speech interaction itself but also gives developers a smarter, more flexible speech-agent solution by integrating image input, remote MCP, and SIP phone calling.
GPT-Realtime: Pioneer of Multimodal Speech Interaction
GPT-Realtime is OpenAI's most advanced speech-to-speech model to date, designed specifically for production-grade speech agents. A single model processes and generates audio directly, significantly reducing the latency of traditional speech-interaction pipelines. Unlike conventional systems that chain multiple models (speech-to-text, text reasoning, and text-to-speech), GPT-Realtime's end-to-end architecture retains subtle details such as tone, emotion, and accent, producing a more natural, fluid conversation. The model accepts multiple input modalities, including text, audio, and images, marking a significant step for OpenAI in multimodal AI.
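The architectural difference can be made concrete with a toy sketch: in a cascaded pipeline the transcription stage keeps only the words, so nonverbal cues never reach the reasoning step, whereas an end-to-end speech model can condition on the full signal. All types and functions below are illustrative placeholders, not OpenAI APIs.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for an audio clip and its paralinguistic content."""
    transcript: str
    tone: str          # e.g. "hesitant", "cheerful" -- discarded by transcription
    has_laughter: bool


def cascaded_pipeline(user_audio: Audio) -> str:
    # Stage 1: speech-to-text keeps only the words.
    text = user_audio.transcript
    # Stage 2: the text model never sees tone or laughter.
    # Stage 3: text-to-speech gets no cue about how the reply should sound.
    return f"Reply to: {text}"


def end_to_end_model(user_audio: Audio) -> str:
    # A single speech-to-speech model sees the whole signal,
    # so nonverbal cues can shape the spoken response.
    style = "lighthearted" if user_audio.has_laughter else user_audio.tone
    return f"Reply to: {user_audio.transcript} (spoken in a {style} style)"


audio = Audio("can you help me?", tone="hesitant", has_laughter=False)
print(cascaded_pipeline(audio))   # tone information is already gone
print(end_to_end_model(audio))    # tone still influences the reply
```

The point of the sketch is structural: whatever the transcript does not encode (tone, pauses, laughter) is unrecoverable downstream in the cascade, while a single model never has to serialize the signal into text at all.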
Core Capabilities: Intelligent Reasoning and Nonverbal Signal Capture
GPT-Realtime demonstrates exceptional performance in intelligence, reasoning, and understanding, especially in handling complex interaction scenarios. Its key highlights include:
- Nonverbal Signal Recognition: The model can sensitively capture nonverbal cues such as laughter and pauses, enhancing the naturalness and human-like experience of interactions.
- Language Switching and Tone Adjustment: It supports seamless language switching during conversations and adjusts tone according to scenario requirements, such as "professional customer service" or "enthusiastic guidance," meeting diverse application needs.
- High-Precision Reasoning: On the Big Bench Audio benchmark, GPT-Realtime reached 82.8% reasoning accuracy, up sharply from the previous model's 65.6%, demonstrating strong logical processing capabilities.
- Instruction Following Optimization: In the MultiChallenge audio benchmark test, the instruction following accuracy increased from 20.6% to 30.5%, ensuring the model strictly follows complex instructions set by developers, such as reading legal statements word-for-word or processing alphanumeric sequences.
New Features: Image Input and Communication Integration
The release of GPT-Realtime brings several innovative features, further expanding the application scenarios of speech agents:
- Image Input Support: The model can process image inputs and describe their content, adding visual context to speech interactions, suitable for educational and customer support scenarios.
- Remote MCP and SIP Phone Calls: By supporting remote Model Context Protocol (MCP) and Session Initiation Protocol (SIP), developers can integrate GPT-Realtime into phone systems or external tools, enabling broader real-time interactions.
- Fine-Grained Context Control: The model supports reusable prompts and session trimming functions, allowing developers to precisely manage conversation context, optimizing cost and performance.
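To make the features above more tangible, here is a sketch of the client-to-server events a speech agent might send over the Realtime API WebSocket: a `session.update` carrying reusable session-level instructions, and a `conversation.item.create` attaching an image. The event shapes follow OpenAI's Realtime API documentation, but treat the exact field names as assumptions and verify them against the current API reference before use.

```python
import json

# Reusable prompt: instructions set once at the session level and
# kept across turns, rather than repeated in every request.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a professional customer-support voice agent. "
            "Read legal disclaimers word-for-word when asked."
        ),
    },
}

# Image input: attach an image (plus a text question) to the conversation
# so the model can describe it during the voice session. The data URL
# here is a truncated placeholder.
image_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What is shown in this photo?"},
            {"type": "input_image", "image_url": "data:image/png;base64,..."},
        ],
    },
}

for event in (session_update, image_item):
    # In a real client these would go over the open WebSocket,
    # e.g. websocket.send(json.dumps(event)).
    print(json.dumps(event)[:60])
```

Session trimming works on the same conversation state: because the client controls which items stay in context, long sessions can drop stale turns instead of paying for them on every request.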
Cost Optimization: More Cost-Effective Production-Level Speech Agents
OpenAI has reduced the price of the Realtime API in this update, lowering the cost of audio input to $32 per million tokens and audio output to $64 per million tokens, a 20% reduction from previous rates, offering developers a more economical solution. Compared to traditional speech interaction pipelines, GPT-Realtime significantly reduces latency and costs by using a single model, helping enterprises deploy efficient speech agents in customer support, personal assistants, and education sectors.
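The quoted rates are easy to turn into a per-session cost estimate. The sketch below uses only the figures stated above ($32 per million audio input tokens, $64 per million output tokens); the example token counts are hypothetical.

```python
# Audio pricing quoted for the updated Realtime API.
INPUT_USD_PER_M = 32.0   # per 1M audio input tokens
OUTPUT_USD_PER_M = 64.0  # per 1M audio output tokens


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a session given its audio token counts."""
    return (input_tokens / 1_000_000) * INPUT_USD_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_USD_PER_M


# e.g. a session consuming 50k input and 30k output audio tokens:
print(round(audio_cost_usd(50_000, 30_000), 2))  # 3.52

# The stated 20% reduction implies previous rates of $40 / $80 per 1M:
assert INPUT_USD_PER_M == 40.0 * 0.8
assert OUTPUT_USD_PER_M == 80.0 * 0.8
```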
Industry Impact: Intensifying Competition in Speech AI
The launch of GPT-Realtime has further intensified competition in the speech AI market. Anthropic, Meta, and Mistral have all stepped up their speech-technology efforts recently, with Anthropic's Claude voice mode and Mistral's Voxtral model among the examples. With GPT-Realtime's low latency, high expressiveness, and multimodal support, OpenAI has consolidated its lead in speech AI. Industry analysts believe the model's image input and telephony integration will drive adoption of speech agents in enterprise applications, particularly in customer service centers and real-time translation scenarios.
Future Outlook: Cornerstone of a Multimodal AI Ecosystem
OpenAI stated that GPT-Realtime is a crucial step in its multimodal strategy and that it plans to extend the model to video and other modalities, giving developers a more complete set of AI interaction tools. Combined with OpenAI's recently released Agents SDK, developers can upgrade existing text applications to speech interaction with just a few lines of code, greatly lowering the development barrier. AIbase expects GPT-Realtime's openness and performance to accelerate the commercialization of speech agents worldwide.
GPT-Realtime sets a new benchmark in speech AI with its multimodal capabilities, improved instruction following, and cost advantages. By integrating image input and telephony features, OpenAI not only makes speech agents more practical but also gives developers a more flexible and efficient environment to build on. This release pushes AI interaction technology to a new level and merits the industry's continued attention.
API Address: https://platform.openai.com/docs/guides/realtime