
OpenAI adds MCP and SIP support to gpt-realtime for smarter voice-based agents

Friday, August 29, 2025, 14:32, by InfoWorld
OpenAI has added remote Model Context Protocol (MCP) server and Session Initiation Protocol (SIP) support to gpt-realtime, its speech-to-speech large language model, via its dedicated Realtime API, to help enterprises build more autonomous voice-based agents.

Support for remote MCP servers in the Realtime API, which is now generally available, is designed to let developers give voice-based agents access to external capabilities and tools exposed as MCP servers on the internet or on separate infrastructure, said Charlie Dai, VP and principal analyst at Forrester.

Remote MCP servers are MCP servers that are not hosted on the same machine where the agent or agentic application runs.

Enterprises can enable MCP support in an API session by passing the URL of a remote MCP server into the session configuration, OpenAI said.

“Once connected, the API automatically handles the tool calls for you, so there’s no need to wire up integrations manually. This setup makes it easy to extend your agent with new capabilities,” the company explained in a blog post.
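The configuration step described above amounts to listing the remote MCP server as a tool in the session setup. The sketch below illustrates the idea in Python; the field names (`"type"`, `"server_label"`, `"server_url"`, `"require_approval"`) follow OpenAI's published examples but may change, and the server URL is a placeholder, not a real endpoint.

```python
import json

def build_session_config(mcp_url: str, label: str) -> dict:
    """Build a Realtime session config that points at a remote MCP server."""
    return {
        "type": "realtime",
        "model": "gpt-realtime",
        "tools": [
            {
                "type": "mcp",
                "server_label": label,
                "server_url": mcp_url,
                # Let the API invoke the server's tools without per-call approval.
                "require_approval": "never",
            }
        ],
    }

config = build_session_config("https://example.com/mcp", "example-tools")
print(json.dumps(config, indent=2))
```

Once a session is created with a config like this, the API discovers the server's tools and handles the call/response round trips itself, which is what removes the manual integration work.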

The added support for SIP, which is a standard for initiating and managing real-time voice calls over IP networks, will allow enterprises to integrate AI voice agents directly with PBX systems and phone networks, Dai said.

“Examples of use cases where enterprises can take advantage of SIP support in the API include automated call handling, appointment scheduling, and multilingual support for customer services in contact centers,” Dai added.
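In a setup like the call-handling use case above, inbound SIP calls are surfaced to the application as events, and the application decides how each call is handled. The sketch below is purely illustrative: the event shape (`"realtime.call.incoming"`, `"call_id"`, `"from"`) and the routing policy are assumptions, not the actual webhook contract.

```python
from datetime import time

def route_call(event: dict, now: time) -> str:
    """Decide whether the voice agent accepts a call or defers to voicemail."""
    if event.get("type") != "realtime.call.incoming":
        return "ignore"
    # Hypothetical policy: the agent handles calls during business hours only.
    if time(9, 0) <= now <= time(18, 0):
        return "accept"   # attach a gpt-realtime session to the call
    return "voicemail"    # out of hours: take a message instead

event = {"type": "realtime.call.incoming", "call_id": "call_123", "from": "+15550100"}
print(route_call(event, time(10, 30)))  # accept
```

The point of the sketch is the division of labor: SIP and the Realtime API carry the audio, while ordinary application code decides routing, hours, and escalation.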

Image input and additional capabilities

To make the gpt-realtime model more effective for voice-based use cases, OpenAI has added support for images, meaning users can now include visuals, such as photos, screenshots, or other images, alongside text or audio in a session.

This allows the model to interpret and respond based on what’s visually presented, making it possible to ask questions such as “What do you see?” or “Can you read the text in this image?” according to OpenAI’s blog post.
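A question like "Can you read the text in this image?" means sending the image and the text together as one user message. The sketch below shows one way to package that in Python; the item shape (a message with `"input_text"` and `"input_image"` parts, the image carried as a base64 data URL) mirrors OpenAI's multimodal message format, but the exact field names here are an assumption.

```python
import base64

def image_question_item(image_bytes: bytes, question: str) -> dict:
    """Package an image plus a text prompt as one conversation item."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": data_url},
            ],
        },
    }

item = image_question_item(b"\x89PNG...", "Can you read the text in this image?")
print(item["item"]["content"][1]["image_url"][:22])  # data:image/png;base64,
```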

According to analysts, the option to input images is a great addition that will come in handy for enterprises.

“This can be seen as the multimodal support, which is a key area in the market,” Dai said, adding that rivals such as Google with Project Astra are also focusing on multimodal, real-time assistance.

In addition to image input, OpenAI has improved the gpt-realtime model's context awareness and memory.

Additionally, the model provider said that the updated gpt-realtime model has shown improvements in following complex instructions, calling tools with precision, and producing speech that “sounds more natural and expressive.”

These improvements, according to Dai, would help enterprises use the API for enabling low-latency, natural voice interactions for a spectrum of use cases, such as real-time medical transcription, conversational booking assistants, customer service for banking, insurance, and telco, and employee enablement across major verticals.

Enterprises accessing the model through the API can use two new voices, Cedar and Marin, the model provider said.

Microsoft, OpenAI's largest investor, also announced two text-to-speech models this week, which the technology giant said will help unlock a broad spectrum of enterprise use cases.
https://www.infoworld.com/article/4048375/openai-adds-mcp-and-sip-support-to-gpt-realtime-for-smarte...

