OpenAI Outlines WebRTC Architecture for Low-Latency Voice AI at Scale

OpenAI recently outlined how it adapted WebRTC for low-latency voice AI at global scale. The new architecture replaced a conventional media termination model with a relay-transceiver design better suited to Kubernetes and cloud load balancers. It keeps WebRTC session state in a dedicated transceiver layer while using lightweight relays to reduce public UDP exposure and keep media routing close to users.

In the article, Yi Zhang and William McDonald, OpenAI members of technical staff, explain that global reach, fast connection setup, and low, stable media round-trip times were the main constraints behind the change. The team evaluated several approaches for exposing media sessions, each with different operational trade-offs.

The first alternative was direct per-session UDP exposure, which preserves the conventional WebRTC model. However, it pushes operational complexity into the infrastructure layer, especially in Kubernetes environments, where large public port ranges are difficult to manage safely. Allocating unique ports per server simplifies some routing decisions, but still leaves operators dealing with port planning, uneven utilisation, and more brittle rollout patterns.

Option 1: The SFU approach includes AI as a WebRTC participant (source)

TURN-style relays were also a plausible option, but they introduce a heavier intermediary into the media path and solve a wider problem than OpenAI needed for predominantly 1:1 model-to-user sessions. OpenAI instead chose to split responsibilities between two layers. A lightweight relay accepts incoming packets and forwards them, while a separate transceiver owns all of the stateful WebRTC machinery, including ICE negotiation, DTLS handshakes, SRTP encryption, and overall session lifecycle.

Option 2: The transceiver approach terminates WebRTC at the edge and converts to a backend protocol (source)
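ICE connectivity checks (carried as STUN), DTLS handshakes, and SRTP media usually share a single UDP port, so whichever component terminates WebRTC has to tell the packet types apart before it can process them. The Python sketch below classifies a datagram by its first byte, using the ranges defined in RFC 7983. It is an illustration only: the function name is hypothetical, and the article does not describe how OpenAI's transceiver performs this step.

```python
def classify_packet(datagram: bytes) -> str:
    """Classify a UDP datagram by its first byte, using the RFC 7983 ranges."""
    if not datagram:
        return "empty"
    first = datagram[0]
    if first <= 3:
        return "stun"          # ICE connectivity checks
    if 16 <= first <= 19:
        return "zrtp"
    if 20 <= first <= 63:
        return "dtls"          # handshake and other DTLS records
    if 64 <= first <= 79:
        return "turn-channel"
    if 128 <= first <= 191:
        return "rtp-or-rtcp"   # SRTP/SRTCP media and control
    return "unknown"
```

A transceiver built along these lines would pass "stun" packets to its ICE agent, "dtls" packets to the DTLS stack, and "rtp-or-rtcp" packets to SRTP decryption.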

This separation means the relay can remain simple, fast, and largely stateless, while the transceiver is the only component that needs to understand the full protocol. That keeps complexity concentrated in one place rather than duplicating it across backend services or pushing it into client behaviour. "The best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior," the authors state.

Relay statelessly forwards packets to the transceiver (source)
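One way a relay can forward without per-session state is to derive the destination from the packet itself. The Python sketch below assumes, purely for illustration, that the server-side ICE username fragment begins with a four-character transceiver ID, and reads it from the STUN USERNAME attribute of an incoming connectivity check. The routing table, the prefix convention, and the function names are all hypothetical; the article does not confirm that OpenAI's relay uses this scheme.

```python
import struct
from typing import Optional, Tuple

STUN_MAGIC_COOKIE = 0x2112A442
STUN_HEADER_LEN = 20
ATTR_USERNAME = 0x0006

# Hypothetical routing table: transceiver ID -> internal (host, port).
TRANSCEIVERS = {
    "tx01": ("10.0.0.11", 40000),
    "tx02": ("10.0.0.12", 40000),
}
ROUTING_PREFIX_LEN = 4  # assumption: first 4 ufrag characters name a transceiver


def stun_username(datagram: bytes) -> Optional[str]:
    """Return the USERNAME attribute of a STUN message, or None if not found."""
    if len(datagram) < STUN_HEADER_LEN:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", datagram, 0)
    # STUN messages have the two top bits clear and carry the magic cookie.
    if msg_type & 0xC000 or cookie != STUN_MAGIC_COOKIE:
        return None
    end = min(len(datagram), STUN_HEADER_LEN + msg_len)
    offset = STUN_HEADER_LEN
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", datagram, offset)
        value_start = offset + 4
        if value_start + attr_len > end:
            return None  # truncated attribute
        if attr_type == ATTR_USERNAME:
            raw = datagram[value_start:value_start + attr_len]
            return raw.decode("utf-8", "replace")
        # Attribute values are padded to a 4-byte boundary.
        offset = value_start + attr_len + (-attr_len % 4)
    return None


def pick_transceiver(datagram: bytes) -> Optional[Tuple[str, int]]:
    """Choose a backend from the packet alone, keeping no per-session state."""
    username = stun_username(datagram)
    if username is None:
        return None
    # In ICE, USERNAME is "<receiver ufrag>:<sender ufrag>".
    server_ufrag = username.split(":", 1)[0]
    return TRANSCEIVERS.get(server_ufrag[:ROUTING_PREFIX_LEN])
```

DTLS and SRTP packets carry no username, so a real relay would need a further rule for them. The sketch leaves that part out.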

WebRTC is a common choice for real-time AI workloads. Beyond low-latency media delivery, it also provides NAT traversal, encrypted transport, codec negotiation, jitter buffering, and audio features such as echo cancellation across browsers and mobile platforms. STUN is part of that foundation, helping endpoints discover how they appear on the network and supporting ICE during connectivity checks.
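The STUN discovery step can be sketched in a few lines of Python. The example below builds a Binding Request and decodes the IPv4 XOR-MAPPED-ADDRESS attribute of a response, following the message layout in RFC 5389. It is a minimal sketch, not a full STUN client: the function names are hypothetical, and IPv6, retransmission, and message integrity are left out.

```python
import os
import socket
import struct
from typing import Optional, Tuple

STUN_MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
ATTR_XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request() -> bytes:
    """Build a Binding Request with no attributes and a random transaction ID."""
    header = struct.pack("!HHI", BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + os.urandom(12)


def parse_xor_mapped_address(response: bytes) -> Optional[Tuple[str, int]]:
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(response) < 20:
        return None
    _, msg_len, cookie = struct.unpack_from("!HHI", response, 0)
    if cookie != STUN_MAGIC_COOKIE:
        return None
    end = min(len(response), 20 + msg_len)
    offset = 20
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", response, offset)
        value = response[offset + 4:offset + 4 + attr_len]
        is_ipv4 = len(value) >= 8 and value[1] == 0x01
        if attr_type == ATTR_XOR_MAPPED_ADDRESS and is_ipv4:
            xport, xaddr = struct.unpack_from("!HI", value, 2)
            # The port is XORed with the top 16 bits of the magic cookie,
            # the address with the whole cookie.
            port = xport ^ (STUN_MAGIC_COOKIE >> 16)
            addr = xaddr ^ STUN_MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        # Attribute values are padded to a 4-byte boundary.
        offset += 4 + attr_len + (-attr_len % 4)
    return None
```

In use, the request would be sent over a UDP socket to a STUN server, and the reply passed to parse_xor_mapped_address to learn the public address and port that the NAT assigned to the endpoint.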

Many teams default to selective forwarding units, or SFUs, because they centralise media routing and policy for multi-party systems. However, OpenAI’s workloads are mostly 1:1 sessions between a user and a model, making a transceiver design a better fit than treating the model as another participant in a conferencing-style architecture.

The post adds infrastructure detail to OpenAI’s broader real-time voice push, which already reaches users through products such as ChatGPT voice and the Realtime API. For architects building interactive media systems, the most interesting pattern is the decomposition itself: preserve protocol behaviour at the edge, keep hard session state in one place, and move scaling complexity into a thin routing layer rather than spreading it across backend services.

About the Author

Eran Stiller
