ByteByteGo·July 1, 2026

OpenAI's Low-Latency Voice AI Architecture with WebRTC and Stateless Relays

This article details OpenAI's system architecture for delivering low-latency voice AI to 900 million weekly active users, addressing the challenges of scaling WebRTC on Kubernetes. It highlights a split architecture that pairs stateless relays at the edge with stateful transceivers behind them. The design is optimized for 1:1 user-to-model interactions, keeps WebRTC's benefits, and reuses an existing protocol field for efficient routing.

The Challenge: Scaling WebRTC on Kubernetes for Voice AI

OpenAI faced significant architectural hurdles scaling WebRTC for its voice AI service, which serves 900 million weekly active users. The core problem stems from the mismatch between WebRTC's assumption of stable IP/port assignments and Kubernetes' ephemeral, dynamic nature. Traditional WebRTC deployments, often using one UDP port per session, lead to port exhaustion and operational complexity when deployed at scale on Kubernetes. Furthermore, WebRTC's stateful nature for protocols like ICE and DTLS demands session stickiness, clashing with Kubernetes' pod mobility. The goal was to deliver a natural, low-latency conversational experience, requiring continuous audio streams and quick connection setups globally.
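
The port exhaustion problem follows from simple arithmetic: a UDP port number is 16 bits, so a single IP address has at most 65,535 ports, and fewer once the well-known range is excluded. The sketch below estimates how many public IP addresses a one-port-per-session design would need. The session counts are hypothetical illustrations, not figures from the article, and the choice to exclude ports below 1024 is an assumption.

```python
import math

# Ports 1024..65535 inclusive; excluding the well-known range is an assumption.
USABLE_UDP_PORTS = 65_535 - 1_024 + 1  # 64,512


def public_ips_needed(concurrent_sessions: int,
                      ports_per_ip: int = USABLE_UDP_PORTS) -> int:
    """Public IPs required if every session occupies its own UDP port."""
    return math.ceil(concurrent_sessions / ports_per_ip)


if __name__ == "__main__":
    # Hypothetical concurrency levels, chosen only to show the growth.
    for sessions in (50_000, 1_000_000, 10_000_000):
        print(sessions, "sessions ->", public_ips_needed(sessions), "public IPs")
```

Every one of those ports would also have to be exposed and tracked through the cluster's networking layer, which is where the operational complexity comes from.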

OpenAI's Split Architecture: Relay and Transceiver

To overcome these challenges, OpenAI implemented a split architecture:

Stateless Relay: Positioned at the geographic edge, this component handles protocol-aware packet routing. It presents a small public footprint and forwards packets to the appropriate transceiver with minimal processing, keeping audio encrypted end-to-end.
Stateful Transceiver: Located behind the relay, this component owns all the heavy WebRTC state, including ICE connectivity checks, DTLS handshakes, SRTP encryption keys, and the session lifecycle. It is the actual endpoint completing WebRTC handshakes and encrypting/decrypting media.

This separation effectively solves the port exhaustion problem by multiplexing many sessions behind a shared UDP socket on the transceiver. It also manages state stickiness by ensuring that a session's state remains with its assigned transceiver, even if relay instances are ephemeral.
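
To make the multiplexing idea concrete, here is a minimal sketch of how many sessions could share one UDP socket. The article does not say how the transceiver tells sessions apart, so keying the table on the remote (IP, port) pair is an assumption, and `Session` and `SharedSocketDemux` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Addr = Tuple[str, int]  # (remote IP, remote UDP port)


@dataclass
class Session:
    """Stand-in for per-session WebRTC state (ICE, DTLS, SRTP keys)."""
    session_id: str
    packets: List[bytes] = field(default_factory=list)


class SharedSocketDemux:
    """Routes datagrams arriving on ONE socket to per-session state.

    Assumption: sessions are distinguished by the sender's address.
    """

    def __init__(self) -> None:
        self._by_addr: Dict[Addr, Session] = {}

    def bind(self, addr: Addr, session_id: str) -> Session:
        session = Session(session_id)
        self._by_addr[addr] = session
        return session

    def dispatch(self, addr: Addr, payload: bytes) -> Optional[Session]:
        session = self._by_addr.get(addr)
        if session is not None:
            session.packets.append(payload)
        return session  # None means "unknown sender, drop or inspect"


if __name__ == "__main__":
    demux = SharedSocketDemux()
    demux.bind(("203.0.113.5", 40000), "sess-a")
    demux.bind(("203.0.113.9", 40000), "sess-b")
    hit = demux.dispatch(("203.0.113.9", 40000), b"rtp-frame")
    print(hit.session_id if hit else "dropped")
```

With a table like this, the number of concurrent sessions is bounded by memory rather than by the number of UDP ports the host can open.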

Smart Routing with ICE Ufrag

A critical innovation is how the stateless relay routes the initial packet of a new session. Instead of relying on a database lookup (which adds latency) or random routing (which doubles hops), OpenAI leverages the ICE username fragment (ufrag). The ufrag, an existing field in the STUN binding requests exchanged during WebRTC setup, is generated server-side with embedded routing metadata. The relay parses just enough of the first STUN packet to read this ufrag, decodes the routing hint, and forwards the packet to the correct transceiver. Subsequent packets use an established in-memory mapping or a Redis cache for faster lookup.
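
The parsing step can be sketched as follows. The STUN framing used here is standard: a 20-byte header with the magic cookie 0x2112A442, followed by type-length-value attributes padded to 4 bytes, with USERNAME as attribute 0x0006. In ICE connectivity checks, the USERNAME value is the receiver's ufrag, a colon, then the sender's ufrag. The ufrag layout itself is not described in the article, so the format below (8 hex characters carrying a transceiver ID, then random characters) is purely an assumption, and `make_ufrag` and `routing_hint` are hypothetical helpers.

```python
import secrets
import string
import struct
from typing import Optional

STUN_MAGIC_COOKIE = 0x2112A442
STUN_BINDING_REQUEST = 0x0001
ATTR_USERNAME = 0x0006
HINT_LEN = 8  # hypothetical: first 8 hex chars of the ufrag hold the hint


def make_ufrag(transceiver_id: int) -> str:
    """Server side: embed a routing hint in the ufrag (assumed layout)."""
    return f"{transceiver_id:08x}{secrets.token_hex(4)}"


def build_binding_request(username: str) -> bytes:
    """Build a minimal STUN binding request; used here only for the demo."""
    value = username.encode()
    attr = struct.pack("!HH", ATTR_USERNAME, len(value)) + value
    attr += b"\x00" * (-len(value) % 4)  # pad value to a 4-byte boundary
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, len(attr),
                         STUN_MAGIC_COOKIE)
    return header + secrets.token_bytes(12) + attr  # 12-byte transaction ID


def extract_username(packet: bytes) -> Optional[str]:
    """Relay side: read only the USERNAME attribute, nothing else."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", packet, 0)
    if msg_type != STUN_BINDING_REQUEST or cookie != STUN_MAGIC_COOKIE:
        return None
    offset, end = 20, min(len(packet), 20 + msg_len)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", packet, offset)
        offset += 4
        if offset + attr_len > end:
            return None  # truncated attribute
        if attr_type == ATTR_USERNAME:
            return packet[offset:offset + attr_len].decode("utf-8", "replace")
        offset += attr_len + (-attr_len % 4)  # skip value and padding
    return None


def routing_hint(packet: bytes) -> Optional[int]:
    """Decode the assumed transceiver ID from the server-generated ufrag."""
    username = extract_username(packet)
    if username is None:
        return None
    hint = username.split(":", 1)[0][:HINT_LEN]
    if len(hint) != HINT_LEN or any(c not in string.hexdigits for c in hint):
        return None
    return int(hint, 16)


if __name__ == "__main__":
    ufrag = make_ufrag(transceiver_id=42)
    packet = build_binding_request(f"{ufrag}:clientfrag")
    print(routing_hint(packet))
```

Because the hint travels inside the packet, the relay needs no shared state to place the first packet; only later packets depend on a cached mapping.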

💡

System Design Principle: Leverage Existing Protocol Fields

When designing high-performance distributed systems, consider if existing protocol fields can be repurposed or extended with metadata (like routing hints) to avoid introducing new lookup mechanisms, thereby reducing latency and dependencies. This principle aligns with minimizing work on the hot path.

Global Relay and Geo-Steered Signaling

By reducing the public UDP surface to a fixed set of addresses, the relay pattern became deployable globally. OpenAI uses a "Global Relay" fleet, geographically distributed ingress points that shorten the first client-to-OpenAI hop. This proximity-based routing lowers latency, stabilizes timing, and improves loss profiles before traffic reaches OpenAI's backbone. Clients connect to a Virtual IP (VIP) fronting the entire relay fleet, ensuring a stable destination despite many instances.

Minimized Public Footprint: Reduced exposed UDP ports, improving security and operational simplicity.
Stateless Edge: Relay instances can be ephemeral, aligning with Kubernetes' model.
Optimal for 1:1 Traffic: This architecture is tailored for one user talking to one model, unlike Selective Forwarding Units (SFUs), which are optimized for multi-party calls.
Latency Optimization: Leveraging ufrag for routing and global relay deployment minimizes round-trip times.