OpenAI has redesigned its WebRTC architecture to achieve low-latency voice AI at a global scale, moving from a conventional media termination model to a relay-transceiver design. This new approach optimizes for Kubernetes environments and cloud load balancers by separating stateless relays from stateful transceivers. The architecture focuses on reducing public UDP exposure and keeping media routing close to users for improved performance.
As reported by InfoQ, OpenAI's new WebRTC architecture for low-latency voice AI addresses critical requirements such as global reach, fast connection setup, and stable, low media round-trip times. The core innovation lies in its departure from traditional WebRTC models, opting for a design better suited to the operational realities of large-scale cloud deployments and Kubernetes.
The key architectural decision was to replace a conventional media termination model with a relay-transceiver separation. Unlike direct per-session UDP exposure or heavy TURN-style relays, this design introduces two distinct layers: stateless relays and stateful transceivers.
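The split between stateless relays and stateful transceivers can be illustrated with a minimal sketch. This is not OpenAI's implementation: the source does not describe how relays choose a transceiver, so the routing rule here (rendezvous hashing over a hypothetical session key) is an assumption, as are the names `Relay`, `Transceiver`, `pick_transceiver`, and the `tx-a`/`tx-b`/`tx-c` backends. The point is only that a relay can make the same routing decision on every packet without holding per-session data, while the transceiver is the one place where session state accumulates.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from typing import Sequence


def pick_transceiver(session_key: str, transceivers: Sequence[str]) -> str:
    """Rendezvous (highest-random-weight) hashing.

    A pure function of its inputs, so any relay instance given the same
    backend list picks the same transceiver for a session.
    Assumption: the real routing rule is not described in the source.
    """
    if not transceivers:
        raise ValueError("no transceivers available")

    def weight(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{session_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(transceivers, key=weight)


@dataclass
class Transceiver:
    """Stateful layer (hypothetical): owns everything tied to a session."""

    name: str
    # Stand-in for real per-session state: bytes received per session key.
    sessions: dict[str, int] = field(default_factory=dict)

    def handle(self, session_key: str, payload: bytes) -> None:
        self.sessions[session_key] = self.sessions.get(session_key, 0) + len(payload)


@dataclass(frozen=True)
class Relay:
    """Stateless layer (hypothetical): knows the backends, nothing per session."""

    transceiver_names: tuple[str, ...]

    def route(self, session_key: str) -> str:
        return pick_transceiver(session_key, self.transceiver_names)


# Three packets of one session arrive at two different relay instances.
backends = {name: Transceiver(name) for name in ("tx-a", "tx-b", "tx-c")}
relay_1 = Relay(tuple(backends))
relay_2 = Relay(tuple(backends))

for relay in (relay_1, relay_2, relay_1):
    target = relay.route("session-123")
    backends[target].handle("session-123", b"\x00" * 160)

# Exactly one transceiver ends up owning the session, whichever relay was hit.
owners = [name for name, tx in backends.items() if "session-123" in tx.sessions]
print(owners)
```

Because the relays in this sketch hold no session table, they can be added, removed, or restarted behind a load balancer without losing anything; only the transceiver that owns a session needs to stay put for its lifetime.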
Design Principle: Centralized Complexity
The article highlights a crucial system design principle: "The best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior." This advocates encapsulating protocol-specific complexity in a dedicated, optimized layer rather than scattering it throughout the application.
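OpenAI evaluated several alternatives before settling on the relay-transceiver model. The approaches it is contrasted with include direct per-session UDP exposure and heavy TURN-style relays.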
This architectural decomposition preserves WebRTC protocol behavior at the edge while concentrating hard session state and scaling complexity into a manageable routing layer, which makes it relevant for architects building interactive media and AI systems at scale.