InfoQ Architecture · May 20, 2026

OpenAI's WebRTC Architecture for Low-Latency Voice AI

OpenAI has redesigned its WebRTC architecture to deliver low-latency voice AI at global scale, moving from a conventional media termination model to a relay-transceiver design. The new approach is built to fit Kubernetes environments and cloud load balancers by separating stateless relays from stateful transceivers. It focuses on reducing public UDP exposure and keeping media routing close to users to improve performance.

Distributed Systems · Cloud & Infrastructure · Performance & Scaling

Read original on InfoQ Architecture

OpenAI's new WebRTC architecture for low-latency voice AI is built around requirements such as global reach, fast connection setup, and stable, low media round-trip times. Its core innovation is a departure from traditional WebRTC deployment models in favor of a design better suited to the operational realities of large-scale cloud deployments and Kubernetes.

Architectural Shift: Relay-Transceiver Design

The key architectural decision was to replace a conventional media termination model with a novel relay-transceiver separation. Unlike direct per-session UDP exposure or heavy TURN-style relays, this design introduces two distinct layers:

Lightweight Relays: These components are largely stateless, accepting incoming packets and forwarding them to the appropriate transceiver. Their primary role is to reduce public UDP exposure and keep media routing geographically close to the users, minimizing latency.
Stateful Transceivers: This layer owns all the complex, stateful WebRTC machinery. This includes ICE negotiation, DTLS handshakes, SRTP encryption, and managing the overall session lifecycle. By centralizing this complexity, the design avoids duplicating state across numerous backend services.
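To make the transceiver's protocol responsibilities concrete, here is a minimal sketch of how a component that terminates ICE (STUN), DTLS, and SRTP on one UDP socket can tell the three kinds of packets apart. It uses the standard first-byte ranges from RFC 7983; the summary does not say that OpenAI's transceivers do exactly this, and the function name `classify_packet` is hypothetical.

```python
def classify_packet(data: bytes) -> str:
    """Classify a UDP payload by its first byte, using RFC 7983 ranges.

    Illustrative only: a real transceiver would hand the packet to its
    ICE, DTLS, or SRTP handling after this step.
    """
    if not data:
        return "unknown"
    first = data[0]
    if 0 <= first <= 3:
        return "stun"          # ICE connectivity checks
    if 20 <= first <= 63:
        return "dtls"          # DTLS handshake and records
    if 128 <= first <= 191:
        return "rtp_or_rtcp"   # (S)RTP media and (S)RTCP feedback
    return "unknown"           # e.g. ZRTP or TURN channel ranges, not handled here


if __name__ == "__main__":
    print(classify_packet(bytes([0x00, 0x01])))  # STUN Binding Request prefix
    print(classify_packet(bytes([22])))          # DTLS handshake content type
    print(classify_packet(bytes([0x80])))        # RTP version 2 header byte
```

Because this classification and everything behind it lives only in the transceiver layer, the relays in front of it do not need to understand any of these protocols in depth.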

Design Principle: Centralized Complexity

The article highlights a key system design principle: "The best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior." In other words, protocol-specific complexity belongs in one dedicated, optimized layer rather than scattered throughout the application.
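A thin routing layer in this spirit can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's implementation: the summary does not say how a relay identifies the right transceiver, so the sketch assumes a hypothetical routing token of the form `transceiver_id.nonce` that is issued at session setup and has already been extracted from a packet. `TRANSCEIVERS`, `make_routing_token`, `route`, and `Relay` are all invented names, and the addresses are placeholders.

```python
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]

# Hypothetical static table: transceiver id -> internal (host, port).
TRANSCEIVERS: Dict[str, Addr] = {
    "tx-a": ("10.0.0.11", 7000),
    "tx-b": ("10.0.0.12", 7000),
}


def make_routing_token(transceiver_id: str, nonce: str) -> str:
    """Build the assumed token; the id must not contain a '.' character."""
    return f"{transceiver_id}.{nonce}"


def route(token: str, table: Dict[str, Addr] = TRANSCEIVERS) -> Optional[Addr]:
    """Resolve a token to a transceiver address without any per-session table."""
    transceiver_id, sep, _nonce = token.partition(".")
    if not sep:
        return None  # malformed token
    return table.get(transceiver_id)


class Relay:
    """Relay with only soft state: a flow cache that can be lost and rebuilt."""

    def __init__(self, table: Dict[str, Addr]) -> None:
        self.table = table
        self.flows: Dict[Addr, Addr] = {}  # client address -> transceiver address

    def forward_target(self, client: Addr, token: Optional[str] = None) -> Optional[Addr]:
        if token is not None:
            # Packets that carry a token (re)establish the flow mapping.
            target = route(token, self.table)
            if target is not None:
                self.flows[client] = target
            return target
        # Packets without a token follow the cached mapping, if one exists.
        return self.flows.get(client)
```

The point of the sketch is the shape of the design: the relay needs only a small static table plus a disposable cache, while all durable session state stays in the transceiver that the token names.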

Trade-offs and Alternatives Considered

OpenAI evaluated several alternatives before settling on the relay-transceiver model:

Direct Per-Session UDP Exposure: While conventional for WebRTC, this approach pushes significant operational complexity into the infrastructure layer, especially with Kubernetes, due to challenges in managing large public port ranges safely and effectively.
TURN-style Relays: These introduce a heavier intermediary, solving a broader set of problems than required for OpenAI's predominantly 1:1 user-to-model sessions, adding unnecessary overhead.
SFU (Selective Forwarding Unit) Approach: Commonly used for multi-party conferencing, an SFU treats each participant (including the AI model) as an equal. However, for 1:1 sessions, the transceiver design provides a more efficient fit by treating the model as a backend service rather than another peer.

This decomposition preserves standard WebRTC protocol behavior at the edge while concentrating hard session state and scaling complexity in a manageable routing layer. That makes it relevant for architects building interactive media and AI systems at scale.

WebRTC · Voice AI · Low Latency · Kubernetes · Cloud Architecture · Real-time Systems · Distributed Media · OpenAI

Architecture Design

Design this yourself

Design a low-latency voice AI system capable of global scale, focusing on the WebRTC media plane architecture. Incorporate a relay-transceiver design to separate stateless packet forwarding from stateful WebRTC session management, leveraging Kubernetes and cloud load balancers while minimizing public UDP exposure and optimizing for 1:1 user-to-model interactions. Detail the roles of relays and transceivers, and how this separation addresses operational complexity and performance challenges.

Practice Interview

Focus: WebRTC media termination and relay architecture for low-latency voice AI

Other design angles

· Design a real-time communication platform for multi-party video conferencing, comparing the suitability of SFU (Selective Forwarding Unit) vs. a transceiver-based architecture for mixed 1:1 and group calls.
· Design a scalable API gateway specifically to handle WebRTC connections, focusing on secure session setup, NAT traversal, and efficient media packet routing to backend services for various real-time applications.
· Design a distributed system for real-time speech-to-text and text-to-speech, exploring how to integrate WebRTC at the edge with backend AI inference services while maintaining low latency and high availability across global regions.
