InfoQ Architecture · May 20, 2026

OpenAI's WebRTC Architecture for Low-Latency Voice AI

OpenAI has redesigned its WebRTC architecture to achieve low-latency voice AI at a global scale, moving from a conventional media termination model to a relay-transceiver design. This new approach optimizes for Kubernetes environments and cloud load balancers by separating stateless relays from stateful transceivers. The architecture focuses on reducing public UDP exposure and keeping media routing close to users for improved performance.


The redesign targets requirements such as global reach, fast connection setup, and stable, low media round-trip times. Its core change is a departure from traditional WebRTC deployment models in favor of a design that fits the operational realities of large-scale cloud deployments on Kubernetes.

Architectural Shift: Relay-Transceiver Design

The key architectural decision was to replace a conventional media termination model with a relay-transceiver separation. Instead of exposing a public UDP endpoint per session or deploying heavy TURN-style relays, the design introduces two distinct layers:

  • Lightweight Relays: These components are largely stateless, accepting incoming packets and forwarding them to the appropriate transceiver. Their primary role is to reduce public UDP exposure and keep media routing geographically close to the users, minimizing latency.
  • Stateful Transceivers: This layer owns all the complex, stateful WebRTC machinery. This includes ICE negotiation, DTLS handshakes, SRTP encryption, and managing the overall session lifecycle. By centralizing this complexity, the design avoids duplicating state across numerous backend services.
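
The summary does not say how a relay decides which transceiver should receive a packet. As a minimal sketch, assume the transceiver embeds a short routing prefix in the ICE username fragment it hands to the client, so that the relay can read it back from the STUN `USERNAME` attribute of the first Binding request without keeping any session table. The prefix scheme, `PREFIX_LEN`, the `routes` table, and all function names below are hypothetical; only the STUN header layout, the magic cookie, and the `USERNAME` attribute type come from the STUN specification.

```python
import struct
from typing import Dict, Optional, Tuple

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value in every STUN header
ATTR_USERNAME = 0x0006          # STUN USERNAME attribute type
PREFIX_LEN = 4                  # hypothetical: ufrag prefix naming a transceiver

Addr = Tuple[str, int]


def stun_username(packet: bytes) -> Optional[str]:
    """Return the USERNAME attribute of a STUN message, or None if absent."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", packet, 0)
    # The two most significant bits of a STUN message are always zero.
    if msg_type & 0xC000 or cookie != STUN_MAGIC_COOKIE:
        return None
    end = min(20 + msg_len, len(packet))
    offset = 20  # attributes start right after the 20-byte header
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", packet, offset)
        if attr_type == ATTR_USERNAME:
            value = packet[offset + 4 : offset + 4 + attr_len]
            return value.decode("utf-8", errors="replace")
        # Attribute values are padded to a 4-byte boundary.
        offset += 4 + ((attr_len + 3) & ~3)
    return None


def pick_transceiver(packet: bytes, routes: Dict[str, Addr]) -> Optional[Addr]:
    """Map a STUN Binding request to a transceiver address (assumed scheme)."""
    username = stun_username(packet)
    if username is None:
        return None
    # In ICE, USERNAME is "<receiver ufrag>:<sender ufrag>"; the server side is first.
    server_ufrag = username.split(":", 1)[0]
    return routes.get(server_ufrag[:PREFIX_LEN])


def binding_request(username: str) -> bytes:
    """Build a minimal Binding request with only USERNAME (demo helper)."""
    value = username.encode("utf-8")
    padded = value + b"\x00" * (-len(value) % 4)
    attr = struct.pack("!HH", ATTR_USERNAME, len(value)) + padded
    # All-zero transaction ID is fine for a demo; real clients use random bytes.
    header = struct.pack("!HHI", 0x0001, len(attr), STUN_MAGIC_COOKIE) + b"\x00" * 12
    return header + attr


routes = {"tx07": ("10.0.0.7", 5000)}  # hypothetical transceiver address
packet = binding_request("tx07AbCd:clientfrag")
print(pick_transceiver(packet, routes))
```

DTLS and SRTP packets that follow the STUN exchange carry no username fragment, so a real relay would need some other way to route them, for example a short-lived cache keyed by the client's source address. How OpenAI handles this is not stated in this summary.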

Design Principle: Centralized Complexity

The article highlights a system design principle: "The best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior." In other words, protocol-specific complexity belongs in one dedicated, optimized layer rather than being scattered throughout the application.
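
One way to picture this encapsulation is a session object that owns the protocol lifecycle and exposes only decoded audio to the backend. The sketch below is illustrative and performs no real ICE, DTLS, or SRTP work; the `Phase` names, the `TransceiverSession` class, and the `on_audio` callback are all hypothetical and are not taken from OpenAI's implementation.

```python
from enum import Enum, auto
from typing import Callable


class Phase(Enum):
    NEW = auto()
    ICE_CONNECTED = auto()     # connectivity checks succeeded
    DTLS_ESTABLISHED = auto()  # handshake done, SRTP keys available
    MEDIA_FLOWING = auto()
    CLOSED = auto()


# Allowed transitions: strictly forward, or straight to CLOSED.
_NEXT = {
    Phase.NEW: {Phase.ICE_CONNECTED, Phase.CLOSED},
    Phase.ICE_CONNECTED: {Phase.DTLS_ESTABLISHED, Phase.CLOSED},
    Phase.DTLS_ESTABLISHED: {Phase.MEDIA_FLOWING, Phase.CLOSED},
    Phase.MEDIA_FLOWING: {Phase.CLOSED},
    Phase.CLOSED: set(),
}


class TransceiverSession:
    """Holds per-session protocol state so backend services do not have to."""

    def __init__(self, session_id: str, on_audio: Callable[[bytes], None]):
        self.session_id = session_id
        self.phase = Phase.NEW
        self._on_audio = on_audio  # the backend's only point of contact

    def advance(self, phase: Phase) -> None:
        """Move to the next lifecycle phase, rejecting illegal jumps."""
        if phase not in _NEXT[self.phase]:
            raise ValueError(f"{self.phase.name} -> {phase.name} not allowed")
        self.phase = phase

    def deliver_audio(self, frame: bytes) -> bool:
        """Pass an already-decrypted frame to the backend once media flows."""
        if self.phase is not Phase.MEDIA_FLOWING:
            return False  # drop frames that arrive before the session is ready
        self._on_audio(frame)
        return True


frames = []
session = TransceiverSession("demo", frames.append)
for step in (Phase.ICE_CONNECTED, Phase.DTLS_ESTABLISHED, Phase.MEDIA_FLOWING):
    session.advance(step)
session.deliver_audio(b"\x00\x01")
print(session.phase.name, len(frames))
```

The backend here supplies a single callback and never sees handshake or encryption state, which is the kind of separation the quoted principle argues for.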

Trade-offs and Alternatives Considered

OpenAI evaluated several alternatives before settling on the relay-transceiver model:

  • Direct Per-Session UDP Exposure: While conventional for WebRTC, this approach pushes significant operational complexity into the infrastructure layer, especially with Kubernetes, due to challenges in managing large public port ranges safely and effectively.
  • TURN-style Relays: These introduce a heavier intermediary, solving a broader set of problems than required for OpenAI's predominantly 1:1 user-to-model sessions, adding unnecessary overhead.
  • SFU (Selective Forwarding Unit) Approach: Commonly used for multi-party conferencing, an SFU treats each participant (including the AI model) as an equal. However, for 1:1 sessions, the transceiver design provides a more efficient fit by treating the model as a backend service rather than another peer.

This decomposition preserves standard WebRTC protocol behavior at the edge while concentrating hard session state and scaling complexity in a manageable routing layer, which makes it a useful reference for architects building interactive media and AI systems at scale.

WebRTC · Voice AI · Low Latency · Kubernetes · Cloud Architecture · Real-time Systems · Distributed Media · OpenAI
