OpenAI has introduced a WebSocket-based execution mode for its Responses API, replacing the traditional HTTP request-response pattern with persistent, bidirectional connections. This architectural shift significantly reduces latency and improves throughput in multi-step agentic AI workflows by minimizing network round-trip times and connection overhead. It highlights the critical role of transport-layer optimization in modern distributed AI systems.
Read original on InfoQ

Agentic AI workflows, such as those found in coding agents or real-time AI systems, often involve multiple steps: tool calls, intermediate reasoning, and follow-up queries. Historically, each step required a separate HTTP request, leading to significant latency. As AI inference speeds improved, the repeated network round-trip times and connection handshakes became the primary bottleneck, impacting overall system performance and increasing operational complexity. This scenario underscores a common challenge in distributed systems where the overhead of communication protocols can overshadow the processing time of individual tasks.
To address this, OpenAI implemented a WebSocket-based execution mode. This change moves from a stateless, request-response model to a stateful, persistent, and bidirectional connection between the client and the server. WebSockets allow for continuous data exchange without the overhead of repeated handshakes and connection setups for each step. This approach is highly effective for applications requiring low-latency, real-time data streaming and frequent, interactive communication. Early production results show up to a 40% latency reduction and improved throughput, demonstrating the impact of transport-layer optimizations.
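The arithmetic behind the latency claim can be sketched with back-of-the-envelope numbers. The handshake, round-trip, and inference costs below are illustrative assumptions, not figures from OpenAI; the point is only that per-step connection setup scales linearly with the number of agent steps, while a persistent connection pays it once.

```python
# Hypothetical per-step costs, in milliseconds (illustrative assumptions only):
HANDSHAKE_MS = 50   # TCP + TLS connection setup
RTT_MS = 30         # network round trip for the request itself
INFERENCE_MS = 20   # model/tool work per step
steps = 10          # a multi-step agentic workflow

# HTTP request-response: every step pays for a fresh connection.
per_request_total = steps * (HANDSHAKE_MS + RTT_MS + INFERENCE_MS)

# Persistent WebSocket: one handshake, then only RTT + inference per step.
persistent_total = HANDSHAKE_MS + steps * (RTT_MS + INFERENCE_MS)

saving = 1 - persistent_total / per_request_total
print(per_request_total, persistent_total, f"{saving:.0%}")  # 1000 550 45%
```

With these made-up constants the persistent connection cuts total latency by 45%; the actual saving depends entirely on how handshake cost compares to inference time, which is why the benefit grew as inference got faster.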
System Design Insight
When designing systems with frequent, multi-step interactions or real-time streaming requirements, consider WebSockets over traditional HTTP/REST. While HTTP/2 and gRPC offer multiplexing, WebSockets provide a full-duplex, persistent connection ideal for truly interactive and low-latency scenarios, especially where stateful communication patterns are beneficial. However, they introduce complexities in connection lifecycle management, scalability, and handling backpressure.
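The core property described here, many request-response steps multiplexed over one persistent, full-duplex connection, can be sketched with Python's asyncio streams as a stand-in for a real WebSocket stack (this is not OpenAI's API; the server, port, and message framing are all invented for illustration):

```python
import asyncio

async def handle(reader, writer):
    # Server side: answer every request arriving on the same persistent
    # connection; readline() returns b"" when the client disconnects.
    while line := await reader.readline():
        writer.write(b"ack:" + line)
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def run_workflow():
    # Port 0 lets the OS pick a free port for this self-contained demo.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Client side: one connection setup, then every agent step reuses it.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    replies = []
    for step in [b"plan\n", b"tool_call\n", b"final\n"]:
        writer.write(step)                 # no per-step handshake
        await writer.drain()
        replies.append((await reader.readline()).decode().strip())

    writer.close()
    await writer.wait_closed()
    server.close()
    await server.wait_closed()
    return replies

print(asyncio.run(run_workflow()))  # ['ack:plan', 'ack:tool_call', 'ack:final']
```

A real WebSocket adds framing, ping/pong keepalives, and an HTTP upgrade handshake on top of this, but the latency profile is the same: connection cost is paid once, each subsequent step costs only a round trip.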
Adopting WebSockets for such critical pathways introduces new architectural considerations for developers and system architects. These include robust connection lifecycle management (handling disconnections, reconnections, and session state), implementing effective backpressure mechanisms under high concurrency to prevent server overload, and ensuring reliability in a distributed system context where stateful connections must be maintained. The shift aligns with established patterns for stateful distributed systems, requiring careful design of message queues, error handling, and horizontal scaling strategies.
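One of those considerations, backpressure, has a standard shape: a bounded buffer between the fast producer (the streaming connection) and the slower consumer, so that the sender is suspended rather than the server overrun. A minimal asyncio sketch, with event names and sizes invented for illustration:

```python
import asyncio

async def producer(queue):
    # Stands in for events arriving on the socket. put() suspends whenever
    # the bounded queue is full -- that suspension IS the backpressure,
    # which a real server would translate into pausing socket reads.
    for i in range(6):
        await queue.put(f"event-{i}")
    await queue.put(None)  # sentinel: stream finished

async def consumer(queue, handled):
    # Stands in for slower downstream work (persistence, tool execution).
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0)  # yield, simulating work per event
        handled.append(item)

async def main():
    handled = []
    queue = asyncio.Queue(maxsize=2)  # bounded: at most 2 in-flight events
    await asyncio.gather(producer(queue), consumer(queue, handled))
    return handled

print(asyncio.run(main()))  # ['event-0', 'event-1', ..., 'event-5']
```

The bound (`maxsize=2`) is the knob: larger buffers absorb bursts at the cost of memory per connection, which matters when thousands of stateful connections share one server.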