This article provides a post-mortem on the challenges faced when scaling a WebSocket-based live commentary platform from 100,000 to 1 million concurrent users. It details how an initially simple fan-out architecture led to Out-Of-Memory (OOM) kills caused by slow consumers and unhandled backpressure, and outlines the architectural changes implemented to achieve resilience and scalability, including ruthless message dropping and coalescing.
The article recounts a real-world incident where a WebSocket-based live commentary platform, designed with a standard fan-out pattern (Redis Pub/Sub to Go services broadcasting to clients), encountered severe memory issues and node crashes when scaling from 100,000 to 1 million users. The core problem was identified as backpressure failure caused by slow consumers rather than a general traffic spike or CPU bottleneck, leading to gigabytes of stale data accumulating in memory.
The initial design was straightforward: a writer service published score updates to Redis Pub/Sub, and a fleet of Go services subscribed to Redis, then fanned out these updates to connected WebSockets. Each client was given a buffered channel (256 messages) to absorb transient network lags. This design, while simple and effective at lower scales, became a "RAM hoarder" as the user base grew.
```go
type Client struct {
    hub  *Hub
    conn *websocket.Conn
    send chan []byte // Buffered channel: 256 slots
}

func (c *Client) writePump() {
    for {
        select {
        case message, ok := <-c.send:
            if !ok {
                // The hub closed the channel; send a close frame and stop.
                c.conn.WriteMessage(websocket.CloseMessage, []byte{})
                return
            }
            // This write to the WebSocket connection can block if the client is slow.
            // Meanwhile, the 'send' channel continues to fill up.
            w, err := c.conn.NextWriter(websocket.TextMessage)
            if err != nil {
                return
            }
            w.Write(message)
            w.Close() // flush the frame; without this the message never leaves the writer
        }
    }
}
```
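The snippet above shows only the per-client write side. The other half of the original design, subscribing to Redis Pub/Sub and fanning updates out to every connected client, is not reproduced in the post; a minimal sketch of what it plausibly looked like, assuming the go-redis v9 client (github.com/redis/go-redis/v9), a `Hub` that tracks registered clients, and an illustrative channel name:

```go
// Hub fans each update out to every registered client (fields are illustrative).
type Hub struct {
    clients   map[*Client]bool
    broadcast chan []byte
}

func (h *Hub) run() {
    for message := range h.broadcast {
        for client := range h.clients {
            // Original design: queue every update for every client.
            // A slow client keeps up to 256 stale ~1KB payloads pinned in RAM.
            client.send <- message
        }
    }
}

// subscribeScores pipes Redis Pub/Sub updates into the hub's broadcast channel.
// (Assumes rdb is a *redis.Client from github.com/redis/go-redis/v9 and ctx is a context.Context.)
func subscribeScores(ctx context.Context, rdb *redis.Client, hub *Hub) {
    pubsub := rdb.Subscribe(ctx, "score-updates")
    defer pubsub.Close()

    for msg := range pubsub.Channel() {
        hub.broadcast <- []byte(msg.Payload)
    }
}
```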
## The Network is Not a Uniform Pipe
The critical misconception was treating the internet as a uniform, fast pipe. Load tests with high-bandwidth clients masked the real-world impact of users on slow or unstable networks (e.g., 3G in crowded stadiums). When the TCP send buffer for a client's connection filled up because ACKs arrived slowly, the application-level `Write` call on the server would block. With buffered Go channels in front of each connection, messages queued up in memory, consuming significant RAM for users who were already far behind real-time data.
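One way to see this failure mode without a WebSocket stack at all is to simulate a stalled reader. The sketch below is an illustration, not code from the original article: it uses net.Pipe, whose writes block until the peer reads, as a stand-in for a TCP connection with a full send buffer, and a 256-slot channel mirroring the Client buffer above.

```go
package main

import (
    "fmt"
    "net"
)

func main() {
    // net.Pipe gives a synchronous in-memory connection: Write blocks until the
    // other side reads, which mimics a stalled TCP peer whose send buffer is full.
    serverSide, clientSide := net.Pipe()

    send := make(chan []byte, 256) // per-client buffer, as in the Client struct above

    // Writer goroutine: drains the channel and writes to the "socket".
    go func() {
        for msg := range send {
            serverSide.Write(msg) // blocks as soon as the client stops reading
        }
    }()

    // The client reads a single update and then stalls (think 3G in a packed stadium).
    go func() {
        buf := make([]byte, 1024)
        clientSide.Read(buf)
        // ...and never reads again.
    }()

    // The publisher keeps producing ~1KB score updates.
    payload := make([]byte, 1024)
    for i := 0; ; i++ {
        select {
        case send <- payload:
        default:
            fmt.Printf("buffer full after %d messages; ~%d KB of stale data pinned in RAM for one client\n",
                i, len(send))
            return
        }
    }
}
```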
The math was stark: 200,000 slow consumers, each with a 256-message buffer of ~1KB payloads, amounted to over 51GB of queued messages across the cluster, leading to OOM kills on 16GB nodes. The system was trying to save every packet for every user, ultimately crashing the platform for everyone.
The fix was to stop trying to save every packet: ruthlessly drop messages that slow consumers will never catch up on, and coalesce queued updates into the latest state. These changes resulted in handling 1.2 million concurrent users with stable memory usage and improved latency, demonstrating that effective WebSocket scaling is fundamentally about managing queues and backpressure, not just opening more sockets.
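The post does not include the final code, but the two techniques it names, dropping updates that slow clients will never catch up on and coalescing the queue down to the latest state, might look roughly like the sketch below. The one-slot `latest` channel and the method names are illustrative, not the authors' implementation:

```go
// Revised Client: keep only the most recent update instead of a deep queue.
type Client struct {
    conn   *websocket.Conn
    latest chan []byte // capacity 1: holds the newest snapshot only
}

// enqueue coalesces: if the client has not consumed the previous update yet,
// the stale one is discarded and replaced with the fresh one.
func (c *Client) enqueue(message []byte) {
    for {
        select {
        case c.latest <- message:
            return // queued
        default:
            // Buffer full: drop the stale update, then retry with the new one.
            select {
            case <-c.latest:
            default:
            }
        }
    }
}

// writePump now always sends the freshest state; slow clients simply skip
// intermediate updates instead of accumulating them.
func (c *Client) writePump() {
    for message := range c.latest {
        if err := c.conn.WriteMessage(websocket.TextMessage, message); err != nil {
            return
        }
    }
}
```

With a one-slot buffer like this, per-client memory is bounded by a single ~1KB payload, so the 200,000 slow consumers that previously pinned ~51GB of stale messages would hold on the order of 200MB, and every client that does read gets the freshest score rather than a backlog.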