This article details Veltrix's architectural evolution to support 20,000 concurrent players in a real-time treasure hunt, focusing on overcoming unbounded state issues from long-lived WebSocket connections. The solution involved a two-tier architecture, separating ephemeral WebSocket handling from stateful processing using Rust, Kafka, and RocksDB to significantly reduce memory footprint and improve stability.
Veltrix faced significant scalability issues in their real-time treasure hunt game due to unbounded state associated with 20,000 concurrent, long-lived WebSocket connections. Each connection consumed approximately 2.3 MB for TCP reassembly buffers, quickly exhausting kernel resources (e.g., hitting `somaxconn` limits) and leading to connection failures. Initial mitigation attempts, such as enabling `reuseport` or setting `SO_KEEPALIVE`, failed: they either left the fundamental problem of persistent WebSocket state untouched or broke game logic by prematurely closing active player sessions.
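To see why the per-connection cost matters, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes decimal units and that every connection holds the full 2.3 MB at the same time, neither of which the original confirms.

```python
# Back-of-the-envelope estimate of buffer memory across all connections.
# Assumptions (not from the original): decimal units (1 GB = 1000 MB) and
# every connection holding the full per-connection figure at once.
CONNECTIONS = 20_000
MB_PER_CONNECTION = 2.3


def total_buffer_gb(connections: int, mb_per_connection: float) -> float:
    """Return the aggregate buffer footprint in gigabytes."""
    return connections * mb_per_connection / 1000


print(f"{total_buffer_gb(CONNECTIONS, MB_PER_CONNECTION):.1f} GB")  # 46.0 GB
```

At roughly 46 GB for buffers alone, tuning individual socket options cannot fix the problem; the state held per connection has to shrink.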
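The core architectural decision was to split the WebSocket layer into two distinct tiers to decouple real-time communication from long-lived player state: an ephemeral tier that handles only the WebSocket connections, and a stateful processing tier, with Rust, Kafka, and RocksDB used to build the split.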
Key System Design Takeaway
Decoupling ephemeral communication channels from long-lived application state is a powerful pattern for scaling real-time systems. This allows the communication layer to remain lightweight and highly scalable, while state management can be handled by specialized, resilient services.
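The pattern can be sketched in a few lines. This is an illustration, not Veltrix's implementation: `queue.Queue` stands in for a Kafka topic, a plain dict stands in for the RocksDB store, and the names `Gateway`, `StateWorker`, and the `beacon` event field are hypothetical.

```python
import json
import queue


class Gateway:
    """Ephemeral tier: holds no player state, only forwards frames."""

    def __init__(self, event_log):
        self.event_log = event_log

    def on_frame(self, player_id, payload):
        # Serialize and hand off immediately; nothing is retained here,
        # so a dropped connection loses no game state.
        self.event_log.put(json.dumps({"player": player_id, **payload}))


class StateWorker:
    """Stateful tier: owns player state, built from the event log."""

    def __init__(self, event_log):
        self.event_log = event_log
        self.state = {}  # stand-in for the RocksDB store

    def drain(self):
        """Apply all pending events; return how many were processed."""
        processed = 0
        while True:
            try:
                event = json.loads(self.event_log.get_nowait())
            except queue.Empty:
                return processed
            player = self.state.setdefault(event["player"], {"beacons": []})
            player["beacons"].append(event["beacon"])
            processed += 1


log = queue.Queue()  # stand-in for a Kafka topic
gateway = Gateway(log)
worker = StateWorker(log)

gateway.on_frame("p1", {"beacon": "b7"})
gateway.on_frame("p1", {"beacon": "b9"})
worker.drain()
print(worker.state["p1"]["beacons"])  # ['b7', 'b9']
```

Because the gateway keeps nothing between frames, its memory use depends on in-flight messages rather than on how long players stay connected.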
RocksDB was chosen for its ability to maintain time-windowed state. Its compaction filter was configured to automatically delete keys older than 30 minutes, keeping the database size manageable even with 20,000 active sessions. This meant trading Redis's O(1) lookups for iterator-based range scans, but the observed throughput hit was negligible for their use case.
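The combination of a TTL filter and range scans can be modeled without RocksDB itself. The sketch below is a simplified in-memory stand-in, not the RocksDB API: `WindowedStore`, the key layout, and the explicit `now` timestamps are all assumptions made for illustration.

```python
import bisect

TTL_SECONDS = 30 * 60  # 30-minute window from the text


class WindowedStore:
    """Sorted key-value store with a TTL filter applied at compaction."""

    def __init__(self):
        self._keys = []    # kept sorted so prefix scans are contiguous
        self._values = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = (now, value)

    def compact(self, now):
        """Drop entries older than the TTL, as a compaction filter would."""
        live = [k for k in self._keys
                if now - self._values[k][0] <= TTL_SECONDS]
        self._values = {k: self._values[k] for k in live}
        self._keys = live

    def scan_prefix(self, prefix):
        """Iterator-style range scan over all keys starting with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._values[key][1]


store = WindowedStore()
store.put("session:p1:beacon:b7", "found", now=0)
store.put("session:p1:beacon:b9", "found", now=1500)
store.put("session:p2:beacon:b7", "found", now=1700)

store.compact(now=1900)  # p1's b7 entry is 1900 s old, past the 1800 s TTL
print(list(store.scan_prefix("session:p1:")))
# [('session:p1:beacon:b9', 'found')]
```

As in this model, a compaction filter removes expired keys only when compaction runs, so a reader that needs strict expiry would also check the timestamp at read time.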
A crucial lesson learned was the importance of separating TTLs for different types of state. Initially, beacon state and player session TTLs were conflated within the same RocksDB key. A recommended improvement was to split these into two stores: one for live sessions (30-minute TTL) and another for audit logs (90-day TTL), potentially leveraging RocksDB's `SstFileManager` to store audit logs on S3. This would avoid issues like the 4.7-second cache rebuild during worker restarts. The underlying principle: never let a real-time protocol own long-lived state that can be derived.
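A minimal sketch of the split, assuming two independent stores that differ only in their TTL. `TtlStore`, `record_beacon`, and the key names are hypothetical, and plain dicts stand in for the two RocksDB instances.

```python
SESSION_TTL = 30 * 60          # 30 minutes
AUDIT_TTL = 90 * 24 * 60 * 60  # 90 days


class TtlStore:
    """Key-value store where every entry shares one TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        self._entries[key] = (now, value)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None or now - entry[0] > self.ttl_seconds:
            return None  # missing or expired
        return entry[1]


sessions = TtlStore(SESSION_TTL)  # live session state
audit = TtlStore(AUDIT_TTL)       # long-lived audit trail


def record_beacon(player_id, beacon_id, now):
    """Write each fact to the store whose lifetime it actually needs."""
    sessions.put(f"session:{player_id}", {"last_beacon": beacon_id}, now)
    audit.put(f"audit:{player_id}:{now}", {"beacon": beacon_id}, now)


record_beacon("p1", "b7", now=0)
one_hour = 60 * 60
print(sessions.get("session:p1", now=one_hour))  # None
print(audit.get("audit:p1:0", now=one_hour))     # {'beacon': 'b7'}
```

With separate stores, expiring a session never touches the audit trail, and the session store stays small enough to rebuild quickly.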