Dev.to #systemdesign · April 1, 2026

Designing a Multi-Tenant WebSocket Service: Lessons from Building Apinator

This article details the architectural decisions and challenges in building a multi-tenant WebSocket service, Apinator, as an alternative to hosted real-time providers. It emphasizes the complexities beyond basic WebSocket communication, focusing on aspects like authentication, presence tracking, fanout, webhook delivery, and usage metering. The author shares insights into separating control and data planes, leveraging Redis for the hot path, and critical design choices for a robust real-time platform.


Building real-time features looks straightforward at first but quickly reveals significant architectural complexity. Beyond establishing a basic WebSocket connection, a production-grade real-time platform must reliably handle secure private channels, accurate user presence across multiple devices, efficient cross-node event fanout, resilient webhook delivery, and usage accounting for multi-tenant environments. The decision to build a custom solution rather than use a hosted provider often arises from the need for greater control over infrastructure, cost predictability, and tailored multi-tenant behavior.

Core Architectural Separation: Control Plane vs. Data Plane

A key architectural decision in building Apinator was the clear separation of concerns into a control plane and a data plane. This split is fundamental for scalability and reliability:

  • Control Plane: Manages administrative functions such as accounts, application configurations, and management APIs. This plane typically interacts with persistent storage like PostgreSQL, where consistency is paramount.
  • Data Plane: Handles the high-throughput, low-latency operations like WebSocket connections, event publishing, channel authentication, and real-time fanout. It is designed to be lean, with minimal dependencies on the critical path, primarily relying on Redis for fast data access and messaging.

Key Design Decisions for a Robust Real-time System

  • Redis for Fanout: Utilizes Redis Pub/Sub for efficient cross-node event broadcasting. This approach is crucial for horizontal scalability, ensuring events reach all subscribed clients across various instances. Reference-counted subscriptions further optimize resource usage by unsubscribing from channels with no active clients.
  • HMAC Authentication: Implements server-side HMAC (Hash-based Message Authentication Code) for private and presence channel subscriptions. This prevents clients from subscribing to restricted channels on their own say-so. The flow has four steps: the client requests a subscription, the backend authenticates the user, the backend signs the channel subscription, and the WebSocket service validates the signature.
  • Complex Presence Tracking: Acknowledges that presence is more than a simple list of online users. It involves handling connection lifecycle edge cases like reconnects, multiple tabs, network fluctuations, and defining when a user is truly 'present' to avoid misleading counts.
  • Transactional Webhook Delivery: Moves beyond simple HTTP POST requests for webhooks to a more resilient delivery model. This often involves a transactional outbox pattern and a dedicated worker with retry logic and exponential backoff, addressing issues like destination downtime, timeouts, and error responses, while also incorporating signature verification.
  • Early Integration of Usage Limits: Emphasizes that usage accounting and rate limiting are not add-on features for multi-tenant platforms but rather core design considerations. Integrating them early ensures they shape the system's architecture, particularly around atomic counters and avoiding bottlenecks in hot paths.
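
The HMAC flow in the list above can be sketched as follows. This is a minimal illustration, not Apinator's actual implementation: the signed message format (`socket_id:channel`), the SHA-256 digest, and the function names `sign_subscription` and `verify_subscription` are all assumptions made for the example.

```python
import hashlib
import hmac


def sign_subscription(app_secret: str, socket_id: str, channel: str) -> str:
    """Backend side: called only after the backend has authenticated the user.

    The "socket_id:channel" message format is an assumption for this sketch.
    """
    message = f"{socket_id}:{channel}".encode("utf-8")
    return hmac.new(app_secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_subscription(app_secret: str, socket_id: str, channel: str, signature: str) -> bool:
    """WebSocket service side: recompute the signature and compare in constant time."""
    expected = sign_subscription(app_secret, socket_id, channel)
    # Compare as bytes so a non-ASCII signature from a client cannot raise.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


if __name__ == "__main__":
    sig = sign_subscription("tenant-secret", "123.456", "private-orders")
    print(verify_subscription("tenant-secret", "123.456", "private-orders", sig))
    print(verify_subscription("tenant-secret", "123.456", "private-admin", sig))
```

Binding the signature to both the connection identifier and the channel name means a signature issued for one socket or channel cannot be replayed for another. In a multi-tenant setting the secret would presumably be per application, so one tenant's backend cannot sign for another's channels.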
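
Reference-counted subscriptions, mentioned under Redis for Fanout, might look roughly like this on each node. The class name `RefCountedSubscriptions` and the two callbacks are hypothetical; the callbacks stand in for issuing Redis SUBSCRIBE and UNSUBSCRIBE, so the sketch runs without a Redis server.

```python
from collections import defaultdict
from typing import Callable


class RefCountedSubscriptions:
    """Tracks local clients per channel and subscribes upstream only when needed."""

    def __init__(self, subscribe: Callable[[str], None], unsubscribe: Callable[[str], None]) -> None:
        self._subscribe = subscribe      # hypothetical hook: Redis SUBSCRIBE
        self._unsubscribe = unsubscribe  # hypothetical hook: Redis UNSUBSCRIBE
        self._clients = defaultdict(set)  # channel -> set of local client ids

    def add(self, channel: str, client_id: str) -> None:
        clients = self._clients[channel]
        if not clients:
            # First local client on this node: open the upstream subscription.
            self._subscribe(channel)
        clients.add(client_id)

    def remove(self, channel: str, client_id: str) -> None:
        clients = self._clients.get(channel)
        if not clients or client_id not in clients:
            return  # unknown channel or client: nothing to release
        clients.discard(client_id)
        if not clients:
            # Last local client left: drop the upstream subscription.
            del self._clients[channel]
            self._unsubscribe(channel)


if __name__ == "__main__":
    subs = RefCountedSubscriptions(
        subscribe=lambda ch: print("SUBSCRIBE", ch),
        unsubscribe=lambda ch: print("UNSUBSCRIBE", ch),
    )
    subs.add("room-1", "conn-a")
    subs.add("room-1", "conn-b")
    subs.remove("room-1", "conn-a")
    subs.remove("room-1", "conn-b")
```

Tracking a set of client ids rather than a bare integer makes duplicate add or remove calls harmless, which matters when close events arrive more than once.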

Operational Considerations

The article highlights that many real-time system bugs stem from complex connection lifecycle edge cases (reconnects, dropped sockets, race conditions during auth) rather than initial connection flows. It also stresses the importance of defining operational metrics early and writing failure-mode tests to build a truly resilient system.
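
As one illustration of a lifecycle edge case that lends itself to a failure-mode test, the sketch below tracks presence per user across several connections, so a second tab or a duplicate close event does not produce a misleading join or leave. The `PresenceTracker` class and its method names are hypothetical, and it keeps state in memory only; a multi-node deployment would need shared state.

```python
from collections import defaultdict


class PresenceTracker:
    """A user is present while at least one of their connections is open."""

    def __init__(self) -> None:
        self._connections = defaultdict(set)  # user_id -> set of connection ids

    def connect(self, user_id: str, connection_id: str) -> bool:
        """Return True only when the user goes from absent to present."""
        conns = self._connections[user_id]
        was_absent = not conns
        conns.add(connection_id)
        return was_absent

    def disconnect(self, user_id: str, connection_id: str) -> bool:
        """Return True only when the user's last connection goes away."""
        conns = self._connections.get(user_id)
        if not conns or connection_id not in conns:
            return False  # duplicate or late close event: ignore it
        conns.remove(connection_id)
        if conns:
            return False  # other tabs or devices are still connected
        del self._connections[user_id]
        return True

    def online_users(self) -> set:
        return set(self._connections)


if __name__ == "__main__":
    presence = PresenceTracker()
    print(presence.connect("alice", "tab-1"))     # first connection
    print(presence.connect("alice", "tab-2"))     # second tab
    print(presence.disconnect("alice", "tab-1"))  # one tab closes
    print(presence.disconnect("alice", "tab-2"))  # last tab closes
```

A failure-mode test for this would replay events in awkward orders (close before open, double close, reconnect with a new connection id) and assert that join and leave notifications fire exactly once per real transition.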

WebSockets · Real-time · Multi-tenancy · Redis · Microservices · Authentication · Fanout · System Architecture
