Design a Chat System (WhatsApp)
Real-time messaging: WebSocket connections, message delivery guarantees, group chats, presence indicators, media handling, and end-to-end encryption.
Problem Statement
A chat system allows users to send and receive messages in real time — 1:1 and group — with guarantees around delivery, ordering, and presence. At scale (WhatsApp handles 100 B messages/day), the core challenge is maintaining persistent, low-latency connections across billions of devices while ensuring no message is ever lost.
Requirements
| Functional | Non-Functional |
|---|---|
| 1:1 and group messaging (up to 500 members) | < 100 ms message delivery (P99) |
| Message delivery receipts (sent, delivered, read) | 99.99% availability |
| Online/offline presence indicators | Messages durable even when recipient is offline |
| Media attachments (images, video, audio) | Support 500 M+ daily active users |
| Message history / pagination | End-to-end encryption |
Capacity Estimation
| Metric | Estimate |
|---|---|
| Daily Active Users | 500 M |
| Messages / day | 100 B (200 msg/user/day) |
| Messages / second | ~1.16 M msg/sec average, ~2.3 M msg/sec peak (2x average) |
| Avg message size | 100 bytes (text) + metadata |
| Storage / day (text only) | ~10 TB/day |
| Active WebSocket connections | ~500 M concurrent |
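The figures in the table follow from a few multiplications. The sketch below reproduces them, assuming the table's inputs (500 M DAU, 200 messages per user per day, 100 bytes per message, peak at 2x average) and decimal terabytes; the function name `estimate_capacity` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate_capacity(dau, msgs_per_user_per_day, avg_msg_bytes, peak_factor):
    """Return (messages/day, average msg/sec, peak msg/sec, bytes/day)."""
    messages_per_day = dau * msgs_per_user_per_day
    avg_per_sec = messages_per_day / SECONDS_PER_DAY
    peak_per_sec = avg_per_sec * peak_factor
    bytes_per_day = messages_per_day * avg_msg_bytes
    return messages_per_day, avg_per_sec, peak_per_sec, bytes_per_day


msgs, avg_rate, peak_rate, storage = estimate_capacity(
    dau=500_000_000,
    msgs_per_user_per_day=200,
    avg_msg_bytes=100,
    peak_factor=2,
)

print(f"messages/day : {msgs / 1e9:.0f} B")         # 100 B
print(f"average rate : {avg_rate / 1e6:.2f} M/sec")  # 1.16 M/sec
print(f"peak rate    : {peak_rate / 1e6:.2f} M/sec") # 2.31 M/sec
print(f"text storage : {storage / 1e12:.1f} TB/day") # 10.0 TB/day
```

Metadata, replication, and indexes are not included, so real storage would be a multiple of the text-only number.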
High-Level Architecture
WebSocket Connection Management
Unlike HTTP request/response, chat requires persistent bidirectional connections. WebSockets are the standard choice. Each Chat Server maintains an in-memory map of `userId → WebSocket connection`. When a message arrives for a recipient, the system must find which Chat Server holds that user's connection — this is solved with a routing layer.
Connection Routing
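A user's connection is registered in Redis as a key-value entry: `chat:conn:{userId} → serverId`. When delivering a message, the sending server looks up the recipient's server ID and routes the message to that server via the message queue. If no entry exists, the user is offline, so the system triggers a push notification instead. The entry must be removed (or allowed to expire) when the connection closes; otherwise messages are routed to a server that no longer holds the socket.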
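The lookup-then-route decision can be sketched as follows. This is a minimal illustration, not a production design: a plain dictionary stands in for Redis so the example is self-contained, and the names `ConnectionRegistry` and `route_message` are hypothetical.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """In-memory stand-in for the shared mapping chat:conn:{userId} -> serverId."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(user_id: str) -> str:
        return f"chat:conn:{user_id}"

    def register(self, user_id: str, server_id: str) -> None:
        # Called when a WebSocket is accepted; a reconnect overwrites the old entry.
        self._store[self._key(user_id)] = server_id

    def unregister(self, user_id: str, server_id: str) -> None:
        # Delete only if this server still owns the entry, so a late disconnect
        # from an old server cannot erase a newer connection elsewhere.
        if self._store.get(self._key(user_id)) == server_id:
            del self._store[self._key(user_id)]

    def lookup(self, user_id: str) -> Optional[str]:
        return self._store.get(self._key(user_id))


def route_message(registry: ConnectionRegistry, recipient_id: str) -> str:
    """Return where the message should go: a server queue or the push path."""
    server_id = registry.lookup(recipient_id)
    if server_id is None:
        return "push_notification"  # recipient is offline
    return f"queue:{server_id}"     # forward to the server holding the socket


registry = ConnectionRegistry()
registry.register("alice", "chat-server-7")
print(route_message(registry, "alice"))  # queue:chat-server-7
print(route_message(registry, "bob"))    # push_notification
```

The ownership check in `unregister` matters when a user reconnects to a different server before the old server notices the dropped socket.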
Message Delivery Guarantees
WhatsApp uses a three-state delivery model: sent (server received), delivered (device received), read (user opened). This requires an explicit ACK protocol:
- Sender sends message → Chat Server assigns a `messageId`, persists it to Cassandra, and ACKs the sender (status `sent`).
- Server publishes to Kafka topic for recipient's shard.
- Recipient's Chat Server consumes and delivers over WebSocket.
- Device ACKs delivery → server updates message status to `delivered`.
- User opens conversation → device sends read receipt → status updates to `read`.
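Because ACKs travel over unreliable networks, they can arrive duplicated or out of order. One way to keep the status consistent is to treat it as a value that only moves forward. The sketch below assumes that a `read` receipt implies delivery, so a message may jump straight from `sent` to `read`; the class and method names are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Optional


class Status(IntEnum):
    SENT = 1       # server received and persisted the message
    DELIVERED = 2  # recipient device acknowledged receipt
    READ = 3       # recipient opened the conversation


class MessageTracker:
    """Tracks per-message status and only ever moves it forward."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}

    def on_persisted(self, message_id: str) -> None:
        # Do not downgrade a message that was somehow already acknowledged.
        self._status.setdefault(message_id, Status.SENT)

    def on_ack(self, message_id: str, new_status: Status) -> bool:
        """Apply an ACK; return True only if the stored status changed."""
        current = self._status.get(message_id)
        if current is None:
            return False  # ACK for an unknown message: ignore
        if new_status <= current:
            return False  # duplicate or stale ACK: ignore
        self._status[message_id] = new_status
        return True

    def status(self, message_id: str) -> Optional[Status]:
        return self._status.get(message_id)


tracker = MessageTracker()
tracker.on_persisted("m1")
tracker.on_ack("m1", Status.READ)       # read receipt arrives first
tracker.on_ack("m1", Status.DELIVERED)  # late delivery ACK is ignored
print(tracker.status("m1").name)        # READ
```

Making ACK handling idempotent this way lets the device retry ACKs freely without corrupting the stored state.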
Group Messaging
Group messages introduce a fan-out problem: one message must be delivered to N recipients. Two strategies exist:
| Strategy | Mechanism | Best For |
|---|---|---|
| Fan-out on write | Copy message to each member's inbox at send time | Small groups (< 100 members) |
| Fan-out on read | Store one copy; each client pulls on open | Large groups (100–500+ members) |
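A common design is a hybrid: fan-out on write for small groups via Kafka (partitioned by group ID so each group's messages stay ordered, rather than one partition per group, which would not scale to millions of groups), and a pull model for very large groups. The group membership list is typically stored in a dedicated Group Service backed by MySQL; ZooKeeper is sometimes suggested, but it is designed for small coordination metadata rather than data at this scale.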
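The hybrid choice can be sketched as a single send path that picks a strategy by group size. This is an illustration under assumptions: the 100-member threshold comes from the table above, in-memory dictionaries stand in for per-user inboxes and the shared group log, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FANOUT_WRITE_LIMIT = 100  # groups smaller than this use fan-out on write


class GroupFanout:
    def __init__(self) -> None:
        # Fan-out on write: one copy per recipient inbox.
        self.inboxes: Dict[str, List[Tuple[str, str, str]]] = defaultdict(list)
        # Fan-out on read: one copy per group, pulled by clients.
        self.group_log: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def send(self, group_id: str, sender_id: str, members: List[str], text: str) -> str:
        """Store the message and return which strategy was used."""
        if len(members) < FANOUT_WRITE_LIMIT:
            for member in members:
                if member != sender_id:  # the sender already has the message
                    self.inboxes[member].append((group_id, sender_id, text))
            return "write"
        self.group_log[group_id].append((sender_id, text))
        return "read"

    def pull(self, group_id: str, offset: int = 0) -> List[Tuple[str, str]]:
        """Client-side pull for large groups, starting at the client's last offset."""
        return self.group_log.get(group_id, [])[offset:]


fanout = GroupFanout()
print(fanout.send("family", "alice", ["alice", "bob", "carol"], "hi"))    # write
big_group = [f"user{i}" for i in range(300)]
print(fanout.send("community", "user0", big_group, "announcement"))       # read
```

The trade-off is write amplification for small groups (N - 1 copies per message) against extra read work for large ones.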
Presence Service
Presence (online/offline/last seen) is managed by a dedicated service backed by Redis. While connected, a device sends a heartbeat every 5 seconds. If no heartbeat is received for 30 seconds, the user is marked offline. The key challenge is the fan-out spike, a form of thundering herd: if a celebrity goes online, millions of subscribers want to be notified at once. Solutions include fan-out on write to a queue, or delivering presence lazily (only when the subscriber opens the conversation).
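The heartbeat rule can be expressed as a "last seen" timestamp per user compared against the 30-second timeout. The sketch below keeps timestamps in a dictionary and takes an injectable clock so the timeout can be exercised without waiting; `PresenceTracker` is a hypothetical name, and a Redis-backed version would more likely rely on key expiry than on explicit comparisons.

```python
import time
from typing import Callable, Dict, Optional

HEARTBEAT_INTERVAL_S = 5  # how often a connected device is expected to ping
OFFLINE_AFTER_S = 30      # silence longer than this marks the user offline


class PresenceTracker:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        """Record that a heartbeat just arrived from this user's device."""
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        last = self._last_seen.get(user_id)
        if last is None:
            return False  # never seen
        return self._clock() - last < OFFLINE_AFTER_S

    def last_seen(self, user_id: str) -> Optional[float]:
        return self._last_seen.get(user_id)


# Usage with a fake clock so the timeout is observable immediately.
now = [0.0]
presence = PresenceTracker(clock=lambda: now[0])
presence.heartbeat("alice")
now[0] = 10.0
print(presence.is_online("alice"))  # True
now[0] = 45.0
print(presence.is_online("alice"))  # False
```

With a 5-second interval and a 30-second timeout, a device can miss several consecutive heartbeats before being marked offline, which absorbs brief network stalls.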
Message Storage: Why Cassandra?
Cassandra is well suited to message storage because: (1) it handles time-series workloads well, since clustering keys keep rows sorted within a partition; (2) it scales close to linearly as nodes are added; (3) it offers tunable consistency. The schema uses `conversation_id` as the partition key and `message_id` as the clustering key, so all messages in a conversation live in one partition, sorted by time, which enables efficient range scans for pagination.
```sql
-- Cassandra CQL schema (simplified)
CREATE TABLE messages (
    conversation_id UUID,
    message_id      TIMEUUID,  -- time-ordered UUID for sorting
    sender_id       UUID,
    content         TEXT,
    media_url       TEXT,
    status          TEXT,      -- 'sent' | 'delivered' | 'read'
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
Scaling Considerations
- Horizontal Chat Servers: stateless except for in-memory connection map; scale independently behind a connection-aware load balancer (consistent hashing by userId).
- Kafka partitioning: partition by `conversationId` to maintain message ordering within a conversation.
- Media via CDN: never serve media through Chat Servers — upload to S3, serve via CloudFront. Generate pre-signed URLs for secure access.
- End-to-end encryption: generate key pairs on device; server never sees plaintext. Use Signal Protocol (Double Ratchet + X3DH key agreement).
- Rate limiting: enforce per-user message rate limits at the Chat Server layer to prevent spam.
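The per-user rate limit in the last bullet is often implemented as a token bucket. The sketch below is one possible shape, not a prescribed design: the capacity and refill values are arbitrary assumptions, the class names are hypothetical, and state is kept in process memory, which works only if a user's traffic always reaches the same Chat Server.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PerUserRateLimiter:
    def __init__(self, capacity: int = 20, refill_per_sec: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[user_id] = bucket
        return bucket.allow()


limiter = PerUserRateLimiter(capacity=3, refill_per_sec=1.0)
print([limiter.allow("spammer") for _ in range(3)])  # [True, True, True]
```

A burst up to the bucket capacity passes immediately, and further sends are accepted only as tokens refill.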
Interview Tip
This problem often trips candidates who try to use HTTP polling or SSE instead of WebSockets. Be explicit about WHY WebSockets: bidirectional, persistent, low-overhead framing. Also state that the routing map (userId → serverId) must live in a shared store such as Redis, not only in each server's local memory; otherwise you can't route cross-server messages.