Design a Chat System (WhatsApp)

Real-time messaging: WebSocket connections, message delivery guarantees, group chats, presence indicators, media handling, and end-to-end encryption.

25 min read · High interview weight

Problem Statement

A chat system lets users send and receive messages in real time, both 1:1 and in groups, with guarantees around delivery, ordering, and presence. At scale (WhatsApp handles 100 B messages/day), the core challenge is maintaining persistent, low-latency connections across billions of devices while ensuring no message is ever lost.

Requirements

| Functional | Non-Functional |
|---|---|
| 1:1 and group messaging (up to 500 members) | < 100 ms message delivery (P99) |
| Message delivery receipts (sent, delivered, read) | 99.99% availability |
| Online/offline presence indicators | Messages durable even when recipient is offline |
| Media attachments (images, video, audio) | Support 1 B+ daily active users |
| Message history / pagination | End-to-end encryption |

Capacity Estimation

| Metric | Estimate |
|---|---|
| Daily Active Users | 500 M |
| Messages / day | 100 B (200 msg/user/day) |
| Messages / second (peak 2x avg) | ~2.3 M msg/sec |
| Avg message size | 100 bytes (text) + metadata |
| Storage / day (text only) | ~10 TB/day |
| Active WebSocket connections | ~500 M concurrent |
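
The estimates above follow from simple arithmetic. The sketch below reproduces them; the constant names are illustrative, and the 2x peak factor and 100-byte text size are the table's own assumptions.

```python
# Back-of-envelope check of the capacity table (text payload only, no metadata).
DAILY_ACTIVE_USERS = 500_000_000
MESSAGES_PER_USER_PER_DAY = 200
AVG_MESSAGE_BYTES = 100      # text only
PEAK_FACTOR = 2              # peak traffic assumed to be 2x the daily average
SECONDS_PER_DAY = 86_400

messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * PEAK_FACTOR
storage_per_day_tb = messages_per_day * AVG_MESSAGE_BYTES / 1e12  # bytes -> TB

print(f"messages/day:      {messages_per_day:,}")
print(f"avg msg/sec:       {avg_msgs_per_sec:,.0f}")
print(f"peak msg/sec:      {peak_msgs_per_sec:,.0f}")
print(f"text storage/day:  {storage_per_day_tb:.1f} TB")
```

The average rate works out to roughly 1.16 M msg/sec, which doubles to the ~2.3 M msg/sec peak quoted in the table.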

High-Level Architecture

[Figure: Chat system high-level architecture]

WebSocket Connection Management

Unlike HTTP request/response, chat requires persistent bidirectional connections. WebSockets are the standard choice. Each Chat Server maintains an in-memory map of `userId → WebSocket connection`. When a message arrives for a recipient, the system must find which Chat Server holds that user's connection; this is solved with a routing layer.

ℹ️

Connection Routing

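A user's connection is registered in Redis under a per-user key: `chat:conn:{userId} → serverId`. When delivering a message, the sending server looks up the recipient's server ID and routes via the message queue. If no entry exists, the user is offline, so the server triggers a push notification instead. The entry should be removed (or allowed to expire) when the connection closes, so that stale mappings do not swallow messages.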
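
The lookup-then-route decision can be sketched as follows. This is a minimal illustration, not a production design: `ConnectionRegistry` is a hypothetical in-memory stand-in for the shared Redis store, and `route_message` and its return values are names invented for this example.

```python
from typing import Dict, Optional, Tuple


class ConnectionRegistry:
    """In-memory stand-in for the shared store (chat:conn:{userId} -> serverId)."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    @staticmethod
    def _key(user_id: str) -> str:
        return f"chat:conn:{user_id}"

    def register(self, user_id: str, server_id: str) -> None:
        # Called when a WebSocket connection is established.
        self._entries[self._key(user_id)] = server_id

    def unregister(self, user_id: str) -> None:
        # Called when the connection closes.
        self._entries.pop(self._key(user_id), None)

    def lookup(self, user_id: str) -> Optional[str]:
        return self._entries.get(self._key(user_id))


def route_message(registry: ConnectionRegistry, recipient_id: str) -> Tuple[str, str]:
    """Decide how to deliver: forward to the holding server, or fall back to push."""
    server_id = registry.lookup(recipient_id)
    if server_id is None:
        return ("push_notification", recipient_id)  # recipient is offline
    return ("forward", server_id)                   # route via the message queue
```

For example, after `registry.register("alice", "chat-7")`, routing to `"alice"` yields `("forward", "chat-7")`, while routing to an unregistered user falls back to a push notification.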

Message Delivery Guarantees

WhatsApp uses a three-state delivery model: sent (server received), delivered (device received), read (user opened). This requires an explicit ACK protocol:

  1. Sender sends message → Chat Server assigns a `messageId` and persists to Cassandra.
  2. Server publishes to Kafka topic for recipient's shard.
  3. Recipient's Chat Server consumes and delivers over WebSocket.
  4. Device ACKs delivery → server updates message status to `delivered`.
  5. User opens conversation → device sends read receipt → status updates to `read`.

[Figure: End-to-end message delivery with receipts]
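
One way to model the status updates in the steps above is a forward-only state machine, so that duplicate or late ACKs (which are likely with retries) never move a message backwards. The sketch below is an assumption about how this could be done; `Status` and `MessageStore` are hypothetical names, and a dictionary stands in for the Cassandra table.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    SENT = 1       # server received and persisted the message
    DELIVERED = 2  # recipient device ACKed delivery
    READ = 3       # recipient opened the conversation


class MessageStore:
    """Tracks message status; a dict stands in for the persistent store."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}

    def persist(self, message_id: str) -> None:
        self._status[message_id] = Status.SENT

    def ack(self, message_id: str, new_status: Status) -> Status:
        # Only move forward: duplicate or out-of-order ACKs are ignored.
        if new_status > self._status[message_id]:
            self._status[message_id] = new_status
        return self._status[message_id]
```

With this rule, a `DELIVERED` ACK that arrives after a `READ` receipt leaves the status at `READ`, which makes ACK handling idempotent.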

Group Messaging

Group messages introduce a fan-out problem: one message must be delivered to N recipients. Two strategies exist:

| Strategy | Mechanism | Best For |
|---|---|---|
| Fan-out on write | Copy message to each member's inbox at send time | Small groups (< 100 members) |
| Fan-out on read | Store one copy; each client pulls on open | Large groups (100–500+ members) |

A practical design uses a hybrid: fan-out on write for small groups via Kafka (partitioned by group ID, so each group's messages stay ordered within a single partition), and a pull model for very large groups. The group membership list is stored in ZooKeeper or a dedicated Group Service backed by MySQL.
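
The hybrid choice can be sketched as a size check at send time. This is an illustrative sketch only: the 100-member threshold comes from the table above, while `deliver_group_message`, the per-user `inboxes` dictionary, and the shared `group_log` list are hypothetical stand-ins for real inbox and group storage.

```python
from typing import Dict, List

SMALL_GROUP_LIMIT = 100  # threshold taken from the strategy table


def deliver_group_message(
    message_id: str,
    sender_id: str,
    member_ids: List[str],
    inboxes: Dict[str, List[str]],
    group_log: List[str],
) -> str:
    """Pick a fan-out strategy based on group size and apply it."""
    if len(member_ids) < SMALL_GROUP_LIMIT:
        # Fan-out on write: copy a reference into every other member's inbox.
        for member in member_ids:
            if member != sender_id:
                inboxes.setdefault(member, []).append(message_id)
        return "fan_out_on_write"
    # Fan-out on read: store one copy; members pull it when they open the group.
    group_log.append(message_id)
    return "fan_out_on_read"
```

The write path costs O(N) per message but makes reads trivial; the read path keeps writes O(1) and shifts the cost to members who actually open the conversation.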

Presence Service

Presence (online/offline/last seen) is managed by a dedicated service backed by Redis. While connected, a device sends a heartbeat every 5 seconds. If no heartbeat is received for 30 seconds, the user is marked offline. The key challenge is fan-out (a thundering-herd effect): if a celebrity comes online, millions of subscribers want to be notified at once. Solutions include fanning out presence updates through a queue, or delivering presence lazily (only when a subscriber opens the conversation).
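
The heartbeat timeout can be sketched as a last-seen timestamp per user. This is a minimal in-memory illustration with hypothetical names (`PresenceTracker`, `heartbeat`, `is_online`); in a Redis-backed service the same effect is often achieved by refreshing a key with a 30-second expiry on every heartbeat, which is an assumption here rather than something the design above specifies.

```python
from typing import Dict, Optional

HEARTBEAT_INTERVAL_S = 5   # how often a connected device pings
OFFLINE_AFTER_S = 30       # silence longer than this marks the user offline


class PresenceTracker:
    """Tracks the last heartbeat time per user; timestamps are in seconds."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        self._last_seen[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_seen.get(user_id)
        return last is not None and (now - last) < OFFLINE_AFTER_S

    def last_seen(self, user_id: str) -> Optional[float]:
        # Doubles as the "last seen" value shown to other users.
        return self._last_seen.get(user_id)
```

Passing `now` explicitly keeps the logic testable; a real service would read a clock. With a 5-second interval and a 30-second timeout, a user survives up to five missed heartbeats before being marked offline.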

Message Storage: Why Cassandra?

Cassandra is a good fit for message storage because: (1) clustering keys keep rows sorted within a partition, which suits time-series workloads; (2) it scales linearly as nodes are added; (3) it offers tunable consistency. The schema uses `conversation_id` as the partition key and `message_id` as the clustering key, enabling efficient range scans for pagination.

sql
-- Cassandra CQL schema (simplified)
CREATE TABLE messages (
  conversation_id UUID,
  message_id      TIMEUUID,    -- time-ordered UUID for sorting
  sender_id       UUID,
  content         TEXT,
  media_url       TEXT,
  status          TEXT,        -- 'sent' | 'delivered' | 'read'
  PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
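
With this schema, paging backwards through history roughly corresponds to a query of the form `SELECT * FROM messages WHERE conversation_id = ? AND message_id < ? LIMIT ?`, where the cursor is the oldest `message_id` from the previous page. The sketch below simulates that cursor logic in memory; integers stand in for TIMEUUIDs, and `fetch_page` is a hypothetical helper, not a driver API.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (message_id, content); ints stand in for TIMEUUIDs


def fetch_page(
    messages_desc: List[Row],
    before_id: Optional[int] = None,
    limit: int = 3,
) -> Tuple[List[Row], Optional[int]]:
    """Return one page of a conversation, newest first, plus the next cursor.

    messages_desc must already be sorted by message_id descending, mirroring
    CLUSTERING ORDER BY (message_id DESC).
    """
    rows = [r for r in messages_desc if before_id is None or r[0] < before_id]
    page = rows[:limit]
    # A full page means there may be more; a short page means we hit the end.
    next_cursor = page[-1][0] if len(page) == limit else None
    return page, next_cursor
```

The client keeps calling `fetch_page` with the returned cursor until it receives `None`. Because the cursor is a message ID rather than an offset, new messages arriving at the head of the conversation do not shift later pages.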

Scaling Considerations

  • Horizontal Chat Servers: stateless except for in-memory connection map; scale independently behind a connection-aware load balancer (consistent hashing by userId).
  • Kafka partitioning: partition by `conversationId` to maintain message ordering within a conversation.
  • Media via CDN: never serve media through Chat Servers. Upload to S3 and serve via CloudFront. Generate pre-signed URLs for secure access.
  • End-to-end encryption: generate key pairs on device; server never sees plaintext. Use Signal Protocol (Double Ratchet + X3DH key agreement).
  • Rate limiting: enforce per-user message rate limits at the Chat Server layer to prevent spam.
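
The consistent hashing mentioned for the load balancer can be sketched with a sorted ring of virtual nodes. This is a simplified illustration under stated assumptions: `HashRing` is a hypothetical class, SHA-256 is an arbitrary hash choice, and 100 virtual nodes per server is an illustrative default rather than a recommendation.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Maps user IDs to chat servers using consistent hashing with virtual nodes."""

    def __init__(self, servers: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (hash, server) pairs
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, server: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, user_id: str) -> str:
        # First virtual node at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_right(self._ring, (self._hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters for connection stability is that removing a server only remaps the users that were on it; everyone else keeps their server and therefore their WebSocket connection.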

Interview Tip

This problem often trips candidates who try to use HTTP polling or SSE instead of WebSockets. Be explicit about WHY WebSockets: bidirectional, persistent, low-overhead framing. Also state that the routing map (`userId → serverId`) must live in a shared store such as Redis, not only in each server's local memory; otherwise you can't route cross-server messages.
