Design a Chat System (WhatsApp)

Real-time messaging: WebSocket connections, message delivery guarantees, group chats, presence indicators, media handling, and end-to-end encryption.

Problem Statement

A chat system allows users to send and receive messages in real time — 1:1 and group — with guarantees around delivery, ordering, and presence. At scale (WhatsApp handles 100 B messages/day), the core challenge is maintaining persistent, low-latency connections across billions of devices while ensuring no message is ever lost.

Requirements

| Functional | Non-Functional |
|---|---|
| 1:1 and group messaging (up to 500 members) | < 100 ms message delivery (P99) |
| Message delivery receipts (sent, delivered, read) | 99.99% availability |
| Online/offline presence indicators | Messages durable even when recipient is offline |
| Media attachments (images, video, audio) | Support 1 B+ daily active users |
| Message history / pagination | End-to-end encryption |

Capacity Estimation

| Metric | Estimate |
|---|---|
| Daily Active Users | 500 M |
| Messages / day | 100 B (200 msg/user/day) |
| Messages / second (peak 2x avg) | ~2.3 M msg/sec |
| Avg message size | 100 bytes (text) + metadata |
| Storage / day (text only) | ~10 TB/day |
| Active WebSocket connections | ~500 M concurrent |
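
The estimates above follow from a few multiplications. A minimal sketch of the arithmetic, using only the table's own inputs (the variable names are illustrative, and decimal terabytes are assumed):

```python
# Back-of-the-envelope check of the capacity table.
SECONDS_PER_DAY = 24 * 60 * 60

daily_active_users = 500_000_000
messages_per_user_per_day = 200
avg_text_bytes = 100   # text payload only; metadata is excluded
peak_factor = 2        # peak traffic assumed to be 2x the daily average

messages_per_day = daily_active_users * messages_per_user_per_day
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * peak_factor
text_storage_tb_per_day = messages_per_day * avg_text_bytes / 1e12  # decimal TB

print(f"messages/day:       {messages_per_day:,}")
print(f"avg messages/sec:   {avg_msgs_per_sec:,.0f}")
print(f"peak messages/sec:  {peak_msgs_per_sec:,.0f}")
print(f"text storage/day:   {text_storage_tb_per_day:.1f} TB")
```

The average rate is roughly 1.16 M msg/sec, so the ~2.3 M figure in the table is the peak, not the average.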

High-Level Architecture

[Diagram: Chat system high-level architecture]

WebSocket Connection Management

Unlike HTTP request/response, chat requires persistent bidirectional connections. WebSockets are the standard choice. Each Chat Server maintains an in-memory map of `userId → WebSocket connection`. When a message arrives for a recipient, the system must find which Chat Server holds that user's connection — this is solved with a routing layer.

ℹ️

Connection Routing

A user's connection is registered in Redis as a key-value entry: `chat:conn:{userId} → serverId`. The entry is written when the WebSocket opens and removed (or left to expire) when it closes, so a missing entry means the user is offline. When delivering a message, the sending server looks up the recipient's server ID and routes via the message queue. If no entry exists, trigger a push notification instead.
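
The lookup-then-route decision can be sketched as follows. This is a minimal illustration, not a production design: a plain dictionary stands in for Redis, and `ConnectionRegistry` and `route_message` are hypothetical names introduced here.

```python
from typing import Optional


class ConnectionRegistry:
    """In-memory stand-in for the shared Redis mapping chat:conn:{userId} -> serverId."""

    def __init__(self) -> None:
        self._store = {}

    def register(self, user_id: str, server_id: str) -> None:
        # Called when a user's WebSocket connection is established.
        self._store[f"chat:conn:{user_id}"] = server_id

    def unregister(self, user_id: str) -> None:
        # Called when the connection closes; a missing key means "offline".
        self._store.pop(f"chat:conn:{user_id}", None)

    def lookup(self, user_id: str) -> Optional[str]:
        return self._store.get(f"chat:conn:{user_id}")


def route_message(registry: ConnectionRegistry, recipient_id: str) -> tuple:
    """Return ("queue", server_id) for online users, ("push", recipient_id) otherwise."""
    server_id = registry.lookup(recipient_id)
    if server_id is None:
        return ("push", recipient_id)   # offline: fall back to a push notification
    return ("queue", server_id)         # online: hand off to that server via the queue


registry = ConnectionRegistry()
registry.register("alice", "chat-server-7")
print(route_message(registry, "alice"))  # routed to alice's chat server
print(route_message(registry, "bob"))    # bob has no entry, so a push is sent
```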

Message Delivery Guarantees

WhatsApp uses a three-state delivery model: sent (server received), delivered (device received), read (user opened). This requires an explicit ACK protocol:

  1. Sender sends message → Chat Server assigns a `messageId` and persists to Cassandra.
  2. Server publishes to Kafka topic for recipient's shard.
  3. Recipient's Chat Server consumes and delivers over WebSocket.
  4. Device ACKs delivery → server updates message status to `delivered`.
  5. User opens conversation → device sends read receipt → status updates to `read`.

[Diagram: End-to-end message delivery with receipts]
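
One way to make the ACK steps robust is to treat the status as a value that only moves forward, so duplicate or late ACKs cannot regress it. The sketch below assumes an in-memory store in place of Cassandra; `MessageStore` and its method names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1       # server received and persisted the message
    DELIVERED = 2  # recipient device acknowledged receipt
    READ = 3       # recipient opened the conversation


class MessageStore:
    """In-memory stand-in for the persisted message status."""

    def __init__(self) -> None:
        self._status = {}

    def on_server_received(self, message_id: str) -> Status:
        # Step 1: the first persist marks the message as sent; sender retries are no-ops.
        return self._status.setdefault(message_id, Status.SENT)

    def on_ack(self, message_id: str, ack: Status) -> Status:
        # Steps 4-5: only advance the status, never move it backwards.
        current = self._status.get(message_id)
        if current is None:
            raise KeyError(f"unknown message {message_id}")
        if ack > current:
            self._status[message_id] = ack
        return self._status[message_id]


store = MessageStore()
store.on_server_received("m1")
store.on_ack("m1", Status.DELIVERED)
store.on_ack("m1", Status.READ)
print(store.on_ack("m1", Status.DELIVERED).name)  # a late delivery ACK leaves it at READ
```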

Group Messaging

Group messages introduce a fan-out problem: one message must be delivered to N recipients. Two strategies exist:

| Strategy | Mechanism | Best For |
|---|---|---|
| Fan-out on write | Copy message to each member's inbox at send time | Small groups (< 100 members) |
| Fan-out on read | Store one copy; each client pulls on open | Large groups (100–500+ members) |

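A hybrid works well here: fan-out on write for small groups via Kafka (messages keyed by group ID, so each group's messages land on the same partition and keep their order), and a pull model for very large groups. A literal one-partition-per-group layout does not scale, because Kafka clusters support far fewer partitions than a chat system has groups. The group membership list is best stored in a dedicated Group Service backed by MySQL; ZooKeeper is sometimes suggested, but it is built for small coordination data rather than millions of membership lists.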
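
A minimal sketch of the hybrid, assuming in-memory inboxes and logs in place of Kafka and the message store. The 100-member threshold comes from the table above; `GroupMessaging` and its fields are hypothetical names.

```python
from collections import defaultdict

FANOUT_WRITE_LIMIT = 100  # groups below this size use fan-out on write


class GroupMessaging:
    def __init__(self, members_by_group: dict) -> None:
        self.members_by_group = members_by_group
        self.inboxes = defaultdict(list)     # per-user inbox (fan-out on write)
        self.group_logs = defaultdict(list)  # one shared log per large group

    def send(self, group_id: str, sender_id: str, text: str) -> str:
        members = self.members_by_group[group_id]
        if len(members) < FANOUT_WRITE_LIMIT:
            # Small group: copy the message into every other member's inbox now.
            for member in members:
                if member != sender_id:
                    self.inboxes[member].append(text)
            return "fanout_on_write"
        # Large group: store one copy; readers pull it later.
        self.group_logs[group_id].append(text)
        return "fanout_on_read"

    def fetch(self, user_id: str, group_id: str, offset: int = 0) -> list:
        # Large groups: each reader pulls from the shared log at its own offset.
        if user_id not in self.members_by_group[group_id]:
            raise PermissionError(f"{user_id} is not a member of {group_id}")
        return self.group_logs[group_id][offset:]


gm = GroupMessaging({"family": ["a", "b", "c"], "town": [f"u{i}" for i in range(150)]})
print(gm.send("family", "a", "dinner at 7"))  # small group: written to b's and c's inboxes
print(gm.send("town", "u0", "road closed"))   # large group: one copy in the shared log
```

The write path costs O(N) per message for small groups but makes reads trivial; the read path keeps writes O(1) at the cost of each client tracking its own offset.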

Presence Service

Presence (online/offline/last seen) is managed by a dedicated service backed by Redis. While a device is connected, it sends a heartbeat every 5 seconds. If no heartbeat is received for 30 seconds, the user is marked offline.

The key challenge is presence fan-out, which behaves like a thundering herd: if a celebrity comes online, millions of subscribers want to be notified at once. Solutions include fan-out on write to a queue, or delivering presence lazily (only when the subscriber opens the conversation).
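
The heartbeat rule can be sketched with a last-seen timestamp per user. This is an in-memory illustration with hypothetical names; in a Redis-backed version, one option would be to refresh a key with a 30-second expiry on every heartbeat, so that the key's absence means offline. The clock is passed in explicitly to keep the example deterministic.

```python
from typing import Optional

HEARTBEAT_INTERVAL_S = 5  # how often a connected client sends a heartbeat
OFFLINE_AFTER_S = 30      # no heartbeat for this long means the user is offline


class PresenceTracker:
    """In-memory stand-in for a Redis-backed presence store."""

    def __init__(self) -> None:
        self._last_seen = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Record the time of the most recent heartbeat.
        self._last_seen[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_seen.get(user_id)
        return last is not None and (now - last) < OFFLINE_AFTER_S

    def last_seen(self, user_id: str) -> Optional[float]:
        # Lazy presence: callers read this only when a conversation is opened.
        return self._last_seen.get(user_id)


presence = PresenceTracker()
presence.heartbeat("alice", now=100.0)
print(presence.is_online("alice", now=110.0))  # within the 30-second window
print(presence.is_online("alice", now=131.0))  # window elapsed without a heartbeat
```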

Message Storage: Why Cassandra?

Cassandra is a good fit for message storage because: (1) clustering keys give it natural support for time-ordered data; (2) it scales close to linearly as nodes are added; (3) it offers tunable consistency. The schema uses `conversation_id` as the partition key and `message_id` (a TIMEUUID) as the clustering key, so all messages in a conversation sit in one partition in time order, enabling efficient range scans for pagination.

```sql
-- Cassandra CQL schema (simplified)
CREATE TABLE messages (
  conversation_id UUID,
  message_id      TIMEUUID,    -- time-ordered UUID for sorting
  sender_id       UUID,
  content         TEXT,
  media_url       TEXT,
  status          TEXT,        -- 'sent' | 'delivered' | 'read'
  PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
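
With this layout, a history page is a range scan within one partition: newest first, then "everything older than the last message I have". In CQL that would be roughly a `SELECT` filtered on `conversation_id` with `message_id < ?` and a `LIMIT`. The sketch below simulates the same cursor logic in memory; integer IDs stand in for TIMEUUIDs, and `fetch_page` is a hypothetical helper.

```python
from typing import Optional

# One partition per conversation, rows kept newest-first,
# mirroring CLUSTERING ORDER BY (message_id DESC).
partitions = {
    "conv-1": [(mid, f"message {mid}") for mid in range(10, 0, -1)],
}


def fetch_page(conversation_id: str, limit: int, before: Optional[int] = None) -> list:
    """Return up to `limit` messages older than `before` (newest ones if `before` is None)."""
    rows = partitions.get(conversation_id, [])
    if before is not None:
        rows = [row for row in rows if row[0] < before]
    return rows[:limit]


first_page = fetch_page("conv-1", limit=3)
cursor = first_page[-1][0]  # oldest message ID on the page becomes the cursor
second_page = fetch_page("conv-1", limit=3, before=cursor)
print([mid for mid, _ in first_page])
print([mid for mid, _ in second_page])
```

Cursor-based paging stays stable when new messages arrive, whereas offset-based paging would shift every page.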

Scaling Considerations

  • Horizontal Chat Servers: stateless except for in-memory connection map; scale independently behind a connection-aware load balancer (consistent hashing by userId).
  • Kafka partitioning: partition by `conversationId` to maintain message ordering within a conversation.
  • Media via CDN: never serve media through Chat Servers — upload to S3, serve via CloudFront. Generate pre-signed URLs for secure access.
  • End-to-end encryption: generate key pairs on device; server never sees plaintext. Use Signal Protocol (Double Ratchet + X3DH key agreement).
  • Rate limiting: enforce per-user message rate limits at the Chat Server layer to prevent spam.
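
The consistent hashing mentioned for the load balancer can be sketched as a hash ring with virtual nodes. This is an illustration under assumptions: the server names, the virtual-node count, and the choice of SHA-256 are arbitrary, and `HashRing` is a hypothetical class.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit position on the ring derived from SHA-256.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring mapping userId -> chat server."""

    def __init__(self, servers: list, vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Each server owns many points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("no chat servers registered")
        # First ring point at or after the user's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(user_id),))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["chat-1", "chat-2", "chat-3"])
print(ring.server_for("user-42"))
```

The useful property is that removing a server only remaps the users that were on it; everyone else keeps their existing connection target.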
💡

Interview Tip

This problem often trips candidates who try to use HTTP polling or SSE instead of WebSockets. Be explicit about WHY WebSockets: bidirectional, persistent, low-overhead framing. Also state that the global connection map (userId → server) must live in a shared store such as Redis, not only in each server's memory; otherwise you can't route cross-server messages.
