
Design a Notification System

Multi-channel notifications (push, email, SMS, in-app): template management, user preferences, delivery tracking, rate limiting, and priority handling.

20 min read · High interview weight

Problem Statement

A notification system delivers messages to users across multiple channels — push notifications (mobile/desktop), email, SMS, and in-app — based on events from other services. The challenges are: routing to the right channel, respecting user preferences, handling third-party delivery failures gracefully, and not spamming users (rate limiting). At scale, a system like Facebook's processes billions of notifications per day.

Requirements

| Functional | Non-Functional |
| --- | --- |
| Send push, email, SMS, and in-app notifications | 1 B notifications/day |
| User preference management (opt-in/out per channel and type) | Critical alerts delivered in < 5 seconds |
| Template-based messages with variable substitution | At-least-once delivery guarantee |
| Priority levels: critical, high, normal, low | 99.9% delivery success rate |
| Delivery status tracking and retry on failure | No duplicate notifications to users |
| Scheduled notifications | Rate limiting: max 10 push/hour per user |
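
To make the volume requirement concrete, the daily figure can be converted to a per-second rate. The sketch below does only that arithmetic; the peak multiplier of 3 is an illustrative assumption, not a number from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


daily_volume = 1_000_000_000  # 1 B notifications/day, from the requirements
peak_factor = 3               # assumption: peak traffic is about 3x the average

avg = average_rate(daily_volume)
peak = avg * peak_factor
print(f"average: {avg:,.0f}/s, assumed peak: {peak:,.0f}/s")
# prints "average: 11,574/s, assumed peak: 34,722/s"
```

Roughly 11.6K notifications per second on average is the number the queues, workers, and provider connections have to sustain, with headroom for whatever peak factor the real traffic shows.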

High-Level Architecture

[Diagram: Notification system high-level architecture]

Message Flow

[Diagram: End-to-end notification delivery flow]

Channel Types and Third-Party Providers

| Channel | Provider Examples | Latency | Cost | Reliability |
| --- | --- | --- | --- | --- |
| Push (iOS) | Apple APNs | < 1 sec | Free | High (~99%) |
| Push (Android) | Firebase FCM | < 1 sec | Free | High (~99%) |
| Email | Amazon SES, SendGrid | 1-30 sec | Low ($0.0001/email) | Medium (98%) |
| SMS | Twilio, AWS SNS | 1-10 sec | High ($0.0075/SMS) | Medium (97%) |
| In-App | Custom (Redis + WebSocket) | < 100 ms | Infrastructure cost | High (controlled) |

User Preferences

Every notification must respect user preferences. A preference matrix maps `(userId, notificationType, channel)` → `enabled`. This is stored in MySQL for durability and cached in Redis for low-latency lookup. Example schema:

sql
CREATE TABLE notification_preferences (
  user_id    BIGINT NOT NULL,
  notif_type VARCHAR(64) NOT NULL,   -- e.g., 'order_update', 'marketing', 'chat'
  channel    VARCHAR(16) NOT NULL,   -- 'push', 'email', 'sms', 'in_app'
  enabled    BOOLEAN NOT NULL DEFAULT TRUE,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,  -- refreshed on every change
  PRIMARY KEY (user_id, notif_type, channel)
);

-- Cache key pattern: prefs:{userId}:{notif_type} → JSON blob of channel flags
-- TTL: 5 minutes (user changes are eventually consistent)

Priority Queues and Rate Limiting

Not all notifications are equal. Use separate Kafka topics by priority: `notif.critical`, `notif.high`, `notif.normal`, `notif.low`. Each topic is consumed by its own pool of workers, and the critical topic gets the most consumers. Rate limiting prevents notification fatigue:

  • Per-user rate limit: max 10 push notifications per hour per user (token bucket in Redis).
  • Marketing vs transactional: transactional notifications (order confirmed, payment failed) bypass rate limits; marketing notifications are rate-limited.
  • Quiet hours: respect user time zones and suppress non-critical notifications during user-defined quiet hours (e.g., 11 PM – 7 AM).
  • Deduplication: assign each notification a `notifId` (UUID) when the event is first created, so redelivered copies carry the same ID; check Redis for a duplicate before sending. TTL: 24 hours.
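
The per-user limit and the transactional bypass from the list above can be sketched as a token bucket. This version keeps state in a local dictionary for readability; a real deployment would hold the bucket in Redis as the text suggests. The class name, method names, and injected clock are illustrative choices, not part of any library.

```python
import time
from typing import Callable, Dict, Tuple


class PushRateLimiter:
    """Token bucket per user: `capacity` tokens, refilled evenly over `window_seconds`."""

    def __init__(
        self,
        capacity: int = 10,            # max 10 push notifications ...
        window_seconds: int = 3600,    # ... per hour, from the requirements
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = float(capacity)
        self._window = float(window_seconds)
        self._clock = clock
        # user_id -> (tokens remaining, time of last update)
        self._buckets: Dict[int, Tuple[float, float]] = {}

    def allow(self, user_id: int, transactional: bool = False) -> bool:
        """Return True if a push may be sent to this user right now."""
        if transactional:
            return True  # transactional notifications bypass the limit
        now = self._clock()
        tokens, updated_at = self._buckets.get(user_id, (self._capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        refill = (now - updated_at) * self._capacity / self._window
        tokens = min(self._capacity, tokens + refill)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[user_id] = (tokens, now)
        return allowed
```

A token bucket permits a burst of up to `capacity` sends and then one more every six minutes, so a user can briefly exceed 10 sends in a rolling hour. If the limit must be a strict cap, a fixed or sliding window counter is the usual alternative.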

Retry and Dead Letter Queue

Third-party providers (FCM, Twilio) can fail temporarily. Implement exponential backoff retries with a cap (e.g., 3 attempts: immediately, 1 min, 5 min). After all retries are exhausted, send to a Dead Letter Queue (DLQ) for monitoring and manual inspection. Alert on DLQ depth exceeding a threshold.
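
The retry schedule and DLQ hand-off described above can be sketched as follows. `send` is a hypothetical provider call, `TransientProviderError` is a made-up exception type for retryable failures, and a plain list stands in for the DLQ. Sleeping inside the worker is a simplification; a real system would more likely republish the message to a delayed retry queue so the worker stays free.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Sequence

RETRY_DELAYS_SECONDS = (0, 60, 300)  # immediately, 1 min, 5 min (from the text)


class TransientProviderError(Exception):
    """Hypothetical error for retryable provider failures (timeouts, 5xx)."""


def deliver_with_retry(
    send: Callable[[Dict[str, Any]], None],   # hypothetical provider call
    notification: Dict[str, Any],
    dead_letters: List[Dict[str, Any]],       # stand-in for the DLQ
    delays: Sequence[int] = RETRY_DELAYS_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try each scheduled attempt; on exhaustion, record the message in the DLQ."""
    last_error: Optional[Exception] = None
    for delay in delays:
        if delay:
            sleep(delay)  # wait before this attempt
        try:
            send(notification)
            return True
        except TransientProviderError as exc:
            last_error = exc  # remember why the attempt failed, then retry
    dead_letters.append(
        {"notification": notification, "attempts": len(delays), "error": str(last_error)}
    )
    return False
```

Only the transient error type is caught here, so any other exception propagates immediately instead of consuming retry attempts.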

⚠️

Idempotency Is Critical

With at-least-once delivery semantics (Kafka), a notification worker may process the same event twice. Without deduplication, users receive duplicate notifications. Before dispatching, always claim a short-lived Redis key (`notif:{notifId}:sent`) with `SET notif:{notifId}:sent 1 NX EX 86400`, which checks and sets atomically in one operation: an `OK` reply means this worker should send, and a nil reply means the notification was already handled.
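
A minimal sketch of that check, assuming a key store with "set if absent, with expiry" semantics. `InMemoryKeyStore` below only mimics that behavior so the example runs without a server; with the redis-py client the equivalent call would be `client.set(key, "1", nx=True, ex=86400)`, which returns a truthy value only when the key was newly created. `dispatch_once` and `send` are hypothetical names.

```python
import time
from typing import Callable, Dict

SENT_TTL_SECONDS = 86_400  # 24 hours


class InMemoryKeyStore:
    """Stand-in that mimics Redis `SET key value NX EX ttl` (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def set_if_absent(self, key: str, ttl_seconds: int) -> bool:
        """Create the key and return True, or return False if it is still live."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True


def dispatch_once(
    store: InMemoryKeyStore,
    notif_id: str,
    send: Callable[[str], None],  # hypothetical provider call
) -> bool:
    """Send only if this notifId has not been claimed within the TTL."""
    key = f"notif:{notif_id}:sent"
    if not store.set_if_absent(key, SENT_TTL_SECONDS):
        return False  # duplicate delivery of the same event: skip
    send(notif_id)
    return True
```

Claiming the key before sending means a crash between the claim and the send would suppress that notification until the key expires. Deleting the key when a send fails is one way to keep retries possible, at the cost of a small duplicate window.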

Template System

Notification content is driven by templates with variable substitution. Templates are stored in a database (MySQL) and cached in Redis. Example template for an order confirmation:

text
Template ID: order_confirmed_push
Channel: push
Subject: "Order Confirmed!"
Body: "Hi {{firstName}}, your order #{{orderId}} has been confirmed.
       Estimated delivery: {{deliveryDate}}."

Runtime rendering:
  variables = { firstName: "Alice", orderId: "ORD-12345", deliveryDate: "Feb 22" }
  rendered  = "Hi Alice, your order #ORD-12345 has been confirmed.
               Estimated delivery: Feb 22."
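
The rendering step above can be sketched with a small regular-expression substitution. This is one possible approach rather than a prescribed engine; the `render` function and its fail-fast behavior on missing variables are choices made for this illustration.

```python
import re
from typing import Mapping

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} with its value; fail loudly on a missing variable."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(substitute, template)


body = (
    "Hi {{firstName}}, your order #{{orderId}} has been confirmed. "
    "Estimated delivery: {{deliveryDate}}."
)
print(render(body, {"firstName": "Alice", "orderId": "ORD-12345", "deliveryDate": "Feb 22"}))
# prints "Hi Alice, your order #ORD-12345 has been confirmed. Estimated delivery: Feb 22."
```

Raising on a missing variable keeps a half-rendered message such as "Hi {{firstName}}" from ever reaching a user; the worker can route the failure to the DLQ instead.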

Scaling Considerations

  • Kafka partitioning: partition `notif.normal` by `userId` to maintain per-user ordering and keep multiple workers from handling the same user simultaneously (this reduces racing sends; deduplication still guards against redelivery).
  • Worker auto-scaling: scale workers based on Kafka consumer lag. Lag > 10K messages → add workers. Critical topic always has minimum N workers.
  • Connection pooling to third-party APIs: FCM and SES have connection limits. Use an HTTP connection pool per worker to maximize throughput.
  • In-app notification storage: store in Cassandra (partition key `userId`, clustering key `timestamp`) for inbox functionality. Keep the unread count in a Redis counter.
  • Delivery analytics: track open rates, click rates, unsubscribes — essential for tuning rate limits and detecting deliverability issues.
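
The partitioning point above relies on every message for a given user mapping to the same partition. The sketch below shows the idea with a stable hash of the key modulo the partition count. It uses SHA-256 purely as a stand-in: Kafka's default partitioner hashes the key bytes with murmur2, so the partition numbers here would not match a real cluster.

```python
import hashlib


def partition_for(user_id: int, num_partitions: int) -> int:
    """Map a user to a partition with a stable hash (stand-in for Kafka's murmur2)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same user always lands on the same partition, so one consumer in the
# group sees all of that user's notifications, in order.
assert partition_for(42, 12) == partition_for(42, 12)
```

Changing the partition count remaps most users to different partitions, so ordering guarantees only hold while the count stays fixed.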
💡

Interview Tip

This problem is deceptively simple — don't underestimate it. The three concepts interviewers want to see are: (1) decoupled event-driven architecture (services publish events, notification workers consume), (2) user preference lookup with caching, and (3) idempotency/deduplication to prevent duplicate sends. Bonus points for mentioning quiet hours and the DLQ pattern.
