Dev.to #systemdesign·February 28, 2026

Designing a Resilient National Vaccine Appointment System

This article details the system design for a national vaccine appointment and administration system, focusing on overcoming race conditions, stock mismatches, and external API failures. It emphasizes a multi-stage reservation pattern, leveraging temporary holds, rigorous eligibility checks, and comprehensive rollback strategies. The design highlights the importance of resilience and handling unhappy paths in high-concurrency, resource-constrained environments.

The Challenge: Beyond the Happy Path

Systems that manage limited resources under high demand, such as a national vaccine appointment system, often reveal critical flaws when they are designed only for the "happy path." Naive initial designs tend to overlook race conditions for the last available slot, stock inconsistencies between booking and administration, late eligibility failures, and the absence of any rollback mechanism when a transaction fails midway. These scenarios call for architectural patterns that explicitly account for concurrency, data consistency, and external service dependencies.

Key Learning from the Article

The core takeaway is to shift the design mindset from focusing solely on successful flows to proactively identifying and addressing failure scenarios at every step. This involves designing for concurrency, temporary resource allocation, and clear rollback strategies.

Multi-Stage Reservation Pattern for Resource Management

The article advocates a multi-stage reservation pattern, similar to e-commerce inventory management or concert ticket booking. The pattern has three stages: a temporary hold, comprehensive verification, and final confirmation. Resources (vaccine slots, doses) are therefore not permanently allocated until all prerequisites are met.

  1. <b>Reserve First (Temporary Hold):</b> When a citizen selects a slot, the system creates a temporary reservation in a fast, in-memory store such as Redis, with a TTL (Time-To-Live). Creating the reservation atomically decrements the available capacity and sets the appointment status to <code>PENDING</code>. Redis's atomic <code>DECR</code> command keeps this step fast under contention, and the TTL cleans up abandoned holds automatically.
  2. <b>Eligibility Verification:</b> While the slot is held, the system performs various checks (age, insurance, medical history, geographic region). If any check fails, the temporary reservation is released, and the slot becomes available again.
  3. <b>Confirm Appointment:</b> If all checks pass, the slot and vaccine stock are permanently decreased in the main database, the appointment status becomes <code>CONFIRMED</code>, and the Redis reservation is cleared. This is the transactional "point of no return."
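
The three stages above can be sketched in a few lines. This is a minimal, single-threaded illustration, not the article's implementation: an in-memory dictionary stands in for both Redis and the main database, and the names <code>SlotReservations</code>, <code>Hold</code>, and <code>book</code> are hypothetical. In a real Redis deployment the capacity check and the decrement would presumably need to run as one atomic step, and capacity held by expired reservations has to be returned to the pool somehow; here <code>_expire_holds</code> does that lazily.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Hold:
    slot_id: str
    expires_at: float


@dataclass
class SlotReservations:
    """In-memory stand-in for the hold store and the main database (assumption)."""
    available: Dict[str, int]                    # slot_id -> free capacity
    ttl_seconds: float = 300.0
    clock: Callable[[], float] = time.monotonic  # injectable so tests can move time
    holds: Dict[str, Hold] = field(default_factory=dict)
    confirmed: Dict[str, str] = field(default_factory=dict)  # hold_id -> slot_id

    def _expire_holds(self) -> None:
        # Give capacity back for every hold whose TTL has passed.
        now = self.clock()
        for hold_id, hold in list(self.holds.items()):
            if hold.expires_at <= now:
                self.available[hold.slot_id] += 1
                del self.holds[hold_id]

    def reserve(self, slot_id: str) -> Optional[str]:
        """Stage 1: take a temporary hold, or return None if the slot is full."""
        self._expire_holds()
        if self.available.get(slot_id, 0) <= 0:
            return None
        self.available[slot_id] -= 1
        hold_id = uuid.uuid4().hex
        self.holds[hold_id] = Hold(slot_id, self.clock() + self.ttl_seconds)
        return hold_id

    def release(self, hold_id: str) -> None:
        """Stage 2 failure path: return the held slot to the pool."""
        hold = self.holds.pop(hold_id, None)
        if hold is not None:
            self.available[hold.slot_id] += 1

    def confirm(self, hold_id: str) -> bool:
        """Stage 3: make the hold permanent; fails if the hold already expired."""
        self._expire_holds()
        hold = self.holds.pop(hold_id, None)
        if hold is None:
            return False
        self.confirmed[hold_id] = hold.slot_id
        return True


def book(store: SlotReservations, slot_id: str, is_eligible: Callable[[], bool]) -> str:
    """Run reserve -> verify -> confirm and report the outcome."""
    hold_id = store.reserve(slot_id)
    if hold_id is None:
        return "SLOT_FULL"
    if not is_eligible():
        store.release(hold_id)  # roll back the temporary hold
        return "REJECTED"
    return "CONFIRMED" if store.confirm(hold_id) else "EXPIRED"
```

The point of the sketch is that every exit path either confirms the hold or returns the capacity: a failed eligibility check releases it explicitly, and an abandoned hold is reclaimed once its TTL passes.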

Handling Failure and Rollback Scenarios

A robust system must explicitly define how it handles failures. The design incorporates automated no-show processing to release stock, immediate stock release upon user cancellation, and clinic-initiated cancellations with priority rebooking. For external API failures (e.g., insurance verification), which are critical in a distributed system, it uses circuit breakers and retry queues to prevent cascading failures. For internal component failures such as Redis going down, it considers a fallback to slower, database-level reservations, which keeps the system functional at degraded performance.
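
The circuit breaker and retry queue can be illustrated with a small sketch. It is one common shape of the pattern under stated assumptions, not the article's code: the class <code>CircuitBreaker</code>, the helper <code>verify_insurance</code>, the thresholds, and the status strings are all hypothetical, and a <code>deque</code> stands in for a durable retry queue.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def verify_insurance(breaker: CircuitBreaker, check: Callable[[str], bool],
                     patient_id: str, retry_queue: Deque[str]) -> str:
    """Call the (hypothetical) insurance API through the breaker.

    On any failure, including a fast-fail from an open circuit, the patient
    is queued for a later retry instead of failing the whole booking.
    """
    try:
        return "VERIFIED" if breaker.call(lambda: check(patient_id)) else "REJECTED"
    except Exception:
        retry_queue.append(patient_id)
        return "PENDING_RETRY"
```

Once the breaker is open, callers stop waiting on a dependency that is known to be down, which is what keeps one slow external API from tying up the appointment flow.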

High-Level Architecture and Key Components

The proposed architecture is microservice-based, leveraging several specialized services to manage different aspects of the system:

  • <b>API Gateway:</b> Handles authentication, rate limiting, and routing.
  • <b>Core Services:</b> Separate services for Authentication, Patient, Clinic, Inventory, Appointment (the central orchestration service), Eligibility (rules engine + external APIs), Notification, and Audit.
  • <b>Data Layer:</b> PostgreSQL for persistent data, Redis for temporary reservations and caching.
  • <b>Asynchronous Messaging:</b> Kafka is used for event-driven communication (<code>AppointmentReserved</code>, <code>AppointmentConfirmed</code>, etc.), promoting service decoupling and inherent auditability.
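
To make the event-driven piece concrete, here is a minimal in-process sketch. It does not use Kafka or any Kafka client; <code>EventBus</code> and <code>Event</code> are hypothetical names, and the append-only list only imitates the way a topic retains every published event. The event type strings are the ones named in the list above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Event:
    type: str            # e.g. "AppointmentReserved"
    appointment_id: str


@dataclass
class EventBus:
    """In-process stand-in for a message broker (assumption, not Kafka itself)."""
    log: List[Event] = field(default_factory=list)  # append-only record of events
    handlers: DefaultDict[str, List[Callable[[Event], None]]] = field(
        default_factory=lambda: defaultdict(list))

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event: Event) -> None:
        # Record first, then fan out, so the log covers every event
        # even when no service is subscribed to its type.
        self.log.append(event)
        for handler in self.handlers[event.type]:
            handler(event)


# Usage: the notification side reacts to confirmations only, while an
# audit reader can replay the full log without the publisher knowing about it.
bus = EventBus()
notified: List[str] = []
bus.subscribe("AppointmentConfirmed", lambda e: notified.append(e.appointment_id))
bus.publish(Event("AppointmentReserved", "a1"))
bus.publish(Event("AppointmentConfirmed", "a1"))
```

The publisher never references the notification or audit consumers directly, which is the decoupling the bullet describes; the retained log is what gives the audit trail for free.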

Tags: reservation system, concurrency, race conditions, rollback, microservices, Redis, event-driven architecture, circuit breaker
