Reddit successfully migrated its entire Apache Kafka fleet, more than 500 brokers carrying over a petabyte of live data, from Amazon EC2 to Kubernetes. The migration was completed with no downtime, no data loss, and no changes to client connection configurations. This case study demonstrates how careful planning, incremental steps, and strategic use of abstraction layers can facilitate complex infrastructure changes in highly available systems.
Key Migration Constraints and Architectural Implications
The migration was shaped by four critical constraints, each dictating specific architectural approaches and ruling out others:
- Zero Downtime/Data Loss: No maintenance window was acceptable, ruling out strategies like scheduled cutovers or dual-writes that inherently involve downtime or data resynchronization.
- Preserve Kafka Metadata: Kafka's internal state (metadata managed by ZooKeeper) could not be rebuilt from scratch, requiring new brokers to join the existing cluster.
- Client Connectivity: Client applications were tightly coupled to specific broker hostnames, preventing simple broker replacements without breaking hundreds of services. Reddit also lacked control over client discovery mechanisms.
- Reversibility: Every step had to be reversible to allow for recovery from unforeseen issues, necessitating a prolonged period of mixed EC2 and Kubernetes broker operation.
Phased Migration Strategy
Reddit's migration was meticulously executed in distinct phases:
- Phase 1: Taking Control of the Naming Layer. A DNS facade was introduced, acting as an abstraction layer between clients and Kafka brokers. This allowed Reddit to control where client DNS records pointed without modifying client application code, a crucial step for decoupling.
- Phase 2: Making Room for New Brokers. The existing EC2 brokers occupied the low ID range that Strimzi assigns by default. Reddit temporarily doubled the cluster size with new EC2 brokers using higher IDs, then decommissioned the original low-ID brokers, freeing the necessary ID space for Kubernetes brokers.
- Phase 3: Running a Mixed Cluster. This involved a modified Strimzi operator to allow Kubernetes-managed Kafka brokers to join the existing EC2-based cluster. Customizations ensured inter-broker communication across environments and shared ZooKeeper metadata.
- Phase 4: Gradually Shifting Data and Traffic. Using Cruise Control, an open-source rebalancing tool for Kafka, partition leadership and replica data were incrementally moved from EC2 to Kubernetes brokers. The phase was continuously monitored, and every reassignment was designed to be reversible.
- Phase 5: Migrating the Control Plane. After all data and traffic were on Kubernetes, the control plane was migrated from ZooKeeper to KRaft, Kafka's built-in metadata system, leveraging documented procedures. This separation of concerns minimized risk.
- Phase 6: Cleaning Up. The custom Strimzi operator was replaced with the standard version, and EC2 infrastructure was decommissioned.
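Phase 5 follows Kafka's documented ZooKeeper-to-KRaft migration path (KIP-866). A broker-side configuration during the migration window looks roughly like the sketch below; hostnames and the controller ID are illustrative, not Reddit's actual values:

```properties
# Opt in to the documented ZooKeeper-to-KRaft metadata migration (KIP-866).
zookeeper.metadata.migration.enable=true

# The legacy ZooKeeper ensemble, still connected during the migration window.
zookeeper.connect=zk-0.internal:2181,zk-1.internal:2181,zk-2.internal:2181

# The new KRaft controller quorum the broker registers with (id@host:port).
controller.quorum.voters=3000@controller-0.kafka.svc:9093
controller.listener.names=CONTROLLER
```

Once the metadata is fully migrated and verified, the ZooKeeper settings are removed and the cluster runs in KRaft mode only; keeping this control-plane switch separate from the data migration is what the article means by separation of concerns minimizing risk.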
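The naming-layer idea from Phase 1 can be sketched as a toy model (the class and hostnames below are illustrative, not Reddit's actual tooling): clients keep connecting through stable alias hostnames while operators repoint each alias behind the scenes.

```python
# Toy model of a DNS facade: a stable alias resolves to whichever
# backend the operators currently point it at. Clients only ever
# configure the alias, so repointing requires no client changes.

class DnsFacade:
    """Stable alias hostname -> current backend address."""

    def __init__(self):
        self._records = {}

    def set_record(self, alias, backend):
        # Operator-side change; invisible to clients.
        self._records[alias] = backend

    def resolve(self, alias):
        # Client-side lookup via the stable alias.
        return self._records[alias]


facade = DnsFacade()
facade.set_record("kafka-1.example.com", "ec2-broker-1.internal")
assert facade.resolve("kafka-1.example.com") == "ec2-broker-1.internal"

# Migration step: repoint the alias to a Kubernetes broker. The client
# configuration ("kafka-1.example.com") is untouched.
facade.set_record("kafka-1.example.com", "broker-1.kafka.svc.cluster.local")
assert facade.resolve("kafka-1.example.com") == "broker-1.kafka.svc.cluster.local"
```

The key property is asymmetry of change: either side of the alias can move independently, which is exactly what made broker replacement possible without breaking hundreds of services.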
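Phase 2's ID shuffle amounts to a simple planning step, sketched below under the assumption that the legacy brokers hold low IDs and Strimzi needs that range (function and values are hypothetical):

```python
# Hypothetical sketch of Phase 2: map each low-ID legacy broker to a
# temporary high ID, freeing the low range for Strimzi-managed brokers
# (Strimzi numbers brokers from 0 by default).

def plan_temporary_ids(existing_ids, base=1000):
    """Return {old_id: new_high_id} with all new IDs at or above `base`."""
    return {old: base + i for i, old in enumerate(sorted(existing_ids))}


# Four legacy EC2 brokers squatting on the IDs Strimzi would assign.
plan = plan_temporary_ids([0, 1, 2, 3])
assert plan == {0: 1000, 1: 1001, 2: 1002, 3: 1003}
```

In practice each "reassignment" here is a full broker replacement (Kafka broker IDs are not mutable in place), which is why Reddit temporarily doubled the cluster rather than renumbering existing machines.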
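The incremental, reversible traffic shift of Phase 4 can be illustrated with a batching sketch (Reddit drove the real reassignments through Cruise Control; the function below is illustrative only):

```python
# Sketch of Phase 4's approach: move partitions in small batches so each
# step can be verified, and rolled back if needed, before the next begins.

def move_in_batches(partitions, batch_size):
    """Yield successive reassignment batches of at most `batch_size`."""
    for i in range(0, len(partitions), batch_size):
        yield partitions[i:i + batch_size]


partitions = [f"topic-a/{p}" for p in range(10)]
batches = list(move_in_batches(partitions, batch_size=4))

assert len(batches) == 3
assert batches[0] == ["topic-a/0", "topic-a/1", "topic-a/2", "topic-a/3"]
assert batches[-1] == ["topic-a/8", "topic-a/9"]
```

Small batches are what make reversibility cheap: at any point, only the most recent batch needs to be undone to return to a known-good state.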
💡 System Design Lessons Learned
- Abstraction Layers: Introducing an abstraction (like the DNS facade) between clients and infrastructure is vital for decoupling and enabling seamless migrations. It allows changes on one side without impacting the other.
- Protecting Logical State: Treat metadata and logical state as the core assets to protect, and infrastructure as replaceable.
- Reversibility: Designing for reversibility in every step increases confidence and reduces risk in large-scale changes.