This article details Netflix's architectural shift in moving data from Apache Cassandra to Apache Iceberg for analytics, addressing the limitations of their legacy system, Casspactor. It highlights the transition from a monolithic, metadata-dependent system to a layered, Spark-based architecture that directly processes S3 backups, improving reliability, scalability, and cost efficiency. The post also outlines a robust migration strategy emphasizing validation, visibility, and safety.
Read original on Netflix Tech BlogNetflix heavily relies on Apache Cassandra for mission-critical applications and uses data movement to Apache Iceberg for various analytics and operational tasks. Initially, an in-house connector called Casspactor handled this, processing ~1,200 data movements and ~3 PB of data daily. However, this monolithic system faced significant architectural challenges as data scale and complexity grew, necessitating a redesign to improve reliability, performance, and maintainability.
Netflix redesigned its data movement system around a layered architecture using Apache Spark and leveraging direct S3 backup access. The core innovation is the Cassandra Analytics Wrapper, which reads raw data directly from S3 backups and translates it into standard Spark DataFrames. On top of this foundation, a "Connector Factory" model allows individual data abstractions (e.g., Key Value, Time Series) to build highly optimized, data-model-aware connectors using Java UDFs and transforms, eliminating post-processing.
Key Architectural Improvements
The new stack fundamentally shifts from relying on external, inconsistent metadata to using S3 as the single source of truth for backups. Processing is moved to Spark executors, directly producing DataFrames and enabling model-aware transformations at the source, significantly reducing complexity and improving efficiency.
The migration from Casspactor to the new stack was executed using a "Like-for-Like" strategy that preserved the user-facing interface, output contract, and final data artifact. This meant downstream teams required no code changes or validation, transforming a distributed, high-risk effort into an internal platform implementation detail. The migration was guided by three pillars: Validation (shadow testing, row-by-row data consistency checks), Visibility (real-time monitoring of progress and health), and Safety (minimizing user impact through abstractions and fallbacks).