Menu
Netflix Tech Blog·June 19, 2026

Netflix's Evolution of Cassandra Data Movement to Iceberg

This article details Netflix's architectural shift in moving data from Apache Cassandra to Apache Iceberg for analytics, addressing the limitations of their legacy system, Casspactor. It highlights the transition from a monolithic, metadata-dependent system to a layered, Spark-based architecture that directly processes S3 backups, improving reliability, scalability, and cost efficiency. The post also outlines a robust migration strategy emphasizing validation, visibility, and safety.

Read original on Netflix Tech Blog

The Challenge: Scaling Cassandra Data Movement

Netflix heavily relies on Apache Cassandra for mission-critical applications and uses data movement to Apache Iceberg for various analytics and operational tasks. Initially, an in-house connector called Casspactor handled this, processing ~1,200 data movements and ~3 PB of data daily. However, this monolithic system faced significant architectural challenges as data scale and complexity grew, necessitating a redesign to improve reliability, performance, and maintainability.

Limitations of the Legacy Casspactor Architecture

  • Fragile Metadata Dependencies: Casspactor relied on multiple independent systems to ascertain backup status and content. This composite view was prone to divergence and inconsistencies, leading to stale or incorrect data reads. Uncoordinated Cassandra snapshots during maintenance often broke data movement.
  • Inherited Limitations for Data Abstractions: As Cassandra backed various higher-level data abstractions (Key Value, Time Series, Graph), Casspactor's lack of data model awareness forced each abstraction to bolt on complex, costly post-processing steps. This led to issues like skewed partition failures (out-of-memory errors on large datasets), intermediate table bloat (significant storage overhead), and inability to "time travel" due to schema/topology changes.
  • Monolithic Design: Built as a single connector, Casspactor lacked an engine-like foundation, preventing the development of a family of purpose-built connectors on a shared, robust base.

The New Stack: A Layered, Spark-Based Architecture

Netflix redesigned its data movement system around a layered architecture using Apache Spark and leveraging direct S3 backup access. The core innovation is the Cassandra Analytics Wrapper, which reads raw data directly from S3 backups and translates it into standard Spark DataFrames. On top of this foundation, a "Connector Factory" model allows individual data abstractions (e.g., Key Value, Time Series) to build highly optimized, data-model-aware connectors using Java UDFs and transforms, eliminating post-processing.

ℹ️

Key Architectural Improvements

The new stack fundamentally shifts from relying on external, inconsistent metadata to using S3 as the single source of truth for backups. Processing is moved to Spark executors, directly producing DataFrames and enabling model-aware transformations at the source, significantly reducing complexity and improving efficiency.

  • Handles Skewed Partitions: Mutation compaction and processing are moved to Spark executors, efficiently handling wide partitions without excessive data shuffling or out-of-memory errors.
  • Operates at Spark DataFrames: Eliminates costly intermediate Iceberg tables, reducing storage bloat and operational complexity, and providing a universal interface for connectors.
  • Reduced Dependencies & Time Travel: Directly reading metadata from S3 backups makes S3 the authoritative source of truth, improving reliability and enabling robust time travel functionality by cohesively processing schema, topology, and data at specific points in time.
  • Performance & Cost Savings: Collective improvements led to significant reductions in execution runtime and storage/compute footprint, resulting in multi-million dollar cost savings.

Strategic Migration for Zero Downtime

The migration from Casspactor to the new stack was executed using a "Like-for-Like" strategy that preserved the user-facing interface, output contract, and final data artifact. This meant downstream teams required no code changes or validation, transforming a distributed, high-risk effort into an internal platform implementation detail. The migration was guided by three pillars: Validation (shadow testing, row-by-row data consistency checks), Visibility (real-time monitoring of progress and health), and Safety (minimizing user impact through abstractions and fallbacks).

CassandraApache IcebergData MovementSparkS3Distributed Data ProcessingMigration StrategyNetflix

Comments

Loading comments...
Netflix's Evolution of Cassandra Data Movement to Iceberg | SysDesAi