Menu
InfoQ Architecture·June 4, 2026

Architecting a Centralized Platform for Distributed Data Deletion at Netflix

This article discusses the architectural challenges of implementing a robust, centralized data deletion platform across Netflix's highly distributed data landscape. It delves into the complexities of ensuring durability, availability, and correctness during data deletion, especially when data is replicated and transformed across various datastores. The authors highlight the need for a comprehensive strategy to propagate deletes, manage tombstone accumulation, and build auditability without impacting live traffic.

Read original on InfoQ Architecture

The Challenge of Safe Data Deletion in Distributed Systems

Deleting data reliably in a large-scale, distributed environment like Netflix is far from trivial. It's a critical operation that balances durability (data stays deleted), availability (system remains operational), and correctness (no unintended deletions or lingering data). Accidental deletions can cause severe production incidents, while failing to delete data incurs storage costs and erodes customer trust. The core problem is that data often exists in multiple locations—primary datastores, caches, search indexes, backups—and each system handles deletions differently.

Varieties of Data Deletion Mechanisms

Datastores employ diverse mechanisms for data deletion, each with its own trade-offs in terms of performance, operational risk, and cost. Understanding these variations is crucial for designing a coherent deletion strategy:

  • Time-to-Live (TTL): Data is marked to expire after a set period. Systems like Cassandra, DynamoDB, and Redis support native TTLs, but deletion isn't instantaneous; it often involves background compaction or cleanup processes. Elasticsearch and RDS typically require external scheduling or mark-and-sweep approaches.
  • Hard Deletes: Direct delete commands are issued. Cassandra marks data as a tombstone, which is eventually cleaned up. DynamoDB deletes immediately. Caches like EVCache and Redis might free up logical space but often require background merging or compaction to reclaim physical memory.
  • Soft Deletes: Application-level marking of data as 'deleted' with a flag, followed by a separate background process for physical removal. This adds complexity but offers a safety net for recovery.
⚠️

Hidden Costs and 'Ghosts' in the System

Deletion operations, especially background processes like compaction and vacuuming, consume significant CPU, memory, and I/O resources, leading to performance degradation, increased latencies (due to scanning tombstones), and even read timeouts. Furthermore, misconfigurations or operational errors can lead to "data resurrection," where deleted data reappears due to inconsistent state across replicas or delayed cleanup processes.

Propagating Deletes Across Distributed Data Flows

A major architectural challenge arises from data replication and transformation. Data often flows from a primary source (e.g., Cassandra) to secondary systems for indexing (Elasticsearch), caching (EVCache), and backup (S3). Simply deleting from the source is insufficient; it leads to dangling pointers and stale data in downstream systems. The recommended approach for a centralized deletion platform is a fanout strategy, where a delete operation from the source asynchronously propagates to every copy across all relevant downstream systems to ensure no stale or orphaned data persists.

data deletiondistributed systemsdata consistencynetflixdata lifecycletombstonesdata managementarchitectural patterns

Comments

Loading comments...