This article discusses the architectural challenges of implementing a robust, centralized data deletion platform across Netflix's highly distributed data landscape. It delves into the complexities of ensuring durability, availability, and correctness during data deletion, especially when data is replicated and transformed across various datastores. The authors highlight the need for a comprehensive strategy to propagate deletes, manage tombstone accumulation, and build auditability without impacting live traffic.
Read original on InfoQ Architecture
Deleting data reliably in a large-scale, distributed environment like Netflix is far from trivial. It is a critical operation that must balance durability (data stays deleted), availability (the system remains operational), and correctness (no unintended deletions and no lingering data). Accidental deletions can cause severe production incidents, while failing to delete data incurs storage costs and erodes customer trust. The core problem is that data often exists in multiple locations (primary datastores, caches, search indexes, backups), and each system handles deletions differently.
Datastores employ diverse mechanisms for data deletion, each with its own trade-offs in performance, operational risk, and cost. Understanding these variations is crucial for designing a coherent deletion strategy.
Hidden Costs and 'Ghosts' in the System
Deletion operations, especially background processes like compaction and vacuuming, consume significant CPU, memory, and I/O resources, leading to performance degradation, increased latencies (due to scanning tombstones), and even read timeouts. Furthermore, misconfigurations or operational errors can lead to "data resurrection," where deleted data reappears due to inconsistent state across replicas or delayed cleanup processes.
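To make the "data resurrection" failure mode concrete, here is a minimal sketch under stated assumptions. It models a toy last-write-wins store with two replicas, tombstones, and a purge grace period (loosely analogous to settings such as Cassandra's `gc_grace_seconds`). The `Cell`, `Replica`, `repair`, and `scenario` names are hypothetical and do not come from the article or from any real datastore API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Cell:
    value: Optional[str]  # None marks a tombstone (hypothetical model)
    ts: int               # write timestamp; the newest write wins


@dataclass
class Replica:
    cells: Dict[str, Cell] = field(default_factory=dict)

    def write(self, key: str, value: Optional[str], ts: int) -> None:
        current = self.cells.get(key)
        if current is None or ts > current.ts:
            self.cells[key] = Cell(value, ts)

    def delete(self, key: str, ts: int) -> None:
        # A delete is stored as a tombstone, not as an immediate removal.
        self.write(key, None, ts)

    def read(self, key: str) -> Optional[str]:
        cell = self.cells.get(key)
        return None if cell is None else cell.value

    def purge_tombstones(self, now: int, grace: int) -> None:
        # Compaction-style cleanup: drop tombstones older than the grace period.
        expired = [k for k, c in self.cells.items()
                   if c.value is None and now - c.ts >= grace]
        for k in expired:
            del self.cells[k]


def repair(a: Replica, b: Replica) -> None:
    # Anti-entropy sketch: each replica offers its cells to the other.
    for key in set(a.cells) | set(b.cells):
        for src, dst in ((a, b), (b, a)):
            cell = src.cells.get(key)
            if cell is not None:
                dst.write(key, cell.value, cell.ts)


def scenario(repair_before_purge: bool) -> Optional[str]:
    a, b = Replica(), Replica()
    a.write("user:1", "profile", ts=1)
    b.write("user:1", "profile", ts=1)
    a.delete("user:1", ts=2)  # replica b is unreachable and misses the delete
    if repair_before_purge:
        repair(a, b)          # the tombstone reaches b in time
    a.purge_tombstones(now=20, grace=10)
    b.purge_tombstones(now=20, grace=10)
    repair(a, b)              # a late repair runs after the purge
    return a.read("user:1")
```

In this toy model, `scenario(True)` leaves the key deleted on both replicas, while `scenario(False)` brings the old value back: the tombstone is purged before the lagging replica ever sees it, so the later repair treats the stale copy as live data.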
A major architectural challenge arises from data replication and transformation. Data often flows from a primary source (e.g., Cassandra) to secondary systems for indexing (Elasticsearch), caching (EVCache), and backup (S3). Simply deleting from the source is insufficient; it leads to dangling pointers and stale data in downstream systems. The recommended approach for a centralized deletion platform is a fanout strategy, where a delete operation from the source asynchronously propagates to every copy across all relevant downstream systems to ensure no stale or orphaned data persists.
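The fanout idea can be sketched roughly as follows. This is an illustrative outline, not the platform's actual design: the `DeleteFanout` class, the `make_store_deleter` helper, the retry count, and the in-memory dictionaries standing in for a primary store, a search index, and a cache are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# A deleter removes one key from one downstream system (hypothetical interface).
Deleter = Callable[[str], Awaitable[None]]


class DeleteFanout:
    def __init__(self, targets: Dict[str, Deleter], max_attempts: int = 3):
        self.targets = targets
        self.max_attempts = max_attempts
        # (key, target name, outcome) records, kept for auditability.
        self.audit_log: List[Tuple[str, str, str]] = []

    async def _delete_from(self, name: str, deleter: Deleter, key: str) -> bool:
        for _ in range(self.max_attempts):
            try:
                await deleter(key)
                self.audit_log.append((key, name, "deleted"))
                return True
            except Exception:
                await asyncio.sleep(0)  # placeholder for a real backoff policy
        self.audit_log.append((key, name, "failed"))
        return False

    async def delete(self, key: str) -> Dict[str, bool]:
        # Propagate the delete to every registered copy concurrently.
        results = await asyncio.gather(
            *(self._delete_from(n, d, key) for n, d in self.targets.items())
        )
        return dict(zip(self.targets, results))


def make_store_deleter(store: dict, failures: int = 0) -> Deleter:
    remaining = {"n": failures}

    async def deleter(key: str) -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise ConnectionError("simulated transient failure")
        store.pop(key, None)  # idempotent: deleting a missing key is fine

    return deleter


def demo():
    primary = {"user:1": "row"}
    index = {"user:1": "doc"}
    cache = {"user:1": "blob"}
    fanout = DeleteFanout({
        "primary": make_store_deleter(primary),
        "search_index": make_store_deleter(index, failures=1),
        "cache": make_store_deleter(cache),
    })
    status = asyncio.run(fanout.delete("user:1"))
    return status, primary, index, cache, fanout.audit_log
```

Two design choices matter in this sketch. Each deleter is idempotent, so retries after a transient failure are safe, and every outcome is written to an audit log, so a target that exhausts its retries is recorded as failed instead of silently leaving an orphaned copy.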