This article details Meta's strategy and solutions for migrating its petabyte-scale data ingestion system, which powers analytics and ML across the social graph. It highlights the architectural shift from customer-owned pipelines to a self-managed data warehouse service, emphasizing the rigorous migration lifecycle, data quality validation, and robust rollback mechanisms crucial to ensuring reliability during such a massive transition.
Meta successfully revamped its petabyte-scale data ingestion system, moving from a legacy architecture with customer-owned pipelines to a more efficient, self-managed data warehouse service. The migration was critical for improving reliability and meeting strict data landing-time requirements for the social graph, which is stored in one of the largest MySQL deployments in the world. The new system efficiently scrapes this data for analytics, reporting, and ML model training across the company.
Migrating thousands of data ingestion jobs, each responsible for incrementally scraping petabytes of data, posed significant challenges in ensuring data integrity and operational reliability. Meta established a stringent migration job lifecycle to manage this complexity, defining clear success criteria for each job before it could advance. These criteria included verifying no data quality issues (row count and checksum consistency), no landing latency regression, and no resource utilization regression.
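As an illustration, a promotion gate over these criteria might look like the following sketch. The signal names and thresholds here are assumptions for illustration, not Meta's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class JobSignals:
    """Per-job health signals collected while old and new pipelines run in parallel."""
    row_count_match: bool           # production vs. shadow row counts agree
    checksum_match: bool            # per-partition checksums agree
    landing_latency_delta_s: float  # new landing time minus old, in seconds
    resource_ratio: float           # new resource usage / old resource usage

def meets_promotion_criteria(signals: JobSignals) -> bool:
    """A job advances through the migration lifecycle only if every criterion
    holds: no data quality issues, no landing-latency regression, and no
    resource utilization regression."""
    return (signals.row_count_match
            and signals.checksum_match
            and signals.landing_latency_delta_s <= 0.0
            and signals.resource_ratio <= 1.0)
```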
A crucial aspect of the migration was building custom data quality analysis tooling. This system compared row counts and checksums between production and shadow tables, logging any mismatches to Scuba, Meta's real-time data analytics system. Automated hourly analysis surfaced example rows behind each mismatch, enabling rapid debugging. Because the ingestion process follows a Change Data Capture (CDC) model, in which problematic data can propagate downstream, robust rollout and rollback mechanisms were essential. Early post-rollout signals (via backfill comparisons) and a quick stop-the-bleeding mechanism (marking bad partitions in metadata) prevented bad data from propagating and enabled swift fixes.
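A toy sketch of this kind of validation, assuming an order-insensitive checksum and hypothetical log_mismatch / metadata_store hooks (none of which reflect Meta's actual tooling):

```python
import hashlib
from typing import Iterable, Tuple

def partition_checksum(rows: Iterable[Tuple]) -> str:
    """Order-insensitive partition checksum: XOR of per-row digests plus a row
    count, so the result does not depend on scrape order. (Illustrative scheme,
    not Meta's actual algorithm.)"""
    acc, count = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
        count += 1
    return f"{count}:{acc:016x}"

def validate_partition(prod_rows, shadow_rows, log_mismatch) -> bool:
    """Compare a production partition against its shadow copy; on mismatch,
    emit a record for debugging (in the article, mismatches go to Scuba)."""
    prod_sum = partition_checksum(prod_rows)
    shadow_sum = partition_checksum(shadow_rows)
    if prod_sum != shadow_sum:
        log_mismatch({"prod": prod_sum, "shadow": shadow_sum})
        return False
    return True

def quarantine_partition(metadata_store, table: str, partition: str) -> None:
    """'Stop the bleeding': mark the partition as bad in metadata so downstream
    consumers skip it until a corrected snapshot lands. (Hypothetical API.)"""
    metadata_store.mark_bad(table, partition)
```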
To handle tens of thousands of jobs, Meta developed automated tooling that monitored job status signals (based on the defined lifecycle criteria) and automatically promoted or demoted jobs through the migration stages. Dashboards provided both system-level and individual job-level visibility. Due to limited capacity, jobs were migrated in batches, prioritized by throughput, business need, and special cases. Careful planning involved excluding jobs with known issues to reduce noise and performing full data dumps to correct snapshots if data quality issues were detected post-migration.
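Conceptually, the promote/demote automation behaves like a small state machine over migration stages. The sketch below assumes a shadow → reverse-shadow → migrated progression; the exact stage set and transition rules are illustrative:

```python
from enum import Enum

class Stage(Enum):
    SHADOW = 1          # new pipeline runs in parallel; output is compared only
    REVERSE_SHADOW = 2  # new pipeline serves; old pipeline validates
    MIGRATED = 3        # old pipeline retired for this job

# Forward and backward transitions between stages.
PROMOTE = {Stage.SHADOW: Stage.REVERSE_SHADOW,
           Stage.REVERSE_SHADOW: Stage.MIGRATED}
DEMOTE = {nxt: prev for prev, nxt in PROMOTE.items()}

def step(stage: Stage, healthy: bool) -> Stage:
    """Advance a job that meets its success criteria; roll back one that
    doesn't. Jobs at either end of the lifecycle stay where they are."""
    table = PROMOTE if healthy else DEMOTE
    return table.get(stage, stage)
```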
System Design Takeaways
When planning large-scale system migrations, prioritize a phased approach (e.g., shadow/reverse shadow), invest in comprehensive automated monitoring and data quality validation, and design for rapid rollback capabilities. The ability to compare old and new system outputs in parallel and quickly halt bad data propagation is paramount for maintaining reliability.