This article details Meta's strategy and solutions for migrating its petabyte-scale data ingestion system, which powers analytics and ML across the social graph. It highlights the architectural shift from customer-owned pipelines to a self-managed data warehouse service, emphasizing the rigorous migration lifecycle, data quality validation, and robust rollback mechanisms crucial to ensuring reliability during such a massive transition.
Meta successfully revamped its petabyte-scale data ingestion system, moving from a legacy architecture with customer-owned pipelines to a more efficient, self-managed data warehouse service. The migration was critical for improving reliability and meeting strict data landing-time requirements for the social graph, which is stored in one of the largest MySQL deployments in the world. The new system efficiently scrapes this data for analytics, reporting, and ML model training across the company.
Migrating thousands of data ingestion jobs, each responsible for incrementally scraping petabytes of data, posed significant challenges in ensuring data integrity and operational reliability. Meta established a stringent migration job lifecycle to manage this complexity, defining clear success criteria for each job before it could advance. These criteria included verifying no data quality issues (row count and checksum consistency), no landing latency regression, and no resource utilization regression.
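As an illustration, a promotion gate over these criteria might look like the following sketch. The signal names and thresholds here are assumptions for illustration, not Meta's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class JobSignals:
    """Per-job health signals collected while old and new pipelines run in parallel."""
    row_count_match: bool           # production vs. shadow row counts agree
    checksum_match: bool            # per-partition checksums agree
    landing_latency_delta_s: float  # new landing time minus old, in seconds
    resource_ratio: float           # new resource usage / old resource usage

def meets_promotion_criteria(signals: JobSignals) -> bool:
    """A job advances through the migration lifecycle only if every criterion
    holds: no data quality issues, no landing-latency regression, and no
    resource utilization regression."""
    return (signals.row_count_match
            and signals.checksum_match
            and signals.landing_latency_delta_s <= 0.0
            and signals.resource_ratio <= 1.0)
```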
A crucial aspect of the migration was building custom data quality analysis tooling. This system compared row counts and checksums between production and shadow tables, logging any mismatches to Scuba, Meta's real-time data analytics system. Automated hourly analysis surfaced example rows behind each mismatch, enabling rapid debugging. Because the ingestion process follows a Change Data Capture (CDC) model, in which problematic data can propagate downstream, robust rollout and rollback mechanisms were essential. Early post-rollout signals (via backfill comparisons) and a quick stop-the-bleeding mechanism (marking bad partitions in metadata) prevented bad data from propagating and enabled swift fixes.
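A toy sketch of this kind of validation, assuming an order-insensitive checksum and hypothetical log_mismatch / metadata_store hooks (none of which reflect Meta's actual tooling):

```python
import hashlib
from typing import Iterable, Tuple

def partition_checksum(rows: Iterable[Tuple]) -> str:
    """Order-insensitive partition checksum: XOR of per-row digests plus a row
    count, so the result does not depend on scrape order. (Illustrative scheme,
    not Meta's actual algorithm.)"""
    acc, count = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
        count += 1
    return f"{count}:{acc:016x}"

def validate_partition(prod_rows, shadow_rows, log_mismatch) -> bool:
    """Compare a production partition against its shadow copy; on mismatch,
    emit a record for debugging (in the article, mismatches go to Scuba)."""
    prod_sum = partition_checksum(prod_rows)
    shadow_sum = partition_checksum(shadow_rows)
    if prod_sum != shadow_sum:
        log_mismatch({"prod": prod_sum, "shadow": shadow_sum})
        return False
    return True

def quarantine_partition(metadata_store, table: str, partition: str) -> None:
    """'Stop the bleeding': mark the partition as bad in metadata so downstream
    consumers skip it until a corrected snapshot lands. (Hypothetical API.)"""
    metadata_store.mark_bad(table, partition)
```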
To handle tens of thousands of jobs, Meta developed automated tooling that monitored job status signals (based on the defined lifecycle criteria) and automatically promoted or demoted jobs through the migration stages. Dashboards provided both system-level and individual job-level visibility. Due to limited capacity, jobs were migrated in batches, prioritized by throughput, business need, and special cases. Careful planning involved excluding jobs with known issues to reduce noise and performing full data dumps to correct snapshots if data quality issues were detected post-migration.
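Conceptually, the promote/demote automation behaves like a small state machine over migration stages. The sketch below assumes a shadow → reverse-shadow → migrated progression; the exact stage set and transition rules are illustrative:

```python
from enum import Enum

class Stage(Enum):
    SHADOW = 1          # new pipeline runs in parallel; output is compared only
    REVERSE_SHADOW = 2  # new pipeline serves; old pipeline validates
    MIGRATED = 3        # old pipeline retired for this job

# Forward and backward transitions between stages.
PROMOTE = {Stage.SHADOW: Stage.REVERSE_SHADOW,
           Stage.REVERSE_SHADOW: Stage.MIGRATED}
DEMOTE = {nxt: prev for prev, nxt in PROMOTE.items()}

def step(stage: Stage, healthy: bool) -> Stage:
    """Advance a job that meets its success criteria; roll back one that
    doesn't. Jobs at either end of the lifecycle stay where they are."""
    table = PROMOTE if healthy else DEMOTE
    return table.get(stage, stage)
```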
System Design Takeaways
When planning large-scale system migrations, prioritize a phased approach (e.g., shadow/reverse shadow), invest in comprehensive automated monitoring and data quality validation, and design for rapid rollback capabilities. The ability to compare old and new system outputs in parallel and quickly halt bad data propagation is paramount for maintaining reliability.