The engineering team at Meta recently outlined how the company migrated its data ingestion platform, which transfers several petabytes of MySQL social graph data daily, to improve reliability and operational efficiency. The team used techniques like reverse shadowing and continuous checksum monitoring to ensure zero downtime during the transition.
Meta operates one of the world’s largest MySQL deployments, with a data ingestion platform that supports analytics, reporting, machine learning, and internal product development workloads. The company recently redesigned its architecture, replacing customer-owned pipelines with a centralized, self-managed warehouse service.
With the migration, Meta replaced fragmented, pipeline-owned infrastructure with a centralized managed system, using staged migrations, automated validation, rollback controls, and compatibility layers to transition thousands of ingestion pipelines without disrupting downstream analytics and ML workloads.
Applying canary-style rollouts at massive scale, Meta migrated each ingestion job through three stages: a shadow phase that validated the new system against production data, a reverse shadow phase that swapped production ownership while preserving rollback capability, and a cleanup phase that retired the legacy pipeline after consistency and performance checks passed. Zihao Tao, software engineer at Meta, and colleagues from the engineering team explain:
We continuously monitored row count and checksum mismatches between the production jobs and the shadow jobs. When mismatches occurred, we quickly investigated the root cause and deployed fixes to the pre-production environment, then verified that the mismatch was resolved. During this step, we also measured the compute and storage quotas for the shadow jobs to ensure that the production environment had sufficient resources before proceeding.

Source: Meta engineering blog
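To make the row count and checksum comparison concrete, here is a minimal Python sketch. It is not Meta's implementation: the function names `table_fingerprint` and `compare_tables`, the tuple row format, and the choice of SHA-256 with a wrapped 64-bit sum are all assumptions made for illustration. The sum is used so that the fingerprint does not depend on row order, since two systems may write the same rows in a different order.

```python
import hashlib
from typing import Iterable, Tuple


def table_fingerprint(rows: Iterable[tuple]) -> Tuple[int, str]:
    """Return (row_count, checksum) for a table (hypothetical helper).

    Each row is hashed on its own and the hashes are summed modulo 2**64,
    so the result is the same regardless of row order.
    """
    count = 0
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        acc = (acc + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, f"{acc:016x}"


def compare_tables(production_rows: Iterable[tuple],
                   shadow_rows: Iterable[tuple]) -> dict:
    """Compare a production table with its shadow copy (hypothetical helper)."""
    prod_count, prod_sum = table_fingerprint(production_rows)
    shadow_count, shadow_sum = table_fingerprint(shadow_rows)
    return {
        "row_count_match": prod_count == shadow_count,
        "checksum_match": prod_sum == shadow_sum,
    }


if __name__ == "__main__":
    production = [(1, "alice"), (2, "bob")]
    shadow = [(2, "bob"), (1, "alice")]  # same rows, different order
    print(compare_tables(production, shadow))
```

A mismatch in either field would flag the job for investigation before it moves to the next stage. A production system would likely compute such fingerprints per partition inside the warehouse rather than in application code.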
Having now completed the migration of the entire data ingestion workload and retired the legacy system, the team acknowledges the challenge of the large-scale infrastructure transition:
Ensuring a seamless migration meant we had to effectively track the migration lifecycle for thousands of jobs and put robust rollout and rollback controls in place to handle issues that might arise during the migration process.
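One way to picture the lifecycle tracking and rollback controls is as a small state machine per job. The sketch below is illustrative only: the `Stage` names follow the three stages described in the article, while the `LEGACY` starting state, the `advance` and `roll_back` functions, and the rule that a cleaned-up job cannot be rolled back are assumptions rather than details from Meta.

```python
from enum import Enum


class Stage(Enum):
    LEGACY = "legacy"                  # assumed start: only the legacy job runs
    SHADOW = "shadow"                  # new system shadows the legacy job
    REVERSE_SHADOW = "reverse_shadow"  # new system owns production, legacy shadows it
    CLEANED_UP = "cleaned_up"          # legacy pipeline retired


# Allowed forward and backward moves between stages (assumed for illustration).
FORWARD = {
    Stage.LEGACY: Stage.SHADOW,
    Stage.SHADOW: Stage.REVERSE_SHADOW,
    Stage.REVERSE_SHADOW: Stage.CLEANED_UP,
}
ROLLBACK = {
    Stage.SHADOW: Stage.LEGACY,
    Stage.REVERSE_SHADOW: Stage.SHADOW,
}


def advance(stage: Stage, checks_passed: bool) -> Stage:
    """Move a job forward one stage, but only if its checks passed."""
    if not checks_passed:
        return stage
    return FORWARD.get(stage, stage)


def roll_back(stage: Stage) -> Stage:
    """Move a job back one stage; impossible once the legacy job is retired."""
    if stage not in ROLLBACK:
        raise ValueError(f"cannot roll back from {stage.value}")
    return ROLLBACK[stage]


if __name__ == "__main__":
    stage = advance(Stage.LEGACY, checks_passed=True)
    stage = advance(stage, checks_passed=True)
    print(stage.value)             # reverse_shadow
    print(roll_back(stage).value)  # shadow
```

Keeping the transitions in explicit tables makes it easy to audit which moves are allowed across thousands of jobs.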
Before rollout, each migrated job had to pass strict correctness and performance checks: the team compared row counts and checksums between the old and new systems, monitored latency and resource usage for regressions, and applied additional requirements to critical tables used by dependent teams. The team explains:
Both our legacy and new data ingestion systems used change data capture (CDC) to incrementally ingest data into the target table. Each data ingestion job has its own internal table for a full dump of source databases (full dump), an internal table for capturing changes of source databases (delta), and the target table consumed by the data customers. All the information about job entities, including table names and table schemas, is saved and managed by the central management service.

Source: Meta engineering blog
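The full dump, delta, and target table relationship can be sketched in a few lines of Python. This is a simplified illustration, not Meta's code: the `apply_delta` function, the `(op, key, row)` change format, and the use of in-memory dictionaries in place of warehouse tables are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical change record: (operation, primary key, new row or None).
Change = Tuple[str, int, Optional[dict]]


def apply_delta(full_dump: Dict[int, dict], delta: List[Change]) -> Dict[int, dict]:
    """Build the target table from a full dump plus captured changes.

    Changes are applied in order, so a later change to the same key
    overrides an earlier one. The full dump itself is left untouched.
    """
    target = dict(full_dump)
    for op, key, row in delta:
        if op == "upsert":
            target[key] = row
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


if __name__ == "__main__":
    full_dump = {1: {"name": "alice"}, 2: {"name": "bob"}}
    delta = [
        ("upsert", 2, {"name": "bobby"}),  # row 2 changed at the source
        ("delete", 1, None),               # row 1 removed at the source
        ("upsert", 3, {"name": "carol"}),  # row 3 added at the source
    ]
    print(apply_delta(full_dump, delta))
```

Because the target is derived from the dump and the delta, a bad delta can be repaired by re-running the merge, whereas a bad full dump requires taking a new snapshot of the source.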
Syed Moeen Kazmi comments:
Migrating data ingestion at Meta scale isn't an upgrade. It's open-heart surgery on core business. The challenge isn't just moving data, it's maintaining consistency and zero downtime.
Because the CDC architecture relied on expensive full snapshots for initial loads and post-fix recovery, Meta avoided creating shadow jobs until data quality issues were resolved. This prevented repeated large-scale full dumps and significantly improved migration efficiency. The team also reduced infrastructure load by reusing snapshot partitions from the legacy system during the initial migration stages.