InfoQ Architecture · March 2, 2026

Uber's Hybrid Cloud Data Replication: Scaling Petabytes with Distcp and HiveSync

This article details how Uber engineered its data replication platform to handle petabytes of data daily across hybrid cloud and on-premise data lakes. By optimizing the open-source Distcp framework and developing HiveSync, Uber addressed extreme scaling challenges for analytics, machine learning, and disaster recovery. The enhancements focused on parallelization, efficiency improvements for various job sizes, and robust observability to ensure reliability during massive data migrations.


Uber faced significant challenges in replicating petabytes of data daily between its on-premise HDFS and cloud data lakes as workloads grew rapidly. Its solution was to heavily optimize Hadoop's open-source Distcp framework and pair it with an in-house orchestration platform, HiveSync. This re-engineering effort allowed Uber to manage extreme-scale data replication for analytics, machine learning, and disaster recovery across a hybrid cloud environment.

The Core Architecture: Distcp and HiveSync

The foundation of Uber's replication system is Distcp, which leverages Hadoop's MapReduce for parallel data copying. Files are split into blocks, processed by Copy Mapper tasks in YARN containers, and then reassembled. Uber's HiveSync, initially inspired by Airbnb's ReAir, orchestrates both bulk and incremental replication, submitting Distcp jobs asynchronously for larger datasets (>256 MB) and using a monitoring thread to track progress. The primary challenge was scaling this architecture from 250 TB to over 1 PB daily replication and handling a massive increase in datasets.
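The orchestration logic described above can be sketched in a few lines. This is a hypothetical illustration, not Uber's actual API: datasets over the 256 MB threshold are handed to asynchronously submitted Distcp-style jobs whose progress is tracked separately, while smaller ones are copied inline. The function and dataset names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Size threshold above which replication is submitted as an async job,
# per the article. All names below are illustrative stand-ins.
ASYNC_THRESHOLD_BYTES = 256 * 1024 * 1024  # 256 MB

def run_distcp_job(dataset):
    # Stand-in for submitting a MapReduce Distcp job to YARN.
    return f"distcp:{dataset['name']}"

def copy_inline(dataset):
    # Stand-in for a direct, synchronous copy of a small dataset.
    return f"inline:{dataset['name']}"

def replicate(datasets, executor):
    results, futures = {}, {}
    for ds in datasets:
        if ds["size"] > ASYNC_THRESHOLD_BYTES:
            # Large dataset: submit asynchronously and track the future,
            # mirroring HiveSync's monitoring thread.
            futures[ds["name"]] = executor.submit(run_distcp_job, ds)
        else:
            results[ds["name"]] = copy_inline(ds)
    for name, fut in futures.items():
        results[name] = fut.result()  # wait for async jobs to finish
    return results

with ThreadPoolExecutor(max_workers=4) as pool:
    out = replicate(
        [{"name": "events", "size": 10**9}, {"name": "dim", "size": 10**6}],
        pool,
    )
```

The key design point is that the submitting thread never blocks on large copies; it only joins on their futures after all jobs are in flight.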

Key Optimizations for Extreme Scale

  • Offloading and Parallelization: Resource-intensive tasks like Copy Listing and Input Splitting were moved from the HiveSync server to the Application Master, drastically reducing HDFS client contention and job submission latency (up to 90%). Copy Listing and Copy Committer tasks were parallelized to process multiple files simultaneously, improving p99 listing latency by 60% and maximum commit latency by over 97%.
  • Efficiency for Small Transfers: For smaller jobs (fewer than 200 files or under 512 MB total), Hadoop MapReduce's uber-mode feature (coincidentally named, unrelated to the company) was used to run Copy Mapper tasks directly within the Application Master's JVM. This eliminated roughly 268,000 container launches daily, significantly boosting YARN efficiency.
  • Enhanced Observability: Improved metrics for job submission, Copy Listing, Committer, heap usage, and p99 copy rates provided critical insights, enabling engineers to preempt failures and monitor workloads effectively.
  • Robustness and Mitigation: Stress testing, circuit breakers, optimized YARN configurations, and reordered task execution were implemented to mitigate common issues like out-of-memory errors and long-running tasks.
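The copy-listing parallelization in the first bullet can be illustrated with a small sketch. This is not Uber's code; it assumes a hypothetical `list_files` call standing in for an HDFS `listStatus`, and shows the idea of listing many source directories concurrently instead of sequentially, then merging the results into one copy listing.

```python
from concurrent.futures import ThreadPoolExecutor

def list_files(directory):
    # Stand-in for an HDFS listStatus RPC; returns (path, size) pairs.
    fake_fs = {
        "/warehouse/a": [("/warehouse/a/f1", 100), ("/warehouse/a/f2", 200)],
        "/warehouse/b": [("/warehouse/b/f1", 300)],
    }
    return fake_fs.get(directory, [])

def build_copy_listing(directories, max_workers=8):
    # List all source directories concurrently and merge into one listing;
    # with real HDFS calls, this overlaps RPC latency across directories.
    listing = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for files in pool.map(list_files, directories):
            listing.extend(files)
    return sorted(listing)

listing = build_copy_listing(["/warehouse/a", "/warehouse/b"])
```

Because listing is I/O-bound (RPCs to the NameNode), a thread pool suffices; the same shape applies to parallelizing the Copy Committer over files.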
💡

System Design Lesson

When dealing with extreme-scale data processing, identifying bottlenecks in resource-intensive phases (like listing or committing) and strategically parallelizing or offloading those tasks can yield substantial performance gains. Furthermore, optimizing for different workload profiles (e.g., small versus large jobs) prevents inefficiencies and improves overall resource utilization.

These architectural and operational enhancements increased incremental replication capacity fivefold, allowing Uber to successfully replicate over 300 PB during their on-premise-to-cloud migration without major incidents. Future plans include further parallelization of file permission settings and input splitting, moving compute-intensive commit tasks to the Reduce phase, and implementing dynamic bandwidth throttling, with Uber planning to open-source these improvements.
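The planned dynamic bandwidth throttling could, in principle, be built on a token bucket. The minimal sketch below is an assumption about one plausible mechanism, not Uber's design; a dynamic version would adjust `rate_bps` at runtime from cluster load signals rather than fixing it at construction.

```python
import time

class TokenBucket:
    """Blocks callers until enough byte tokens have accrued at rate_bps."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps          # steady-state bytes per second
        self.capacity = burst_bytes   # maximum burst size
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def consume(self, nbytes):
        # Refill based on elapsed time, then block until enough tokens exist.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

bucket = TokenBucket(rate_bps=10 * 1024 * 1024, burst_bytes=1024 * 1024)
bucket.consume(512 * 1024)  # within the burst allowance, returns immediately
```

A copy task would call `consume(len(chunk))` before each write, so aggregate transfer speed converges to the configured rate.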

Tags: Hybrid Cloud, Data Replication, Hadoop, Distcp, HDFS, Data Lake, YARN, Scalability
