This article details how Uber engineered its data replication platform to handle petabytes of data daily across hybrid cloud and on-premise data lakes. By optimizing the open-source Distcp framework and developing HiveSync, Uber addressed extreme scaling challenges for analytics, machine learning, and disaster recovery. The enhancements focused on parallelization, efficiency improvements for various job sizes, and robust observability to ensure reliability during massive data migrations.
Read original on InfoQ.

Uber faced significant challenges in replicating petabytes of data daily between its on-premise HDFS and cloud data lakes due to rapidly growing workloads. Their solution involved heavily optimizing and building upon Hadoop's open-source Distcp framework, creating a robust platform known as HiveSync. This re-engineering effort allowed them to manage extreme-scale data replication crucial for analytics, machine learning, and disaster recovery across a hybrid cloud environment.
The foundation of Uber's replication system is Distcp, which leverages Hadoop's MapReduce framework for parallel data copying. Files are split into blocks, processed by Copy Mapper tasks in YARN containers, and then reassembled at the destination. Uber's HiveSync, initially inspired by Airbnb's ReAir, orchestrates both bulk and incremental replication, submitting Distcp jobs asynchronously for larger datasets (>256 MB) and using a monitoring thread to track their progress. The primary challenge was scaling this architecture from 250 TB to over 1 PB of daily replication while also handling a massive increase in the number of datasets.
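The orchestration pattern described above can be illustrated with a minimal sketch. This is not Uber's HiveSync code; the function names, the thread-pool stand-in for YARN job submission, and the dataset dictionaries are all illustrative assumptions. Only the 256 MB size cutoff and the async-submit-plus-monitor shape come from the article.

```python
import concurrent.futures
import time

# Threshold from the article: datasets above 256 MB go to an async Distcp job.
SIZE_THRESHOLD_BYTES = 256 * 1024 * 1024

def copy_inline(dataset):
    # Hypothetical path for small datasets: copy synchronously,
    # without the overhead of launching a cluster job.
    return f"copied {dataset['name']} inline"

def submit_distcp(dataset):
    # Hypothetical stand-in for launching a Distcp MapReduce job on YARN.
    time.sleep(0.01)  # simulate job runtime
    return f"distcp finished for {dataset['name']}"

def replicate(datasets):
    """Route small datasets to inline copies and large ones to async
    Distcp jobs, then monitor the async jobs until completion."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        pending = []
        for ds in datasets:
            if ds["size"] > SIZE_THRESHOLD_BYTES:
                # Asynchronous submission; completion is tracked below,
                # playing the role of HiveSync's monitoring thread.
                pending.append(pool.submit(submit_distcp, ds))
            else:
                results.append(copy_inline(ds))
        for fut in concurrent.futures.as_completed(pending):
            results.append(fut.result())
    return results
```

The key design point is that the submitting thread never blocks on a large copy: jobs run in the background and a separate loop collects their results, so many replications can be in flight at once.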
System Design Lesson
When dealing with extreme-scale data processing, identifying bottlenecks in resource-intensive phases (like listing or committing) and strategically parallelizing or offloading those tasks can yield substantial performance gains. Furthermore, optimizing for different workload profiles (e.g., small versus large jobs) prevents inefficiencies and improves overall resource utilization.
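As a concrete illustration of parallelizing a resource-intensive phase, the sketch below fans out the listing step (one of the bottlenecks the article names) across a thread pool, one task per top-level subdirectory. This uses the local filesystem purely as a stand-in for HDFS listing RPCs; the function names and worker count are assumptions, not Uber's implementation.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_partition(root, sub):
    # List one subdirectory; stand-in for a single HDFS listing call.
    path = os.path.join(root, sub)
    return [os.path.join(path, f) for f in os.listdir(path)]

def parallel_listing(root, max_workers=8):
    """Enumerate files under root, listing each subdirectory concurrently
    instead of walking the tree sequentially."""
    subdirs = [d for d in os.listdir(root)
               if os.path.isdir(os.path.join(root, d))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_dir = pool.map(lambda d: list_partition(root, d), subdirs)
    files = [f for listing in per_dir for f in listing]
    # Files directly under root are listed once, sequentially.
    files += [os.path.join(root, f) for f in os.listdir(root)
              if os.path.isfile(os.path.join(root, f))]
    return files
```

Because listing is dominated by I/O latency rather than CPU, threads overlap the round trips and the total listing time approaches that of the slowest single directory instead of the sum of all of them.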
These architectural and operational enhancements increased incremental replication capacity fivefold, allowing Uber to successfully replicate over 300 PB during their on-premise-to-cloud migration without major incidents. Future plans include further parallelization of file permission settings and input splitting, moving compute-intensive commit tasks to the Reduce phase, and implementing dynamic bandwidth throttling, with Uber planning to open-source these improvements.
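Dynamic bandwidth throttling, listed above as a future plan, is commonly built on a token-bucket limiter whose rate can be adjusted at runtime. The sketch below shows that general technique only; it is an assumption about how such throttling might look, not a description of Uber's planned design.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter. Calling set_rate() at runtime is what
    makes the throttling 'dynamic': an external controller can raise or
    lower replication throughput as cluster load changes."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = time.monotonic()

    def set_rate(self, rate_bytes_per_sec):
        # Adjust allowed throughput without recreating the bucket.
        self.rate = rate_bytes_per_sec

    def consume(self, nbytes):
        # Block until nbytes worth of tokens have accumulated.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```

A copier would call `consume(len(chunk))` before writing each chunk, so the aggregate transfer rate tracks whatever limit the controller has most recently set.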