This article details how Uber engineered its data replication platform to handle petabytes of data daily across hybrid cloud and on-premise data lakes. By optimizing the open-source Distcp framework and developing HiveSync, Uber addressed extreme scaling challenges for analytics, machine learning, and disaster recovery. The enhancements focused on parallelization, efficiency improvements for various job sizes, and robust observability to ensure reliability during massive data migrations.
Read original on InfoQ.

Uber faced significant challenges in replicating petabytes of data daily between its on-premise HDFS and cloud data lakes due to rapidly growing workloads. Their solution involved heavily optimizing and building upon Hadoop's open-source Distcp framework, creating a robust platform known as HiveSync. This re-engineering effort allowed them to manage extreme-scale data replication crucial for analytics, machine learning, and disaster recovery across a hybrid cloud environment.
The foundation of Uber's replication system is Distcp, which leverages Hadoop's MapReduce framework for parallel data copying. Files are split into blocks, processed by Copy Mapper tasks in YARN containers, and then reassembled at the destination. Uber's HiveSync, initially inspired by Airbnb's ReAir, orchestrates both bulk and incremental replication, submitting Distcp jobs asynchronously for larger datasets (>256 MB) and using a monitoring thread to track their progress. The primary challenge was scaling this architecture from 250 TB to over 1 PB of daily replication while also handling a massive increase in the number of datasets.
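The orchestration pattern described above can be illustrated with a minimal sketch. This is not Uber's HiveSync code; the function names, the thread-pool stand-in for YARN job submission, and the dataset dictionaries are all illustrative assumptions. Only the 256 MB size cutoff and the async-submit-plus-monitor shape come from the article.

```python
import concurrent.futures
import time

# Threshold from the article: datasets above 256 MB go to an async Distcp job.
SIZE_THRESHOLD_BYTES = 256 * 1024 * 1024

def copy_inline(dataset):
    # Hypothetical path for small datasets: copy synchronously,
    # without the overhead of launching a cluster job.
    return f"copied {dataset['name']} inline"

def submit_distcp(dataset):
    # Hypothetical stand-in for launching a Distcp MapReduce job on YARN.
    time.sleep(0.01)  # simulate job runtime
    return f"distcp finished for {dataset['name']}"

def replicate(datasets):
    """Route small datasets to inline copies and large ones to async
    Distcp jobs, then monitor the async jobs until completion."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        pending = []
        for ds in datasets:
            if ds["size"] > SIZE_THRESHOLD_BYTES:
                # Asynchronous submission; completion is tracked below,
                # playing the role of HiveSync's monitoring thread.
                pending.append(pool.submit(submit_distcp, ds))
            else:
                results.append(copy_inline(ds))
        for fut in concurrent.futures.as_completed(pending):
            results.append(fut.result())
    return results
```

The key design point is that the submitting thread never blocks on a large copy: jobs run in the background and a separate loop collects their results, so many replications can be in flight at once.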
System Design Lesson
When dealing with extreme-scale data processing, identifying bottlenecks in resource-intensive phases (like listing or committing) and strategically parallelizing or offloading those tasks can yield substantial performance gains. Furthermore, optimizing for different workload profiles (e.g., small versus large jobs) prevents inefficiencies and improves overall resource utilization.
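As a concrete illustration of parallelizing a resource-intensive phase, the sketch below fans out the listing step (one of the bottlenecks the article names) across a thread pool, one task per top-level subdirectory. This uses the local filesystem purely as a stand-in for HDFS listing RPCs; the function names and worker count are assumptions, not Uber's implementation.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_partition(root, sub):
    # List one subdirectory; stand-in for a single HDFS listing call.
    path = os.path.join(root, sub)
    return [os.path.join(path, f) for f in os.listdir(path)]

def parallel_listing(root, max_workers=8):
    """Enumerate files under root, listing each subdirectory concurrently
    instead of walking the tree sequentially."""
    subdirs = [d for d in os.listdir(root)
               if os.path.isdir(os.path.join(root, d))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_dir = pool.map(lambda d: list_partition(root, d), subdirs)
    files = [f for listing in per_dir for f in listing]
    # Files directly under root are listed once, sequentially.
    files += [os.path.join(root, f) for f in os.listdir(root)
              if os.path.isfile(os.path.join(root, f))]
    return files
```

Because listing is dominated by I/O latency rather than CPU, threads overlap the round trips and the total listing time approaches that of the slowest single directory instead of the sum of all of them.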
These architectural and operational enhancements increased incremental replication capacity fivefold, allowing Uber to successfully replicate over 300 PB during their on-premise-to-cloud migration without major incidents. Future plans include further parallelization of file permission settings and input splitting, moving compute-intensive commit tasks to the Reduce phase, and implementing dynamic bandwidth throttling, with Uber planning to open-source these improvements.
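Dynamic bandwidth throttling, listed above as a future plan, is commonly built on a token-bucket limiter whose rate can be adjusted at runtime. The sketch below shows that general technique only; it is an assumption about how such throttling might look, not a description of Uber's planned design.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter. Calling set_rate() at runtime is what
    makes the throttling 'dynamic': an external controller can raise or
    lower replication throughput as cluster load changes."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = time.monotonic()

    def set_rate(self, rate_bytes_per_sec):
        # Adjust allowed throughput without recreating the bucket.
        self.rate = rate_bytes_per_sec

    def consume(self, nbytes):
        # Block until nbytes worth of tokens have accumulated.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```

A copier would call `consume(len(chunk))` before writing each chunk, so the aggregate transfer rate tracks whatever limit the controller has most recently set.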