Scaling Data Replication: Custom Tools vs. Managed Services?
Sonia Benali
·7325 views
Uber's approach to petabyte-scale data replication, built on customized versions of Hadoop's DistCp and their HiveSync tool, got me thinking about the build-vs-buy question for data infrastructure. When you're moving data at that scale with very particular requirements, such as hybrid-cloud deployments and multiple data lakes, is building in-house always the right call? What do you gain and lose in terms of engineering time, long-term maintenance burden, and flexibility compared to using managed replication services from cloud providers? I've been sketching designs for similar problems myself; I even tried modeling this on SysDesAi, and working through it step by step helped me see some of the data-flow choices more clearly. So I'm eager to hear how other people make this decision when planning for really large data volumes.
26 comments