This article details Booking.com's 20-year journey in integrating AI, focusing on the evolution of their data management and machine learning engineering infrastructure. It highlights their transition from a MySQL-centric stack to distributed systems like Hadoop, and then to a modern ML inference platform, addressing challenges in scalability, real-time predictions, and A/B testing methodologies.
Read original on InfoQ Architecture
Booking.com's AI evolution showcases a typical journey of a large-scale enterprise from monolithic data management to sophisticated distributed systems. Their initial architecture relied heavily on Perl libraries and MySQL, which served their early A/B testing experiments well but faced significant challenges as data scaled.
Initially, Booking.com's database strategy was unusual: it used many small MySQL instances (each limited to 2TB) that fit on NVMe SSDs, achieving sub-350 microsecond point queries. This decentralized approach worked well until data volume outgrew its capabilities. The shift to Apache Hadoop for distributed storage and processing was a critical architectural decision, made to handle petabytes of data and thousands of cores for their growing machine learning pipelines. However, Hadoop presented its own scaling and operational issues for ML workloads.
Key Takeaway: Large-Scale Migration Challenges
The migration away from Hadoop was a monumental effort, taking approximately seven years. This underlines the architectural and operational overhead of large-scale infrastructure transitions. Their strategy involved mapping the ecosystem, analyzing usage, applying PageRank to reduce scope, migrating in phases, and ultimately sunsetting Hadoop, with a unified command center being key to success.
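The summary does not say how PageRank was applied, so the following is only a minimal sketch under an assumption: that the mapped ecosystem is treated as a directed graph of jobs and the datasets they read, and that PageRank scores are used to rank which assets matter most (high scores) and which are candidates for dropping from scope (low scores). All job and table names are invented for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical dependency edges: (consumer job, dataset it reads).
edges = [
    ("ranking_job", "bookings_table"),
    ("pricing_job", "bookings_table"),
    ("report_job", "bookings_table"),
    ("report_job", "clicks_table"),
    ("adhoc_notebook", "scratch_table"),
]

scores = pagerank(edges)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {score:.3f}")
```

In this toy graph the table read by the most jobs ranks first, which is the property that would make such a score useful for deciding what to migrate early and what to retire.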
The machine learning engineering stack evolved from simple Perl scripts to a sophisticated platform leveraging Apache Oozie with Python, Apache Spark with MLlib, H2O.ai, deep learning, and generative AI. A pivotal moment in 2015 involved solving two challenges: real-time online inference at scale, and feature engineering that stays consistent between training and inference. As of 2024, their platform serves over 480 models and 400 billion predictions per day at sub-20 millisecond latency, showcasing a highly optimized and performant inference infrastructure.
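To put the 2024 figures in perspective, the daily total can be converted into average per-second rates. This is back-of-the-envelope arithmetic only: it assumes traffic is spread evenly over the day and across models, which real traffic is not, and the in-flight estimate uses Little's law with the 20 millisecond figure as an upper bound on latency.

```python
SECONDS_PER_DAY = 24 * 60 * 60

predictions_per_day = 400_000_000_000  # figure quoted for 2024
model_count = 480                      # "over 480 models"; used as a round number
latency_bound_s = 0.020                # "sub-20 millisecond", treated as an upper bound

# Average rate across the whole platform, assuming uniform traffic.
avg_per_second = predictions_per_day / SECONDS_PER_DAY

# Average rate per model, assuming load is split evenly (an assumption, not a reported figure).
avg_per_model_per_second = avg_per_second / model_count

# Little's law: in-flight work = arrival rate * time in system.
max_in_flight = avg_per_second * latency_bound_s

print(f"platform average:  {avg_per_second:,.0f} predictions/s")
print(f"per-model average: {avg_per_model_per_second:,.0f} predictions/s")
print(f"in flight (bound): {max_in_flight:,.0f} predictions")
```

The platform-wide average works out to roughly 4.6 million predictions per second, so peak load is presumably higher still.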
Booking.com's data-driven DNA originated from extensive A/B testing, running thousands of experiments in parallel. When evolving their complex ranking algorithm with machine learning, they faced challenges where their existing formula proved