Uber re-architected its data lake ingestion platform from batch processing to a streaming-first model with IngestionNext. This shift significantly reduced data ingestion latency from hours to minutes, enabling faster data availability for analytics and machine learning workloads. The new architecture leverages Apache Kafka, Flink, and Hudi to support continuous event stream processing, transactional commits, and improved resource efficiency.
Uber's IngestionNext represents a fundamental shift in data ingestion strategy, moving away from traditional scheduled batch jobs to a continuous streaming-first approach. This re-architecture was driven by the need for fresher data to support real-time analytics, experimentation platforms, and machine learning models, which were hampered by the hours-long latency of the previous batch-based pipelines (primarily Apache Spark).
Why Streaming Matters
Data freshness is a critical dimension of data quality. Architecting for real-time data ingestion can unlock significant business value by enabling faster decision-making and more responsive applications.
The IngestionNext platform is built on a robust set of distributed systems components: events flow through Apache Kafka for reliable messaging, are processed continuously by Apache Flink jobs, and then written to Apache Hudi tables in the data lake. Hudi provides critical features for a streaming data lake, including transactional commits, rollbacks, and time travel capabilities, which are essential for maintaining data consistency and integrity in a continuous ingestion environment.
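The value of Hudi's transactional semantics in this flow is easiest to see in miniature. The sketch below is a toy model, not Hudi's actual API: a hypothetical `ToyHudiTable` class illustrates how atomic commits, rollbacks, and time travel let a continuous Flink writer land micro-batches safely.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a toy table mimicking, at a high level, the
# transactional features Hudi provides to a streaming data lake.
# ToyHudiTable and its methods are hypothetical names, not Hudi's API.

@dataclass
class ToyHudiTable:
    commits: list = field(default_factory=list)  # each entry is one committed batch of rows

    def commit(self, rows):
        # One atomic append models one transactional commit: readers see
        # either all of the batch or none of it.
        self.commits.append(list(rows))

    def rollback(self):
        # Undo the latest commit, e.g. after a failed write.
        if self.commits:
            self.commits.pop()

    def snapshot(self, as_of=None):
        # "Time travel": read the table as of a given commit index.
        end = len(self.commits) if as_of is None else as_of + 1
        return [row for batch in self.commits[:end] for row in batch]

# Two micro-batches arriving from the Kafka -> Flink pipeline, condensed:
table = ToyHudiTable()
table.commit([{"id": 1}, {"id": 2}])  # commit 0
table.commit([{"id": 3}])             # commit 1
assert len(table.snapshot()) == 3
assert len(table.snapshot(as_of=0)) == 2  # time travel back to commit 0
table.rollback()                          # discard commit 1
assert len(table.snapshot()) == 2
```

The key property the sketch captures is that a continuously writing job never leaves readers with a half-written batch: failures are resolved by rolling the table back to the last good commit.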
Migrating to a streaming model introduced challenges, particularly around file management in the data lake. Continuous writes can produce many small files, which degrade query performance and storage efficiency. Uber addressed this with row-group-level merging for Parquet files and compaction mechanisms that maintain an optimal file layout. The Flink jobs also implement robust checkpointing, partition-skew handling, and recovery: they track stream offsets and coordinate commits so that data remains correct and consistent through failures. These changes significantly improved resource efficiency, cutting compute usage by approximately 25% compared to the batch system.
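The small-file compaction idea can be sketched as a greedy planning step: collect files below a size threshold and bin-pack them toward a target file size, rewriting each bin as one file. This is an assumption-laden simplification of what real compaction services do; the thresholds and the `plan_compaction` function are hypothetical.

```python
# Illustrative sketch only: greedy compaction planning for small files,
# the general idea behind merging many small Parquet files into fewer,
# larger ones. Thresholds and names here are hypothetical.

TARGET_BYTES = 128 * 1024 * 1024  # aim for ~128 MB output files
SMALL_BYTES = 32 * 1024 * 1024    # files under this are compaction candidates

def plan_compaction(file_sizes, target=TARGET_BYTES, small=SMALL_BYTES):
    """Group small files into bins whose combined size approaches target."""
    small_files = sorted((s for s in file_sizes if s < small), reverse=True)
    bins, current, current_size = [], [], 0
    for size in small_files:
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:  # only rewrite when it actually merges files
        bins.append(current)
    return bins           # each bin becomes one rewritten, larger file

mb = 1024 * 1024
sizes = [200 * mb, 5 * mb, 10 * mb, 8 * mb, 20 * mb]
# The 200 MB file is left alone; the four small files merge into one group.
assert plan_compaction(sizes) == [[20 * mb, 10 * mb, 8 * mb, 5 * mb]]
```

Running this kind of planner continuously alongside the ingestion stream keeps the file layout query-friendly without pausing writes, which is the trade-off streaming ingestion forces relative to batch jobs that naturally produce large files.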
IngestionNext successfully reduced data ingestion latency from hours to minutes and achieved a 25% reduction in compute resources. This forms the foundation for a truly end-to-end real-time data stack. Future work will focus on extending streaming capabilities further into downstream transformation and analytics pipelines to ensure the freshness improvements propagate across the entire data workflow.