InfoQ Architecture·March 25, 2026

Uber's IngestionNext: Architecting a Streaming-First Data Lake for Reduced Latency and Cost

Uber re-architected its data lake ingestion platform from batch processing to a streaming-first model with IngestionNext. This shift significantly reduced data ingestion latency from hours to minutes, enabling faster data availability for analytics and machine learning workloads. The new architecture leverages Apache Kafka, Flink, and Hudi to support continuous event stream processing, transactional commits, and improved resource efficiency.


Transitioning from Batch to Streaming Data Lakes

Uber's IngestionNext represents a fundamental shift in data ingestion strategy, moving away from traditional scheduled batch jobs to a continuous streaming-first approach. This re-architecture was driven by the need for fresher data to support real-time analytics, experimentation platforms, and machine learning models, which were hampered by the hours-long latency of the previous batch-based pipelines (primarily Apache Spark).

ℹ️ Why Streaming Matters

Data freshness is a critical dimension of data quality. Architecting for real-time data ingestion can unlock significant business value by enabling faster decision-making and more responsive applications.

Core Architecture of IngestionNext

The IngestionNext platform is built on a robust set of distributed systems components: events flow through Apache Kafka for reliable messaging, are processed continuously by Apache Flink jobs, and then written to Apache Hudi tables in the data lake. Hudi provides critical features for a streaming data lake, including transactional commits, rollbacks, and time travel capabilities, which are essential for maintaining data consistency and integrity in a continuous ingestion environment.

  • Apache Kafka: Distributed streaming platform for high-throughput, fault-tolerant event ingestion.
  • Apache Flink: Stream processing framework for continuous, low-latency processing of event data.
  • Apache Hudi: Data lake platform that enables transactional data management (updates, deletes) on large datasets in formats like Parquet, supporting both batch and streaming workloads.
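The Kafka → Flink → Hudi flow described above can be pictured as a loop that continuously drains an event stream, transforms records, and publishes them to the lake only at transactional commit points. The sketch below is a deliberately simplified model, not Uber's implementation: `HudiLikeTable`, `run_pipeline`, and the per-batch "checkpoint" are all hypothetical stand-ins for Flink's DataStream API and Hudi's streaming sink.

```python
from collections import deque

class HudiLikeTable:
    """Toy stand-in for a Hudi table: writes land in a staging buffer
    and become visible only when a commit is made."""
    def __init__(self):
        self._staged = []
        self.committed = []   # data visible to queries
        self.commit_log = []  # commit timeline (enables rollback / time travel)

    def write(self, records):
        self._staged.extend(records)

    def commit(self, checkpoint_id):
        # Atomically publish the staged batch and record it on the timeline.
        self.committed.extend(self._staged)
        self.commit_log.append((checkpoint_id, len(self._staged)))
        self._staged = []

def run_pipeline(kafka_topic, table, batch_size=2):
    """Drain the topic, transform each event, and commit once per
    'checkpoint' (modeled here as every batch_size records)."""
    buffer, checkpoint_id = [], 0
    while kafka_topic:
        event = kafka_topic.popleft()               # consume from "Kafka"
        buffer.append({**event, "ingested": True})  # "Flink"-style transform
        if len(buffer) >= batch_size:
            table.write(buffer)
            checkpoint_id += 1
            table.commit(checkpoint_id)             # "Hudi" transactional commit
            buffer = []
    if buffer:                                      # flush the final partial batch
        table.write(buffer)
        table.commit(checkpoint_id + 1)

topic = deque([{"id": i} for i in range(5)])
table = HudiLikeTable()
run_pipeline(topic, table)
```

The key property the sketch captures is that downstream readers never observe half-written batches: staged data becomes visible atomically at each commit, which is what makes continuous ingestion safe to query.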

Addressing Technical Challenges in Streaming Data Lakes

Migrating to a streaming model introduced new challenges, particularly around file management in the data lake. Continuous writes tend to produce many small files, which degrade both query performance and storage efficiency. Uber addressed this with row-group-level merging of Parquet files and compaction mechanisms that maintain optimal file layouts. The Flink jobs also implement robust checkpointing, partition-skew handling, and failure recovery, tracking stream offsets and coordinating commits so that data remains correct and consistent after a restart. Together, these changes improved resource efficiency, cutting compute usage by approximately 25% compared to the batch system.
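The small-file side of the problem can be illustrated with a greedy compaction planner: leave files that are already reasonably sized alone, and pack the small ones into merge groups close to a target file size. This is a simplified sketch under assumed parameters (`target_mb`, `small_threshold_mb`), not Uber's actual merging strategy.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Greedy compaction plan: files at or above small_threshold_mb are
    kept as-is; smaller files are packed into merge groups whose total
    size stays near target_mb."""
    keep = [s for s in file_sizes_mb if s >= small_threshold_mb]
    small = sorted((s for s in file_sizes_mb if s < small_threshold_mb),
                   reverse=True)

    groups, current, current_total = [], [], 0
    for size in small:
        # Start a new group when adding this file would overshoot the target.
        if current and current_total + size > target_mb:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return keep, groups

sizes = [200, 30, 30, 30, 30, 30, 4, 150]  # file sizes in MB
keep, groups = plan_compaction(sizes)
```

With these inputs, the 200 MB and 150 MB files are untouched, while the six small files are packed into two merge groups, turning eight files into four. Real compaction (in Hudi, a table service) also has to weigh write amplification and run concurrently with ingestion, which this sketch ignores.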

Impact and Future Directions

IngestionNext successfully reduced data ingestion latency from hours to minutes and achieved a 25% reduction in compute resources. This forms the foundation for a truly end-to-end real-time data stack. Future work will focus on extending streaming capabilities further into downstream transformation and analytics pipelines to ensure the freshness improvements propagate across the entire data workflow.

data lake, streaming, Apache Kafka, Apache Flink, Apache Hudi, real-time analytics, data ingestion, scalability
