
Stream Processing

Real-time data processing: Kafka Streams, Apache Flink, windowing, stateful processing, and stream-table duality.


Batch Processing vs Stream Processing

Batch processing collects data over a period (an hour, a day) and processes it all at once. It's efficient and simple but introduces latency — you can't answer 'how many users signed up in the last 5 minutes?' until the next batch runs. Stream processing processes data record-by-record as it arrives, enabling real-time analytics, alerting, and continuous computation.

| Dimension | Batch Processing | Stream Processing |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (large datasets) | High (continuous) |
| Complexity | Lower (simple jobs) | Higher (stateful, windowing) |
| Use cases | Nightly reports, ETL, ML training | Real-time dashboards, fraud detection, alerting |
| Examples | Spark batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |

Windowing

Stream processing often requires aggregating events over a time window: 'count clicks per minute,' 'compute average order value per hour.' Windowing divides the infinite stream into finite chunks for aggregation.

| Window Type | Description | Example |
| --- | --- | --- |
| Tumbling | Fixed, non-overlapping intervals | Count events every 5 minutes (0:00-0:05, 0:05-0:10) |
| Sliding | Fixed size, moves by slide interval; windows overlap | Count events in last 5 min, updated every 1 min |
| Session | Dynamic; gap-based grouping of activity | Group all user clicks with < 30 min inactivity gap |
| Hopping | Fixed size, overlaps on hop interval | 5-min window, hop every 1 min (overlapping coverage) |
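The window types above come down to simple arithmetic on timestamps. The sketch below (class and method names are illustrative, not any real streaming API) shows that a tumbling window aligns each event to exactly one fixed bucket, while a hopping window assigns the same event to several overlapping windows:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Illustrative window-assignment arithmetic; not a real streaming API.
class WindowAssignment {
    // Tumbling: align the timestamp down to its fixed, non-overlapping bucket.
    static long tumblingWindowStart(long tsMs, long sizeMs) {
        return tsMs - (tsMs % sizeMs);
    }

    // Hopping: a window of `sizeMs` starting every `hopMs` covers each event
    // sizeMs / hopMs times.
    static List<Long> hoppingWindowStarts(long tsMs, long sizeMs, long hopMs) {
        List<Long> starts = new ArrayList<>();
        long first = tumblingWindowStart(tsMs, hopMs) - sizeMs + hopMs;
        for (long s = Math.max(0, first); s <= tsMs; s += hopMs) starts.add(s);
        return starts;
    }

    public static void main(String[] args) {
        long fiveMin = Duration.ofMinutes(5).toMillis();
        long oneMin = Duration.ofMinutes(1).toMillis();
        long ts = Duration.ofMinutes(7).plusSeconds(30).toMillis(); // 0:07:30

        // 0:07:30 lands in exactly one tumbling window (0:05-0:10)...
        System.out.println(tumblingWindowStart(ts, fiveMin)); // 300000 (= 0:05)
        // ...but in five overlapping hopping windows (starting 0:03 through 0:07).
        System.out.println(hoppingWindowStarts(ts, fiveMin, oneMin).size()); // 5
    }
}
```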

Event Time vs Processing Time

Processing time is when the event arrives at the stream processor. Event time is when the event actually occurred (recorded in the event itself). These differ due to network delays, mobile devices that batch events, or retries. For accurate analytics, always use event time — but this requires handling late-arriving events.

ℹ️

Watermarks handle late data

A watermark is a heuristic threshold: 'I believe all events up to time T have arrived — I'll now close and compute the window.' Events arriving after the watermark are either dropped, assigned to a late window, or trigger window recomputation. Apache Flink has first-class watermark support.
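The heuristic can be sketched in a few lines: track the maximum event time seen so far, and treat anything older than that maximum minus a fixed allowed lateness as late. This is an illustrative tracker, not Flink's actual WatermarkStrategy API:

```java
// Illustrative watermark tracker; Flink's real WatermarkStrategy API differs.
class WatermarkTracker {
    private final long allowedLatenessMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    WatermarkTracker(long allowedLatenessMs) {
        this.allowedLatenessMs = allowedLatenessMs;
    }

    // Observe an event; the max only ever moves forward, so the watermark does too.
    void observe(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
    }

    // "I believe all events up to this time have arrived."
    long currentWatermark() {
        return maxEventTimeMs - allowedLatenessMs;
    }

    // Events with timestamps behind the watermark are late.
    boolean isLate(long eventTimeMs) {
        return eventTimeMs < currentWatermark();
    }

    public static void main(String[] args) {
        WatermarkTracker wm = new WatermarkTracker(2_000); // tolerate 2s of lateness
        wm.observe(10_000);
        System.out.println(wm.currentWatermark()); // 8000
        System.out.println(wm.isLate(7_500));      // true: drop, late window, or recompute
        System.out.println(wm.isLate(9_000));      // false: still on time
    }
}
```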

Kafka Streams

Kafka Streams is a client library (not a separate cluster) that processes records from Kafka topics and writes results back to Kafka topics. Because it runs in your application process, there is no separate infrastructure to manage beyond your Kafka cluster.

```java
// Kafka Streams: count orders per product in 5-minute tumbling windows
import java.time.Duration;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");

orders
  .groupBy((key, order) -> order.getProductId())
  .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
  .count()
  .toStream()
  // unwrap the windowed key so the output topic gets plain String keys
  .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
  .to("order-counts-per-product");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
```

Kafka Streams uses state stores (RocksDB by default) for stateful operations like aggregations and joins. State is backed up to Kafka changelog topics, enabling recovery if a node fails.

Apache Flink

Apache Flink is a full-featured stream processing framework designed for large-scale, stateful computation. Unlike Kafka Streams (which runs in your app), Flink is a distributed execution engine with its own cluster (or Kubernetes deployment). Flink excels at complex event processing, ML inference pipelines, and exactly-once processing across heterogeneous sources.

| Feature | Kafka Streams | Apache Flink |
| --- | --- | --- |
| Deployment | Library; runs in your app | Separate cluster or Kubernetes |
| Sources | Kafka only | Kafka, S3, databases, HTTP, custom |
| State management | RocksDB (local) | RocksDB + remote checkpoints |
| Exactly-once | Yes (Kafka-to-Kafka) | Yes (across sources) |
| Complexity | Lower (simpler API) | Higher (but more powerful) |
| Best for | Kafka-native pipelines | Complex event processing, multi-source ETL |

Stream-Table Duality

A profound concept in stream processing: every stream can be viewed as a table (the current state) and every table can be viewed as a stream (the changelog). A Kafka topic of user-update events is a stream; compress it to keep only the latest event per userId and you have a table (a KTable in Kafka Streams). This duality enables stream-table joins: enriching a stream of orders with the latest user profile from a KTable.
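The duality is easy to demonstrate without any streaming framework: folding a changelog stream into a map keyed by userId yields the table, and the sequence of puts into that map is exactly the stream. A minimal sketch, with an illustrative event shape:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of stream-table duality: a changelog stream compacts into a table.
class StreamTableDuality {
    // Compact: keep only the latest value per key, like a compacted Kafka topic.
    static Map<String, String> toTable(List<Map.Entry<String, String>> stream) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> update : stream) {
            table.put(update.getKey(), update.getValue()); // later events overwrite
        }
        return table;
    }

    public static void main(String[] args) {
        // A stream of user-update events (userId -> plan), in arrival order.
        List<Map.Entry<String, String>> updates = List.of(
                Map.entry("u1", "free"),
                Map.entry("u2", "pro"),
                Map.entry("u1", "pro")); // u1 upgraded later

        System.out.println(toTable(updates)); // {u1=pro, u2=pro}
    }
}
```

This is also why Kafka changelog topics can rebuild a failed node's state store: replaying the changelog from the beginning reproduces the table.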

📌

Practical example: enriching events

You have a stream of click events and a table of user profiles (materialized from a users topic). With a stream-table join, each click event is enriched with the user's country and plan tier — in real time, without querying a database.
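A back-of-the-envelope version of that join, with hypothetical record shapes: look up each click's user in the materialized profile table and emit an enriched record.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of a stream-table join: enrich click events from a profile table.
// The record shapes here are hypothetical, not a real schema.
class StreamTableJoin {
    record Click(String userId, String page) {}
    record Profile(String country, String plan) {}
    record EnrichedClick(String userId, String page, String country, String plan) {}

    // For each click, join against the table materialized from the users topic.
    static List<EnrichedClick> enrich(List<Click> clicks, Map<String, Profile> profiles) {
        return clicks.stream()
                .filter(c -> profiles.containsKey(c.userId())) // inner-join semantics
                .map(c -> {
                    Profile p = profiles.get(c.userId());
                    return new EnrichedClick(c.userId(), c.page(), p.country(), p.plan());
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Profile> profiles = Map.of(
                "u1", new Profile("DE", "pro"),
                "u2", new Profile("US", "free"));
        List<Click> clicks = List.of(new Click("u1", "/pricing"), new Click("u3", "/home"));

        // u1's click is enriched; u3 has no profile yet, so it is dropped here.
        System.out.println(enrich(clicks, profiles));
    }
}
```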

Real-Time Analytics Architecture

[Diagram: Lambda-free real-time analytics. Kafka feeds a stream processor, which writes aggregates to an OLAP store for dashboards and raw events to a data lake.]
💡

Interview Tip

Stream processing in interviews often comes up in 'design a real-time leaderboard' or 'design analytics for a live event' questions. Key concepts to mention: (1) Kafka as the event source. (2) A stream processor (Kafka Streams for simplicity, Flink for scale). (3) Tumbling windows for periodic aggregation. (4) An OLAP database (ClickHouse, Druid, BigQuery) for low-latency queries. Explain the tradeoff: exactly-once is expensive — at-least-once + idempotent aggregation is usually sufficient.
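The at-least-once plus idempotent aggregation tradeoff can be made concrete: deduplicate by a unique event id before aggregating, so a redelivered event never double-counts. A minimal sketch (the event ids and unbounded in-memory set are illustrative; a real system bounds the set with a TTL or window):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: idempotent aggregation under at-least-once delivery.
// Dedup by event id so retried deliveries don't double-count.
class IdempotentCounter {
    private final Set<String> seenEventIds = new HashSet<>(); // bound with a TTL in practice
    private final Map<String, Long> countsByProduct = new HashMap<>();

    void process(String eventId, String productId) {
        if (!seenEventIds.add(eventId)) return; // duplicate delivery: ignore
        countsByProduct.merge(productId, 1L, Long::sum);
    }

    long count(String productId) {
        return countsByProduct.getOrDefault(productId, 0L);
    }

    public static void main(String[] args) {
        IdempotentCounter counter = new IdempotentCounter();
        counter.process("evt-1", "sku-42");
        counter.process("evt-2", "sku-42");
        counter.process("evt-1", "sku-42"); // redelivered by the broker
        System.out.println(counter.count("sku-42")); // 2, not 3
    }
}
```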
