Stream Processing
Real-time data processing: Kafka Streams, Apache Flink, windowing, stateful processing, and stream-table duality.
Batch Processing vs Stream Processing
Batch processing collects data over a period (an hour, a day) and processes it all at once. It's efficient and simple but introduces latency — you can't answer 'how many users signed up in the last 5 minutes?' until the next batch runs. Stream processing processes data record-by-record as it arrives, enabling real-time analytics, alerting, and continuous computation.
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (large datasets) | High (continuous) |
| Complexity | Lower — simple jobs | Higher — stateful, windowing |
| Use cases | Nightly reports, ETL, ML training | Real-time dashboards, fraud detection, alerting |
| Examples | Spark batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |
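The core difference above can be sketched in a few lines of plain Java: a batch job computes its answer once over the full dataset, while a stream processor maintains a running result that is queryable at any moment. Class and method names here are illustrative, not any framework's API.

```java
import java.util.List;

public class BatchVsStream {
    // Batch: wait for the full dataset, then compute once.
    static long batchCount(List<String> allEvents) {
        return allEvents.stream().filter(e -> e.equals("signup")).count();
    }

    // Stream: update a running result as each record arrives.
    static long runningCount = 0;
    static void onEvent(String event) {
        if (event.equals("signup")) runningCount++;
    }

    public static void main(String[] args) {
        List<String> events = List.of("signup", "click", "signup", "click", "signup");
        System.out.println(batchCount(events)); // 3, but only once the batch completes
        events.forEach(BatchVsStream::onEvent); // streaming: runningCount is current after every record
        System.out.println(runningCount);       // 3
    }
}
```

The streaming version can answer "how many signups so far?" after every record; the batch version cannot answer until the whole input is available.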
Windowing
Stream processing often requires aggregating events over a time window: 'count clicks per minute,' 'compute average order value per hour.' Windowing divides the infinite stream into finite chunks for aggregation.
| Window Type | Description | Example |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Count events every 5 minutes (0:00-0:05, 0:05-0:10) |
| Sliding | Fixed size, aligned to record timestamps rather than the clock (a new window per event) | Count events in the 5 min preceding each incoming event |
| Session | Dynamic — gap-based grouping of activity | Group all user clicks with < 30 min inactivity gap |
| Hopping | Fixed size, advances by a fixed hop interval; windows overlap | 5-min window, hop every 1 min (0:00-0:05, 0:01-0:06, ...) |
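The window-assignment math behind tumbling and hopping windows is simple modular arithmetic, sketched below in plain Java (class and method names are illustrative, not a framework API). A tumbling window assigns each timestamp to exactly one window; a hopping window assigns it to every overlapping window.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowAssignment {
    // Tumbling: each timestamp falls in exactly one window of `sizeMs`.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Hopping: each timestamp falls in every window of `sizeMs` that starts
    // on a hop boundary and still covers it — windows overlap when hop < size.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long hopMs) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMs - (timestampMs % hopMs); // last hop boundary at or before ts
        for (long start = latest; start > timestampMs - sizeMs; start -= hopMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 7 * 60_000 + 30_000; // an event at 0:07:30
        // 5-min tumbling window → starts at 0:05:00
        System.out.println(tumblingWindowStart(ts, 5 * 60_000));
        // 5-min windows hopping every 1 min → 5 overlapping windows contain the event
        System.out.println(hoppingWindowStarts(ts, 5 * 60_000, 60_000).size());
    }
}
```

An event at 0:07:30 lands in one 5-minute tumbling window (0:05-0:10) but in five overlapping hopping windows (0:03-0:08 through 0:07-0:12).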
Event Time vs Processing Time
Processing time is when the event arrives at the stream processor. Event time is when the event actually occurred (recorded in the event itself). These differ due to network delays, mobile devices that batch events, or retries. For accurate analytics, always use event time — but this requires handling late-arriving events.
Watermarks handle late data
A watermark is a heuristic threshold: 'I believe all events up to time T have arrived — I'll now close and compute the window.' Events arriving after the watermark are either dropped, assigned to a late window, or trigger window recomputation. Apache Flink has first-class watermark support.
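A common strategy is a bounded-out-of-orderness watermark: trail the highest event time seen so far by a fixed lateness tolerance (Flink ships this strategy; the sketch below is a from-scratch illustration, not Flink's API, and the class name is made up).

```java
public class BoundedLatenessWatermark {
    private final long maxLatenessMs;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    BoundedLatenessWatermark(long maxLatenessMs) {
        this.maxLatenessMs = maxLatenessMs;
    }

    // Watermark trails the highest event time seen by the allowed lateness:
    // "I believe all events up to this time have arrived."
    long currentWatermark() {
        return maxEventTimeSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeSeen - maxLatenessMs;
    }

    // Returns true if the event is on time (at or after the watermark), false if late.
    boolean observe(long eventTimeMs) {
        boolean onTime = eventTimeMs >= currentWatermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMs);
        return onTime;
    }

    public static void main(String[] args) {
        BoundedLatenessWatermark wm = new BoundedLatenessWatermark(10_000); // tolerate 10 s
        System.out.println(wm.observe(100_000)); // true  — first event
        System.out.println(wm.observe(95_000));  // true  — 5 s behind, within tolerance
        System.out.println(wm.observe(120_000)); // true  — advances watermark to 110 s
        System.out.println(wm.observe(105_000)); // false — watermark has passed; event is late
    }
}
```

The lateness bound is the tradeoff knob: a larger bound catches more stragglers but delays window results by that much.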
Kafka Streams
Kafka Streams is a client library (not a separate cluster) that processes records from Kafka topics and writes results back to Kafka topics. Because it runs in your application process, there is no separate infrastructure to manage beyond your Kafka cluster.
```java
// Kafka Streams: count orders per product in 5-minute tumbling windows
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");

orders
    .groupBy((key, order) -> order.getProductId())                    // re-key by product ID
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) // no grace period for late events
    .count()                                                          // KTable<Windowed<String>, Long>
    .toStream()
    .to("order-counts-per-product"); // writing the windowed key out requires a windowed serde

KafkaStreams streams = new KafkaStreams(builder.build(), props); // props: application.id, bootstrap.servers, default serdes
streams.start();
```

Kafka Streams uses state stores (RocksDB by default) for stateful operations like aggregations and joins. State is backed up to Kafka changelog topics, enabling recovery if a node fails.
Apache Flink
Apache Flink is a full-featured stream processing framework designed for large-scale, stateful computation. Unlike Kafka Streams (which runs in your app), Flink is a distributed execution engine with its own cluster (or Kubernetes deployment). Flink excels at complex event processing, ML inference pipelines, and exactly-once processing across heterogeneous sources.
| Feature | Kafka Streams | Apache Flink |
|---|---|---|
| Deployment | Library — runs in your app | Separate cluster or Kubernetes |
| Sources | Kafka only | Kafka, S3, databases, HTTP, custom |
| State management | RocksDB (local) | RocksDB + remote checkpoints |
| Exactly-once | Yes (Kafka-to-Kafka) | Yes (across sources) |
| Complexity | Lower — simpler API | Higher — but more powerful |
| Best for | Kafka-native pipelines | Complex event processing, multi-source ETL |
Stream-Table Duality
A profound concept in stream processing: every stream can be viewed as a table (the current state) and every table can be viewed as a stream (the changelog). A Kafka topic of user-update events is a stream; compress it to keep only the latest event per userId and you have a table (a KTable in Kafka Streams). This duality enables stream-table joins: enriching a stream of orders with the latest user profile from a KTable.
Practical example: enriching events
You have a stream of click events and a table of user profiles (materialized from a users topic). With a stream-table join, each click event is enriched with the user's country and plan tier — in real time, without querying a database.
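The mechanics of that join follow directly from the duality: consuming the changelog stream and keeping only the latest value per key materializes the table, and each stream event then does a local lookup against it. A minimal sketch in plain Java (class, method, and field names are illustrative, not the Kafka Streams API, which does this via KTable and a stream-table join):

```java
import java.util.HashMap;
import java.util.Map;

public class StreamTableJoin {
    // The "table": latest profile per userId, materialized from the update stream.
    private final Map<String, String> profileTable = new HashMap<>();

    // Consuming the profile stream as a changelog: keep only the latest value per key.
    void onProfileUpdate(String userId, String profile) {
        profileTable.put(userId, profile);
    }

    // Stream-table join: enrich each click with the current profile — a local
    // lookup, no database query on the hot path.
    String onClick(String userId, String page) {
        String profile = profileTable.getOrDefault(userId, "unknown");
        return userId + " clicked " + page + " [" + profile + "]";
    }

    public static void main(String[] args) {
        StreamTableJoin join = new StreamTableJoin();
        join.onProfileUpdate("u1", "US/free");
        join.onProfileUpdate("u1", "US/pro"); // later update wins — the table holds latest state
        System.out.println(join.onClick("u1", "/pricing")); // u1 clicked /pricing [US/pro]
    }
}
```

Replaying the update stream from the beginning rebuilds exactly this map — which is the duality in action, and why Kafka changelog topics are enough to recover a lost state store.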
Real-Time Analytics Architecture
A typical pipeline: producers publish events to Kafka; a stream processor consumes them and computes windowed aggregates; results land in an OLAP store (ClickHouse, Druid) that serves low-latency dashboard queries.
Interview Tip
Stream processing in interviews often comes up in 'design a real-time leaderboard' or 'design analytics for a live event' questions. Key concepts to mention: (1) Kafka as the event source. (2) A stream processor (Kafka Streams for simplicity, Flink for scale). (3) Tumbling windows for periodic aggregation. (4) An OLAP database (ClickHouse, Druid, BigQuery) for low-latency queries. Explain the tradeoff: exactly-once is expensive — at-least-once + idempotent aggregation is usually sufficient.
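The "at-least-once + idempotent aggregation" tradeoff mentioned above can be sketched concretely: if the aggregator deduplicates by event ID, redelivered events don't change the result, so exactly-once delivery is unnecessary. A minimal illustration (class and method names are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentAggregator {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    // At-least-once delivery may redeliver an event after a retry or rebalance.
    // Dedup by event ID makes the aggregation idempotent: processing a
    // duplicate leaves the total unchanged.
    void add(String eventId, long amount) {
        if (seenEventIds.add(eventId)) {
            total += amount;
        }
    }

    long total() { return total; }

    public static void main(String[] args) {
        IdempotentAggregator agg = new IdempotentAggregator();
        agg.add("e1", 10);
        agg.add("e2", 5);
        agg.add("e1", 10); // duplicate delivery — ignored
        System.out.println(agg.total()); // 15
    }
}
```

In production the seen-ID set needs bounding (TTL or windowed retention), but the principle is the point worth stating in an interview.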