Stream Processing
Real-time data processing: Kafka Streams, Apache Flink, windowing, stateful processing, and stream-table duality.
Batch Processing vs Stream Processing
Batch processing collects data over a period (an hour, a day) and processes it all at once. It's efficient and simple but introduces latency — you can't answer 'how many users signed up in the last 5 minutes?' until the next batch runs. Stream processing processes data record-by-record as it arrives, enabling real-time analytics, alerting, and continuous computation.
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (large datasets) | High (continuous) |
| Complexity | Lower — simple jobs | Higher — stateful, windowing |
| Use cases | Nightly reports, ETL, ML training | Real-time dashboards, fraud detection, alerting |
| Examples | Spark batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |
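The core difference above can be sketched in a few lines of plain Java: a batch job computes its answer once over the full dataset, while a stream processor maintains a running result that is queryable at any moment. Class and method names here are illustrative, not any framework's API.

```java
import java.util.List;

public class BatchVsStream {
    // Batch: wait for the full dataset, then compute once.
    static long batchCount(List<String> allEvents) {
        return allEvents.stream().filter(e -> e.equals("signup")).count();
    }

    // Stream: update a running result as each record arrives.
    static long runningCount = 0;
    static void onEvent(String event) {
        if (event.equals("signup")) runningCount++;
    }

    public static void main(String[] args) {
        List<String> events = List.of("signup", "click", "signup", "click", "signup");
        System.out.println(batchCount(events)); // 3, but only once the batch completes
        events.forEach(BatchVsStream::onEvent); // streaming: runningCount is current after every record
        System.out.println(runningCount);       // 3
    }
}
```

The streaming version can answer "how many signups so far?" after every record; the batch version cannot answer until the whole input is available.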
Windowing
Stream processing often requires aggregating events over a time window: 'count clicks per minute,' 'compute average order value per hour.' Windowing divides the infinite stream into finite chunks for aggregation.
| Window Type | Description | Example |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Count events every 5 minutes (0:00-0:05, 0:05-0:10) |
| Sliding | Fixed size, aligned to record timestamps rather than the clock (a new window per event) | Count events in the 5 min preceding each incoming event |
| Session | Dynamic — gap-based grouping of activity | Group all user clicks with < 30 min inactivity gap |
| Hopping | Fixed size, advances by a fixed hop interval; windows overlap | 5-min window, hop every 1 min (0:00-0:05, 0:01-0:06, ...) |
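The window-assignment math behind tumbling and hopping windows is simple modular arithmetic, sketched below in plain Java (class and method names are illustrative, not a framework API). A tumbling window assigns each timestamp to exactly one window; a hopping window assigns it to every overlapping window.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowAssignment {
    // Tumbling: each timestamp falls in exactly one window of `sizeMs`.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Hopping: each timestamp falls in every window of `sizeMs` that starts
    // on a hop boundary and still covers it — windows overlap when hop < size.
    static List<Long> hoppingWindowStarts(long timestampMs, long sizeMs, long hopMs) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMs - (timestampMs % hopMs); // last hop boundary at or before ts
        for (long start = latest; start > timestampMs - sizeMs; start -= hopMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 7 * 60_000 + 30_000; // an event at 0:07:30
        // 5-min tumbling window → starts at 0:05:00
        System.out.println(tumblingWindowStart(ts, 5 * 60_000));
        // 5-min windows hopping every 1 min → 5 overlapping windows contain the event
        System.out.println(hoppingWindowStarts(ts, 5 * 60_000, 60_000).size());
    }
}
```

An event at 0:07:30 lands in one 5-minute tumbling window (0:05-0:10) but in five overlapping hopping windows (0:03-0:08 through 0:07-0:12).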
Event Time vs Processing Time
Processing time is when the event arrives at the stream processor. Event time is when the event actually occurred (recorded in the event itself). These differ due to network delays, mobile devices that batch events, or retries. For accurate analytics, always use event time — but this requires handling late-arriving events.
Watermarks handle late data
A watermark is a heuristic threshold: 'I believe all events up to time T have arrived — I'll now close and compute the window.' Events arriving after the watermark are either dropped, assigned to a late window, or trigger window recomputation. Apache Flink has first-class watermark support.
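A common strategy is a bounded-out-of-orderness watermark: trail the highest event time seen so far by a fixed lateness tolerance (Flink ships this strategy; the sketch below is a from-scratch illustration, not Flink's API, and the class name is made up).

```java
public class BoundedLatenessWatermark {
    private final long maxLatenessMs;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    BoundedLatenessWatermark(long maxLatenessMs) {
        this.maxLatenessMs = maxLatenessMs;
    }

    // Watermark trails the highest event time seen by the allowed lateness:
    // "I believe all events up to this time have arrived."
    long currentWatermark() {
        return maxEventTimeSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeSeen - maxLatenessMs;
    }

    // Returns true if the event is on time (at or after the watermark), false if late.
    boolean observe(long eventTimeMs) {
        boolean onTime = eventTimeMs >= currentWatermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMs);
        return onTime;
    }

    public static void main(String[] args) {
        BoundedLatenessWatermark wm = new BoundedLatenessWatermark(10_000); // tolerate 10 s
        System.out.println(wm.observe(100_000)); // true  — first event
        System.out.println(wm.observe(95_000));  // true  — 5 s behind, within tolerance
        System.out.println(wm.observe(120_000)); // true  — advances watermark to 110 s
        System.out.println(wm.observe(105_000)); // false — watermark has passed; event is late
    }
}
```

The lateness bound is the tradeoff knob: a larger bound catches more stragglers but delays window results by that much.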
Kafka Streams
Kafka Streams is a client library (not a separate cluster) that processes records from Kafka topics and writes results back to Kafka topics. Because it runs in your application process, there is no separate infrastructure to manage beyond your Kafka cluster.
```java
// Kafka Streams: count orders per product in 5-minute tumbling windows
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");

orders
    .groupBy((key, order) -> order.getProductId())                    // re-key by product ID
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) // no grace period for late events
    .count()                                                          // KTable<Windowed<String>, Long>
    .toStream()
    .to("order-counts-per-product"); // writing the windowed key out requires a windowed serde

KafkaStreams streams = new KafkaStreams(builder.build(), props); // props: application.id, bootstrap.servers, default serdes
streams.start();
```

Kafka Streams uses state stores (RocksDB by default) for stateful operations like aggregations and joins. State is backed up to Kafka changelog topics, enabling recovery if a node fails.
Apache Flink
Apache Flink is a full-featured stream processing framework designed for large-scale, stateful computation. Unlike Kafka Streams (which runs in your app), Flink is a distributed execution engine with its own cluster (or Kubernetes deployment). Flink excels at complex event processing, ML inference pipelines, and exactly-once processing across heterogeneous sources.
| Feature | Kafka Streams | Apache Flink |
|---|---|---|
| Deployment | Library — runs in your app | Separate cluster or Kubernetes |
| Sources | Kafka only | Kafka, S3, databases, HTTP, custom |
| State management | RocksDB (local) | RocksDB + remote checkpoints |
| Exactly-once | Yes (Kafka-to-Kafka) | Yes (across sources) |
| Complexity | Lower — simpler API | Higher — but more powerful |
| Best for | Kafka-native pipelines | Complex event processing, multi-source ETL |
Stream-Table Duality
A profound concept in stream processing: every stream can be viewed as a table (the current state) and every table can be viewed as a stream (the changelog). A Kafka topic of user-update events is a stream; compress it to keep only the latest event per userId and you have a table (a KTable in Kafka Streams). This duality enables stream-table joins: enriching a stream of orders with the latest user profile from a KTable.
Practical example: enriching events
You have a stream of click events and a table of user profiles (materialized from a users topic). With a stream-table join, each click event is enriched with the user's country and plan tier — in real time, without querying a database.
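The mechanics of that join follow directly from the duality: consuming the changelog stream and keeping only the latest value per key materializes the table, and each stream event then does a local lookup against it. A minimal sketch in plain Java (class, method, and field names are illustrative, not the Kafka Streams API, which does this via KTable and a stream-table join):

```java
import java.util.HashMap;
import java.util.Map;

public class StreamTableJoin {
    // The "table": latest profile per userId, materialized from the update stream.
    private final Map<String, String> profileTable = new HashMap<>();

    // Consuming the profile stream as a changelog: keep only the latest value per key.
    void onProfileUpdate(String userId, String profile) {
        profileTable.put(userId, profile);
    }

    // Stream-table join: enrich each click with the current profile — a local
    // lookup, no database query on the hot path.
    String onClick(String userId, String page) {
        String profile = profileTable.getOrDefault(userId, "unknown");
        return userId + " clicked " + page + " [" + profile + "]";
    }

    public static void main(String[] args) {
        StreamTableJoin join = new StreamTableJoin();
        join.onProfileUpdate("u1", "US/free");
        join.onProfileUpdate("u1", "US/pro"); // later update wins — the table holds latest state
        System.out.println(join.onClick("u1", "/pricing")); // u1 clicked /pricing [US/pro]
    }
}
```

Replaying the update stream from the beginning rebuilds exactly this map — which is the duality in action, and why Kafka changelog topics are enough to recover a lost state store.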
Real-Time Analytics Architecture
A typical pipeline: producers publish events to Kafka; a stream processor consumes them and computes windowed aggregates; results land in an OLAP store (ClickHouse, Druid) that serves low-latency dashboard queries.
Interview Tip
Stream processing in interviews often comes up in 'design a real-time leaderboard' or 'design analytics for a live event' questions. Key concepts to mention: (1) Kafka as the event source. (2) A stream processor (Kafka Streams for simplicity, Flink for scale). (3) Tumbling windows for periodic aggregation. (4) An OLAP database (ClickHouse, Druid, BigQuery) for low-latency queries. Explain the tradeoff: exactly-once is expensive — at-least-once + idempotent aggregation is usually sufficient.
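The "at-least-once + idempotent aggregation" tradeoff mentioned above can be sketched concretely: if the aggregator deduplicates by event ID, redelivered events don't change the result, so exactly-once delivery is unnecessary. A minimal illustration (class and method names are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentAggregator {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    // At-least-once delivery may redeliver an event after a retry or rebalance.
    // Dedup by event ID makes the aggregation idempotent: processing a
    // duplicate leaves the total unchanged.
    void add(String eventId, long amount) {
        if (seenEventIds.add(eventId)) {
            total += amount;
        }
    }

    long total() { return total; }

    public static void main(String[] args) {
        IdempotentAggregator agg = new IdempotentAggregator();
        agg.add("e1", 10);
        agg.add("e2", 5);
        agg.add("e1", 10); // duplicate delivery — ignored
        System.out.println(agg.total()); // 15
    }
}
```

In production the seen-ID set needs bounding (TTL or windowed retention), but the principle is the point worth stating in an interview.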