Apache Kafka Deep Dive
Kafka architecture: brokers, topics, partitions, consumer groups, offsets, exactly-once semantics, and when Kafka is the right choice.
What Makes Kafka Different
Apache Kafka is not a traditional message queue. It is a distributed commit log — an append-only, ordered, durable sequence of records. Unlike RabbitMQ or SQS where messages are deleted after consumption, Kafka retains messages for a configurable retention period (days, weeks, or forever with tiered storage). Consumers track their own position using offsets. This fundamentally changes the replay and reprocessing story.
Core Architecture
Topics and Partitions
A Kafka topic is divided into one or more partitions. Each partition is an ordered, immutable log. Producers write to partitions (round-robin by default, or keyed by a partition key). Consumers read from partitions sequentially using an offset — an integer index into the log.
Partitions are the unit of parallelism. With 12 partitions and 12 consumer instances in a group, you get 12x parallel consumption. You cannot have more active consumers than partitions in a group — extra consumers sit idle.
Partition key selection matters
When you provide a partition key (e.g., `userId`, `orderId`), Kafka hashes it to always route records with the same key to the same partition. This guarantees ordering within a key. Without a key, Kafka round-robins across partitions — higher throughput, no per-key ordering.
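The routing rule can be sketched in a few lines. This is an illustrative stand-in, not Kafka's actual partitioner: the real default partitioner uses murmur2 hashing, while this sketch substitutes `String.hashCode` to show the hash-mask-modulo shape of the logic.

```java
public class PartitionRouting {
    // Illustrative keyed partitioning: hash the key, mask off the sign bit,
    // take it modulo the partition count. (Kafka's default partitioner uses
    // murmur2, not String.hashCode — this is a sketch of the shape only.)
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        // The same key always maps to the same partition, so all records
        // for user-42 land on one partition and stay ordered.
        System.out.println(partitionFor("user-42", partitions)
                == partitionFor("user-42", partitions)); // true
    }
}
```

Because the mapping is deterministic, changing the partition count reshuffles keys to different partitions, which is why growing a keyed topic breaks per-key ordering across the resize boundary.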
Consumer Groups and Offsets
A consumer group is a set of consumer instances that share the work of reading a topic. Kafka assigns each partition to exactly one consumer in the group. If an instance fails, Kafka rebalances — redistributing that partition to another instance. Each group independently tracks its own committed offset per partition in an internal Kafka topic called `__consumer_offsets`.
| Scenario | Partition Count | Consumer Instances | Result |
|---|---|---|---|
| Under-provisioned | 4 | 2 | Each consumer reads 2 partitions — works fine |
| Perfectly matched | 4 | 4 | Each consumer reads 1 partition — maximum parallelism |
| Over-provisioned | 4 | 8 | 4 consumers active, 4 idle — wasted resources |
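The table above can be reproduced with a toy assignment function. This is a hypothetical round-robin sketch, not Kafka's actual range/sticky assignors, but it shows the invariant that matters: each partition goes to exactly one consumer, and surplus consumers get nothing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAssignment {
    // Sketch of spreading partitions across a consumer group: every partition
    // is owned by exactly one consumer; extra consumers receive no partitions.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 4 partitions, 8 consumers: half the group sits idle.
        Map<String, List<Integer>> a =
                assign(4, List.of("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"));
        long idle = a.values().stream().filter(List::isEmpty).count();
        System.out.println("idle consumers: " + idle); // prints "idle consumers: 4"
    }
}
```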
Replication and Durability
Each partition has a replication factor (typically 3). One broker is the leader for that partition — it handles all writes and, by default, all reads. The other brokers are followers that replicate from the leader. If the leader fails, one of the in-sync followers is elected leader. The `acks` producer setting controls durability:
| acks setting | Behavior | Risk |
|---|---|---|
| acks=0 | Fire and forget — no wait for broker | Message loss on broker failure |
| acks=1 | Wait for leader to acknowledge | Loss if leader fails before replication |
| acks=all (-1) | Wait for all in-sync replicas (ISR) | Safest: with `min.insync.replicas=2`, an acknowledged write survives a single broker failure |
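On the producer side these durability choices are plain configuration. A minimal sketch of the safest combination follows; the broker address is a placeholder, and `min.insync.replicas` is a broker/topic setting, not a producer one.

```java
import java.util.Properties;

public class DurableProducerConfig {
    // Producer settings for the "acks=all" row of the table above.
    public static Properties durableProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder address
        props.put("acks", "all");                     // wait for all in-sync replicas
        props.put("enable.idempotence", "true");      // retries cannot create duplicates
        // Broker/topic side, min.insync.replicas=2 makes acks=all actually
        // require two live copies before acknowledging.
        return props;
    }

    public static void main(String[] args) {
        System.out.println(durableProps().getProperty("acks")); // prints "all"
    }
}
```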
Exactly-Once Semantics
Kafka supports idempotent producers (enable with `enable.idempotence=true`) to prevent duplicate writes from retries. For end-to-end exactly-once, Kafka transactions let you atomically write to multiple partitions and commit consumer offsets in a single operation. This enables read-process-write pipelines (Kafka Streams style) with exactly-once guarantees. Downstream consumers of such pipelines should set `isolation.level=read_committed` so they never see records from aborted transactions.
```java
// Kafka producer with exactly-once semantics (Java).
// Assumes key, value, consumerGroupId, and offsets (a Map<TopicPartition,
// OffsetAndMetadata> built from the consumer's last poll) are defined elsewhere.
Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("transactional.id", "my-transactional-id");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("output-topic", key, value));
    // Atomically commit the processed offsets together with the new record
    producer.sendOffsetsToTransaction(offsets, consumerGroupId);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close(); // another producer with the same transactional.id took over
} catch (Exception e) {
    producer.abortTransaction(); // roll back both the record and the offsets
}
```
Kafka's Retention Model vs Traditional Queues
This is where Kafka fundamentally differs. Traditional queues delete messages after acknowledgment. Kafka retains messages for a configured period (default 7 days). This enables:
- Replay — Re-process all events from the beginning when you deploy a new consumer service
- Multiple independent consumers — Each consumer group has its own offset; 10 different teams can read the same topic independently
- Time-travel debugging — Rewind a consumer to a specific timestamp to diagnose production issues
- Event sourcing — The Kafka topic IS the source of truth; the database is a projection
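A toy in-memory model makes the mechanics behind these four points concrete. The classes and method names below are hypothetical, not the Kafka API: the log is an append-only list that reads never delete from, and each "group" owns nothing but an offset into it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetentionDemo {
    // A partition modeled as an append-only list; reads never remove records.
    static final List<String> log = new ArrayList<>(List.of("e0", "e1", "e2", "e3"));

    // Each consumer group independently tracks its own offset into the same log.
    static final Map<String, Integer> groupOffsets = new HashMap<>();

    static List<String> poll(String group, int max) {
        int offset = groupOffsets.getOrDefault(group, 0);
        int end = Math.min(offset + max, log.size());
        List<String> records = new ArrayList<>(log.subList(offset, end));
        groupOffsets.put(group, end); // advances this group's offset only
        return records;
    }

    static void seekToBeginning(String group) {
        groupOffsets.put(group, 0);   // replay: rewind the offset, the log is untouched
    }

    public static void main(String[] args) {
        System.out.println(poll("billing", 2));   // prints "[e0, e1]"
        System.out.println(poll("analytics", 4)); // prints "[e0, e1, e2, e3]"
        seekToBeginning("billing");
        System.out.println(poll("billing", 4));   // prints "[e0, e1, e2, e3]"
    }
}
```

Because `analytics` and `billing` never share an offset, ten teams can read the topic at their own pace, and rewinding one group costs the others nothing.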
When to Choose Kafka
| Choose Kafka when... | Choose a traditional queue when... |
|---|---|
| High throughput (millions of events/sec) | Moderate throughput is sufficient |
| Multiple independent consumers need the same events | Single consumer per message is fine |
| Replay and reprocessing are required | Messages can be discarded after processing |
| Event sourcing / audit log use case | Simple task queue (jobs, background work) |
| Stream processing with Kafka Streams | Rich routing logic (RabbitMQ exchanges) |
Kafka is operationally heavy
Kafka historically required ZooKeeper; newer versions replace it with KRaft, Kafka's built-in consensus layer. Either way, it demands careful partition planning, monitoring of consumer lag, and tuning of retention and replication. Managed services (Amazon MSK, Confluent Cloud) significantly reduce this burden, but Kafka is still heavier than SQS or RabbitMQ for simple use cases.
Interview Tip
Kafka comes up constantly in system design interviews. Key points to hit: (1) It's a distributed commit log, not just a queue — messages are retained. (2) Partitions are the unit of parallelism — scale consumers by adding partitions. (3) Consumer groups let multiple independent services read the same events. (4) Use a partition key for ordering guarantees within a key. If asked 'Kafka vs SQS?' — Kafka wins on throughput, replay, and multi-consumer; SQS wins on simplicity and managed operations.
Practice this pattern
Design a real-time event processing pipeline using Kafka