Distributed Consensus: Paxos & Raft
How distributed systems agree on values: the consensus problem, Paxos protocol, Raft simplification, leader election, and log replication.
The Consensus Problem
Distributed consensus is the challenge of getting a group of nodes to agree on a single value — even when nodes crash, messages are delayed, or the network partitions. It is the backbone of every replicated system: distributed databases, coordination services (ZooKeeper, etcd), and leader election. Without a correct consensus protocol, replicated state machines diverge and data is lost or corrupted.
The FLP impossibility result (Fischer, Lynch, Paterson 1985) proves that no deterministic consensus algorithm can guarantee termination in an asynchronous network if even one process can crash. In practice, systems sidestep this by using timeouts and partial synchrony assumptions — the network may be slow, but eventually messages arrive.
Three properties consensus must satisfy
- Agreement: all non-faulty nodes decide the same value.
- Validity: the decided value must have been proposed by some node.
- Termination: all non-faulty nodes eventually decide.

Real protocols trade termination for safety under partition (see CAP theorem).
Paxos
Paxos (Lamport 1989, published 1998) was the first widely studied practical consensus algorithm. It operates in two phases — Phase 1 (Prepare/Promise) and Phase 2 (Accept/Accepted) — and uses three roles: Proposers initiate consensus, Acceptors vote on proposals, and Learners learn the chosen value. A majority quorum of acceptors must respond in each phase for progress.
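The two phases can be sketched for a single decree. The code below is a minimal, single-proposer illustration with synchronous in-memory acceptors; the names (`Acceptor`, `propose`) are ours, not a standard API, and a real deployment would add unique proposal numbers per proposer, retries, and networking.

```python
# Minimal single-decree Paxos sketch (illustrative names, not a library API).

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest proposal number promised
        self.accepted_n = 0        # proposal number of the accepted value
        self.accepted_v = None     # accepted value, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return (True, self.accepted_n, self.accepted_v)
        return (False, None, None)

    def accept(self, n, v):
        """Phase 2b: accept unless a higher number was promised."""
        if n >= self.promised:
            self.promised = self.accepted_n = n
            self.accepted_v = v
            return True
        return False

def propose(acceptors, n, value):
    """Run both phases against a majority quorum; return the chosen value."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1a: Prepare(n) to all acceptors
    granted = [(an, av) for ok, an, av in
               (a.prepare(n) for a in acceptors) if ok]
    if len(granted) < quorum:
        return None  # no quorum; a real proposer retries with a higher n
    # Key safety rule: if any acceptor already accepted a value,
    # we must propose the one with the highest proposal number.
    prev_n, prev_v = max(granted)
    if prev_v is not None:
        value = prev_v
    # Phase 2a: Accept(n, value)
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "A"))  # "A" is chosen
print(propose(acceptors, 2, "B"))  # still "A": a later proposer cannot overwrite it
```

The second call shows why Paxos is safe: once "A" is accepted by a quorum, any later proposer's Phase 1 responses will reveal it, forcing the proposer to re-propose "A" instead of "B".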
Multi-Paxos extends this to agree on a sequence of values (a log) by electing a stable leader who can skip Phase 1 for subsequent entries, reducing message complexity. However, Paxos is notoriously difficult to implement correctly — Lamport himself wrote a 2001 paper titled "Paxos Made Simple" to address the confusion.
Raft: Consensus for Understandable Systems
Raft (Ongaro & Ousterhout 2014) was designed explicitly for understandability. It decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Raft elects one strong leader at a time; all writes go through the leader, which simplifies reasoning about the system.
Terms are Raft's logical clock. Each term begins with an election. If a follower receives no heartbeat within an election timeout (typically 150–300 ms), it increments its term, becomes a candidate, and requests votes. A node grants its vote if the candidate's log is at least as up-to-date as its own. The first candidate to win a majority becomes leader.
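The vote-granting rule can be sketched as follows. The helper names (`grant_vote`, `log_up_to_date`) and the dict-based RPC shape are our illustrative assumptions, though the fields mirror the arguments of the Raft paper's RequestVote RPC.

```python
import random

def election_timeout_ms():
    # Randomized timeouts break ties: two followers rarely become
    # candidates at the same instant, so split votes are uncommon.
    return random.uniform(150, 300)

def log_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Raft's rule: the candidate's log must be at least as up-to-date.
    Compare last terms first; break ties on log length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(my_term, voted_for, my_last_term, my_last_index, candidate):
    """Decide a RequestVote RPC. `candidate` carries term, id,
    lastLogTerm, and lastLogIndex (per the Raft paper)."""
    if candidate["term"] < my_term:
        return False                          # stale candidate
    if voted_for not in (None, candidate["id"]):
        return False                          # one vote per term
    return log_up_to_date(candidate["lastLogTerm"], candidate["lastLogIndex"],
                          my_last_term, my_last_index)

c = {"term": 3, "id": "n2", "lastLogTerm": 2, "lastLogIndex": 7}
print(grant_vote(3, None, 2, 5, c))  # True: same last term, longer log
```

The up-to-date check is what keeps a node with a stale log from becoming leader and overwriting committed entries.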
Log Replication
Once a leader is elected, all client writes arrive at the leader, which appends the entry to its log and sends `AppendEntries` RPCs to followers in parallel. An entry is committed once a majority of nodes have written it to their logs. The leader then applies the entry to its state machine and responds to the client. Followers learn the new commit index from the `leaderCommit` field of the next `AppendEntries` (or heartbeat) and apply committed entries then.
```
// Raft Leader — handling a client write
function handleClientWrite(value):
    entry = { term: currentTerm, index: log.length + 1, value }
    log.append(entry)
    // Send AppendEntries to all followers in parallel
    responses = parallel_send_all(followers, AppendEntries {
        term: currentTerm,
        leaderId: self.id,
        prevLogIndex: entry.index - 1,
        prevLogTerm: log[entry.index - 1].term,  // log is 1-indexed here
        entries: [entry],
        leaderCommit: commitIndex
    })
    // Commit when a majority (counting the leader itself) acknowledges
    successCount = 1  // count self
    for r in responses:
        if r.success: successCount += 1
    if successCount > clusterSize / 2:
        commitIndex = entry.index
        applyToStateMachine(entry)
        return SUCCESS
    else:
        // The entry stays in the log; real Raft retries AppendEntries
        // to slow followers indefinitely rather than failing the write
        return RETRY
```

Paxos vs Raft Comparison
| Property | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Leadership | Implicit — any node can propose | Explicit strong leader |
| Log holes | Possible — entries can be sparse | No holes — sequential |
| Understandability | Notoriously complex | Designed to be simple |
| Leader election | Ad-hoc, no built-in mechanism | Term-based, randomized timeout |
| Reads from follower | Complex — requires lease/quorum | Supported with lease reads |
| Used in | Google Chubby, Apache ZooKeeper (ZAB variant) | etcd, CockroachDB, TiKV, Consul |
Real-World Systems
- etcd (Kubernetes): Uses Raft for storing cluster state. Every Kubernetes control-plane decision goes through etcd consensus.
- CockroachDB: Uses Raft per-range (64 MB shards of the keyspace), giving per-shard fault tolerance.
- Apache Kafka (post-3.0): Replaced ZooKeeper with KRaft (Kafka Raft) for its own metadata management.
- Google Spanner: Uses Paxos per shard combined with TrueTime for externally consistent transactions.
- Consul: Uses Raft for its service catalog and key-value storage; each datacenter runs its own Raft cluster, with WAN federation linking datacenters together.
Interview Tip
When asked about consensus in an interview, go beyond just naming Raft/Paxos. Explain the core insight: a majority quorum (N/2 + 1 nodes) guarantees that any two quorums overlap in at least one node, so the latest committed value is always known. Mention that Raft is used in etcd and therefore underpins every Kubernetes cluster. If the interviewer probes deeper, discuss how read-heavy systems use lease-based reads to avoid quorum round trips on reads.
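The overlap claim is easy to verify exhaustively for small clusters. This throwaway check (all names ours) enumerates every majority quorum and confirms that each pair of quorums shares at least one node:

```python
from itertools import combinations

def majority(n):
    # Smallest quorum size that guarantees pairwise overlap
    return n // 2 + 1

for n in (3, 5, 7):
    q = majority(n)
    quorums = [set(c) for c in combinations(range(n), q)]
    # Every pair of majority quorums must intersect: q + q > n
    assert all(a & b for a in quorums for b in quorums)
    print(f"N={n}: quorum size {q}, every quorum pair overlaps")
```

The arithmetic behind it: two quorums of size N/2 + 1 together contain N + 2 node slots, which exceeds N, so by pigeonhole they share at least one node, and that node carries the latest committed value.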