Distributed Consensus: Paxos & Raft
How distributed systems agree on values: the consensus problem, Paxos protocol, Raft simplification, leader election, and log replication.
The Consensus Problem
Distributed consensus is the challenge of getting a group of nodes to agree on a single value — even when nodes crash, messages are delayed, or the network partitions. It is the backbone of every replicated system: distributed databases, coordination services (ZooKeeper, etcd), and leader election. Without a correct consensus protocol, replicated state machines diverge and data is lost or corrupted.
The FLP impossibility result (Fischer, Lynch, Paterson 1985) proves that no deterministic consensus algorithm can guarantee termination in an asynchronous network if even one process can crash. In practice, systems sidestep this by using timeouts and partial synchrony assumptions — the network may be slow, but eventually messages arrive.
Three properties consensus must satisfy
- Agreement: all non-faulty nodes decide the same value.
- Validity: the decided value must have been proposed by some node.
- Termination: all non-faulty nodes eventually decide.

Real protocols trade termination for safety under partition (see CAP theorem).
Paxos
Paxos (Lamport 1989, published 1998) was the first widely studied practical consensus algorithm. It operates in two phases — Phase 1 (Prepare/Promise) and Phase 2 (Accept/Accepted) — and uses three roles: Proposers initiate consensus, Acceptors vote on proposals, and Learners learn the chosen value. A majority quorum of acceptors must respond in each phase for progress.
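The two phases can be sketched for a single decree. The code below is a minimal, single-proposer illustration with synchronous in-memory acceptors; the names (`Acceptor`, `propose`) are ours, not a standard API, and a real deployment would add unique proposal numbers per proposer, retries, and networking.

```python
# Minimal single-decree Paxos sketch (illustrative names, not a library API).

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest proposal number promised
        self.accepted_n = 0        # proposal number of the accepted value
        self.accepted_v = None     # accepted value, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return (True, self.accepted_n, self.accepted_v)
        return (False, None, None)

    def accept(self, n, v):
        """Phase 2b: accept unless a higher number was promised."""
        if n >= self.promised:
            self.promised = self.accepted_n = n
            self.accepted_v = v
            return True
        return False

def propose(acceptors, n, value):
    """Run both phases against a majority quorum; return the chosen value."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1a: Prepare(n) to all acceptors
    granted = [(an, av) for ok, an, av in
               (a.prepare(n) for a in acceptors) if ok]
    if len(granted) < quorum:
        return None  # no quorum; a real proposer retries with a higher n
    # Key safety rule: if any acceptor already accepted a value,
    # we must propose the one with the highest proposal number.
    prev_n, prev_v = max(granted)
    if prev_v is not None:
        value = prev_v
    # Phase 2a: Accept(n, value)
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "A"))  # "A" is chosen
print(propose(acceptors, 2, "B"))  # still "A": a later proposer cannot overwrite it
```

The second call shows why Paxos is safe: once "A" is accepted by a quorum, any later proposer's Phase 1 responses will reveal it, forcing the proposer to re-propose "A" instead of "B".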
Multi-Paxos extends this to agree on a sequence of values (a log) by electing a stable leader who can skip Phase 1 for subsequent entries, reducing message complexity. However, Paxos is notoriously difficult to implement correctly — Lamport himself wrote a 2001 paper titled "Paxos Made Simple" to address the confusion.
Raft: Consensus for Understandable Systems
Raft (Ongaro & Ousterhout 2014) was designed explicitly for understandability. It decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Raft elects one strong leader at a time; all writes go through the leader, which simplifies reasoning about the system.
Terms are Raft's logical clock. Each term begins with an election. If a follower receives no heartbeat within an election timeout (typically 150–300 ms), it increments its term, becomes a candidate, and requests votes. A node grants its vote if the candidate's log is at least as up-to-date as its own. The first candidate to win a majority becomes leader.
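The vote-granting rule can be sketched as follows. The helper names (`grant_vote`, `log_up_to_date`) and the dict-based RPC shape are our illustrative assumptions, though the fields mirror the arguments of the Raft paper's RequestVote RPC.

```python
import random

def election_timeout_ms():
    # Randomized timeouts break ties: two followers rarely become
    # candidates at the same instant, so split votes are uncommon.
    return random.uniform(150, 300)

def log_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Raft's rule: the candidate's log must be at least as up-to-date.
    Compare last terms first; break ties on log length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(my_term, voted_for, my_last_term, my_last_index, candidate):
    """Decide a RequestVote RPC. `candidate` carries term, id,
    lastLogTerm, and lastLogIndex (per the Raft paper)."""
    if candidate["term"] < my_term:
        return False                          # stale candidate
    if voted_for not in (None, candidate["id"]):
        return False                          # one vote per term
    return log_up_to_date(candidate["lastLogTerm"], candidate["lastLogIndex"],
                          my_last_term, my_last_index)

c = {"term": 3, "id": "n2", "lastLogTerm": 2, "lastLogIndex": 7}
print(grant_vote(3, None, 2, 5, c))  # True: same last term, longer log
```

The up-to-date check is what keeps a node with a stale log from becoming leader and overwriting committed entries.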
Log Replication
Once a leader is elected, all client writes arrive at the leader, which appends the entry to its log and sends `AppendEntries` RPCs to followers in parallel. An entry is committed once a majority of nodes have written it to their logs. The leader then applies the entry to its state machine and responds to the client. Followers learn the new commit index from the `leaderCommit` field of the next `AppendEntries` (or heartbeat) and apply committed entries then.
```
// Raft Leader — handling a client write
function handleClientWrite(value):
    entry = { term: currentTerm, index: log.length + 1, value }
    log.append(entry)
    // Send AppendEntries to all followers in parallel
    responses = parallel_send_all(followers, AppendEntries {
        term: currentTerm,
        leaderId: self.id,
        prevLogIndex: entry.index - 1,
        prevLogTerm: log[entry.index - 1].term,  // log is 1-indexed here
        entries: [entry],
        leaderCommit: commitIndex
    })
    // Commit when a majority (counting the leader itself) acknowledges
    successCount = 1  // count self
    for r in responses:
        if r.success: successCount += 1
    if successCount > clusterSize / 2:
        commitIndex = entry.index
        applyToStateMachine(entry)
        return SUCCESS
    else:
        // The entry stays in the log; real Raft retries AppendEntries
        // to slow followers indefinitely rather than failing the write
        return RETRY
```

Paxos vs Raft Comparison
| Property | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Leadership | Implicit — any node can propose | Explicit strong leader |
| Log holes | Possible — entries can be sparse | No holes — sequential |
| Understandability | Notoriously complex | Designed to be simple |
| Leader election | Ad-hoc, no built-in mechanism | Term-based, randomized timeout |
| Reads from follower | Complex — requires lease/quorum | Supported with lease reads |
| Used in | Google Chubby, Apache ZooKeeper (ZAB variant) | etcd, CockroachDB, TiKV, Consul |
Real-World Systems
- etcd (Kubernetes): Uses Raft for storing cluster state. Every Kubernetes control-plane decision goes through etcd consensus.
- CockroachDB: Uses Raft per-range (64 MB shards of the keyspace), giving per-shard fault tolerance.
- Apache Kafka (post-3.0): Replaced ZooKeeper with KRaft (Kafka Raft) for its own metadata management.
- Google Spanner: Uses Paxos per shard combined with TrueTime for externally consistent transactions.
- Consul: Uses Raft for its service catalog and key-value storage; each datacenter runs its own Raft cluster, with WAN federation linking datacenters together.
Interview Tip
When asked about consensus in an interview, go beyond just naming Raft/Paxos. Explain the core insight: a majority quorum (N/2 + 1 nodes) guarantees that any two quorums overlap in at least one node, so the latest committed value is always known. Mention that Raft is used in etcd and therefore underpins every Kubernetes cluster. If the interviewer probes deeper, discuss how read-heavy systems use lease-based reads to avoid quorum round trips on reads.
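The overlap claim is easy to verify exhaustively for small clusters. This throwaway check (all names ours) enumerates every majority quorum and confirms that each pair of quorums shares at least one node:

```python
from itertools import combinations

def majority(n):
    # Smallest quorum size that guarantees pairwise overlap
    return n // 2 + 1

for n in (3, 5, 7):
    q = majority(n)
    quorums = [set(c) for c in combinations(range(n), q)]
    # Every pair of majority quorums must intersect: q + q > n
    assert all(a & b for a in quorums for b in quorums)
    print(f"N={n}: quorum size {q}, every quorum pair overlaps")
```

The arithmetic behind it: two quorums of size N/2 + 1 together contain N + 2 node slots, which exceeds N, so by pigeonhole they share at least one node, and that node carries the latest committed value.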