Leader Election
Choosing a coordinator: bully algorithm, ring algorithm, ZooKeeper-based election, and the relationship between leader election and consensus.
Why Leader Election?
Many distributed systems need exactly one coordinator at a time to avoid conflicts: a primary database node, a Kafka partition leader, a scheduler picking up jobs. Leader election is the process of automatically designating one node as the leader when none exists (system startup) or when the current leader fails. Done correctly, leader election guarantees safety (at most one leader at a time) and liveness (a new leader is eventually elected).
Safety vs Liveness tension
You cannot simultaneously guarantee safety (only one leader) and liveness (always make progress) under network partition — this is a direct consequence of the CAP theorem. Most production systems choose safety: they'd rather have no leader than two leaders. Having two leaders at once is called split-brain, and it can cause data corruption.
Bully Algorithm
The Bully algorithm (Garcia-Molina 1982) elects the node with the highest ID as leader. When a node detects the leader has failed (or on startup), it starts an election by sending an `ELECTION` message to all nodes with higher IDs. If no higher-ID node responds, it declares itself leader and sends `COORDINATOR` to all lower-ID nodes. If a higher-ID node responds, it takes over the election.
- Node P detects leader failure; sends ELECTION to all nodes with ID > P.
- If no response within timeout, P wins: sends COORDINATOR to all nodes.
- If node Q (ID > P) responds with OK, Q takes over the election.
- Q repeats the first step, sending ELECTION to nodes with ID > Q.
- The highest-ID alive node always wins — hence 'bully.'
Bully algorithm complexity
Worst case: O(N²) messages (when the lowest-ID node initiates election). Best case: O(N) messages (when the second-highest node initiates). Bully works fine for small clusters but does not scale to hundreds of nodes.
Ring Election Algorithm
The ring algorithm arranges nodes in a logical ring. When a node starts an election, it sends an election message with its ID clockwise around the ring. Each node compares the arriving ID to its own, forwarding the higher of the two. When the message makes a full circuit and returns to the initiator, the highest ID wins and a `LEADER` message is sent around the ring.
ZooKeeper-Based Leader Election
Production systems rarely implement raw election algorithms. Instead, they use a coordination service like ZooKeeper or etcd which provides a reliable primitive: ephemeral sequential znodes (ZooKeeper) or leases (etcd).
- All candidate nodes create an ephemeral sequential znode under `/election/`, e.g. `/election/n_0000000001`, `/election/n_0000000002`.
- Each node lists all children and checks if it has the smallest sequence number.
- If yes — it is the leader.
- If no — it sets a watch on the next-smaller node (not all nodes — avoids herd effect).
- When the current leader's session expires (leader crashes), its ephemeral znode is deleted, the watching node is notified and becomes the new leader.
Lease-Based Election with etcd
etcd provides a `campaign` API (based on its election package) that implements leader election using leases — time-bounded locks stored in etcd's Raft-backed key-value store. A node acquires leadership by creating a key with a TTL lease. It must periodically renew the lease. If it crashes, the lease expires and another candidate acquires it.
// etcd leader election example (Go)
import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func runLeaderElection(client *clientv3.Client) {
	// The session ties leadership to a 5-second TTL lease; if this
	// process dies, the lease expires and leadership is released.
	session, err := concurrency.NewSession(client, concurrency.WithTTL(5))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/leader")
	ctx := context.Background()

	// Campaign blocks until this node is elected
	if err := election.Campaign(ctx, "my-node-id"); err != nil {
		log.Fatal(err)
	}
	log.Println("I am the leader!")

	// Do leader work...

	// Resign when done or before shutdown
	if err := election.Resign(ctx); err != nil {
		log.Println("resign failed:", err)
	}
}
Leader Election and Consensus
Leader election IS consensus: getting all nodes to agree on which node is leader. This is why Raft bundles leader election into its consensus protocol — you can't have one without the other. ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), its own Paxos-like atomic broadcast protocol, as the internal consensus mechanism that makes leader election decisions reliable.
Interview Tip
A common interview question: 'your service needs exactly one instance running at a time (a cron job, a master process). How do you implement this?' The answer is distributed leader election using etcd or ZooKeeper. Describe the ephemeral key pattern: the winning node holds a lease-backed key, and all others watch it. When the leader crashes, the key expires and a new election is triggered. Mention that Kubernetes uses lease-based leader election (backed by etcd) for its own controller-manager and scheduler — a great real-world anchor.