Split-Brain Problem
When network partitions divide a cluster: split-brain scenarios, quorum-based prevention, fencing tokens, and STONITH strategies.
What Is Split-Brain?
Split-brain occurs when a network partition divides a cluster into two or more groups and each group, unable to see the others, independently elects its own leader. Both leaders then accept writes, and when the partition heals the groups hold divergent state that may be impossible to reconcile automatically. It is one of the most dangerous failure modes in distributed databases.
For example: a 5-node database cluster partitions into a group of 3 and a group of 2. The group of 2, not knowing about the network partition, elects a new leader and begins accepting writes. The original group of 3 also has a leader accepting writes. When the partition heals, both groups have made conflicting changes to the same data.
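The divergence in this example can be sketched with a toy in-memory model (plain Python; the key and values are illustrative, not from any real system):

```python
# Toy model: after a partition, each side applies writes independently.
majority_side = {"balance": 100}   # group of 3, keeps its leader
minority_side = {"balance": 100}   # group of 2, elected its own leader

# During the partition, each leader accepts a conflicting write.
majority_side["balance"] = 150     # deposit applied on one side
minority_side["balance"] = 40      # withdrawal applied on the other

# When the partition heals, the replicas disagree, and nothing in the
# data itself says which history is correct.
diverged = majority_side["balance"] != minority_side["balance"]
```

Both writes succeeded from the clients' point of view, which is exactly why reconciliation after the fact is so hard.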
Prevention: Quorum-Based Consensus
The primary defense against split-brain is quorum-based consensus: a leader can be elected, and can commit writes, only with the acknowledgment of a strict majority (⌊N/2⌋ + 1) of nodes. Because two disjoint majorities cannot exist in the same cluster, at most one side of any partition can make progress, so split-brain is prevented.
- In a 5-node cluster, majority = 3. The partition of 2 cannot elect a leader or commit writes.
- In a 3-node cluster, majority = 2. A partition of 1 cannot lead; a partition of 2 can still elect a leader.
- This is why odd-numbered cluster sizes are strongly preferred — even-numbered clusters waste one vote for no extra fault tolerance (4-node ≈ 3-node in practice).
| Cluster Size | Majority Required | Nodes That Can Fail | Notes |
|---|---|---|---|
| 3 | 2 | 1 | Minimum for fault tolerance |
| 5 | 3 | 2 | Common production choice |
| 7 | 4 | 3 | Large, high-availability clusters |
| 4 | 3 | 1 | No better than 3-node cluster |
| 6 | 4 | 2 | No better than 5-node cluster |
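The quorum arithmetic behind the table above can be sketched in a few lines of Python:

```python
# Quorum arithmetic for an n-node cluster.
def majority(n: int) -> int:
    """Smallest group that can elect a leader: floor(n/2) + 1."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """Nodes that can fail while a quorum still exists."""
    return n - majority(n)

# A 4-node cluster tolerates no more failures than a 3-node one,
# and 6 no more than 5: this is why odd sizes are preferred.
```

For example, `tolerable_failures(4)` and `tolerable_failures(3)` both return 1: the fourth node raises the quorum threshold without adding fault tolerance.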
Fencing Tokens
Even with quorum, a subtle variant of split-brain can occur due to process pauses: a new leader is elected with a higher term, but the old leader, paused by a long GC cycle or OS scheduling, resumes and still believes it is leader, having had no chance to learn of the new election. This is mitigated with fencing tokens: a monotonically increasing number issued with each leadership grant.
Every write the leader makes must include its fencing token. Storage systems reject any write with a fencing token lower than the highest seen. When a stale leader resumes and tries to write with its old (lower) token, the storage rejects it — preventing the stale leader from corrupting data.
# Storage system with fencing-token enforcement (Python)
class FencedStorage:
    def __init__(self):
        self.highest_seen_token = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        # A token lower than the highest seen means the writer is a
        # stale leader that was deposed while paused: reject the write.
        if fencing_token < self.highest_seen_token:
            return "ERROR_STALE_LEADER"
        self.highest_seen_token = fencing_token
        self.data[key] = value
        return "SUCCESS"

# A leader acquires its token when elected, e.g. the creation
# transaction id (czxid) of its ZooKeeper lock znode, which is
# monotonically increasing. Every write then carries that token.
STONITH: Shoot The Other Node In The Head
STONITH (Shoot The Other Node In The Head) is a hardware-level fencing strategy used in high-availability clustering. When a node is suspected of being a rogue leader, the cluster forcefully powers it off via an out-of-band management interface (IPMI, iDRAC, AWS EC2 terminate API). This is more reliable than software-level fencing because a node that is alive but partitioned cannot defend itself from a power-off command.
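As a concrete sketch, a STONITH agent often shells out to `ipmitool` to cut power over the management network. The host, user, and password below are illustrative placeholders; real agents also verify the power state afterward:

```python
import subprocess

def stonith_power_off(bmc_host: str, user: str, password: str,
                      dry_run: bool = True):
    """Fence a suspected rogue node by powering it off via its BMC.

    Issues an out-of-band `ipmitool chassis power off`; because the
    command goes to the management controller, a node that is alive
    but network-partitioned cannot refuse it.
    """
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", bmc_host, "-U", user, "-P", password,
           "chassis", "power", "off"]
    if dry_run:
        return cmd  # inspect the command without actually fencing
    subprocess.run(cmd, check=True)
    return cmd
```

In production this call would typically be wrapped by a cluster manager (e.g. a Pacemaker fence agent) rather than invoked directly.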
Split-brain in practice: MongoDB replica sets
A MongoDB replica set partitions: primary (old) is isolated, secondary + arbiter form a quorum of 2 (from 3) and elect a new primary. Writes to the old primary succeed locally but will be rolled back when the partition heals. MongoDB handles this with its oplog rollback mechanism — but data written to the old primary during the partition is lost. This is why `writeConcern: majority` is critical for durability guarantees.
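The rollback behavior can be illustrated with a toy log model (plain Python, not MongoDB's actual oplog machinery): on rejoining, the old primary truncates back to the last entry it shares with the new primary's history, and everything after that point is lost.

```python
# Toy model of rollback after a partition heals: writes the old primary
# accepted locally but never replicated to a majority are discarded.
committed_log = [("set", "a", 1), ("set", "b", 2)]    # majority-replicated
old_primary_log = committed_log + [("set", "b", 99)]  # local-only write

def heal(old_primary_log, committed_log):
    """Truncate the old primary back to the common prefix and adopt
    the new primary's history; the divergent suffix is rolled back."""
    common = 0
    while (common < len(old_primary_log) and common < len(committed_log)
           and old_primary_log[common] == committed_log[common]):
        common += 1
    rolled_back = old_primary_log[common:]
    return committed_log, rolled_back

log, lost = heal(old_primary_log, committed_log)
```

Here `("set", "b", 99)` is exactly the kind of write that `writeConcern: majority` would have refused to acknowledge, which is why that setting prevents silent data loss.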
Interview Tip
When discussing replication or database clusters in interviews, proactively raise split-brain. Explain quorum (majority) as the prevention mechanism, fencing tokens as the backstop against stale leaders, and note that odd cluster sizes are not arbitrary — they maximize fault tolerance per node. If asked about MongoDB, describe `writeConcern: majority` as the safeguard that ensures writes are durable even if the primary fails.