This article serves as a comprehensive guide to consensus protocols in distributed systems, explaining fundamental concepts, challenges, and core terminology. It delves into the 'three evils' of distributed computing (asynchrony, partial failures, network partitions) and the theoretical impossibilities like the Two Generals Problem, Byzantine Generals Problem, and FLP Impossibility Theorem. The article outlines how real-world protocols like Paxos and Raft address these challenges to ensure safety, liveness, and fault tolerance.
Read original on Dev.to #systemdesignConsensus is the critical problem of getting a group of distributed processes to agree on a single value or sequence of values, even amidst failures, message delays, or network partitions. This agreement is fundamental for distributed systems to function correctly, ensuring consistency for tasks like leader election, shared log updates, distributed transaction outcomes, and cluster configuration management. The inherent difficulty stems from the 'three evils' of distributed computing: Asynchrony (no global clock, arbitrary message delays), Partial failures (some nodes fail while others operate), and Network partitions (network splits into isolated groups). Any robust consensus protocol must operate correctly despite all three simultaneously.
Several theoretical results highlight the inherent challenges of consensus:
┌─────────────────────────────────────────────────────────────┐
│ CONSENSUS PROPERTIES │
│ │
│ ┌──────────┐ ┌──────────┐ ┌─────────────────────┐ │
│ │ SAFETY │ │ LIVENESS │ │ FAULT TOLERANCE │ │
│ │ │ │ │ │ │ │
│ │ "Nothing │ │ "Something│ │ "Works despite F │ │
│ │ bad │ │ good │ │ node failures" │ │
│ │ ever │ │ eventually │ │ │ │
│ │ happens"│ │ happens"│ │ CFT: F < N/2 │ │
│ │ │ │ │ │ BFT: F < N/3 │ │
│ └──────────┘ └──────────┘ └─────────────────────┘ │
│ │
│ + Agreement: All correct nodes decide the same value │
│ + Validity: The decided value was proposed by some node │
│ + Termination: All correct nodes eventually decide │
└─────────────────────────────────────────────────────────────┘To achieve consensus, protocols typically rely on several core mechanisms:
Real-World Stakes
Without robust consensus, systems face severe issues like split-brain scenarios (two leaders), data loss, double-spend vulnerabilities, stale reads, and cluster membership chaos. Historic outages at LinkedIn (2011) and MongoDB (2012) underscore the critical importance of strong consensus.