This article outlines fundamental distributed system concepts essential for system design interviews and practical application. It covers scalability, reliability, consistency, availability, and partition tolerance, explaining how these interact and the trade-offs involved, particularly through the lens of the CAP theorem. Key distributed patterns like sharding and replication are introduced with real-world examples.
A distributed system comprises multiple computers working in concert to appear as a single, coherent system to its users. The inherent challenges lie in coordinating across network boundaries, handling inevitable failures, and managing data spread across many locations. System design interviews often assess an engineer's ability to reason through these complexities: clarifying requirements, estimating capacity, making trade-offs, and communicating architectural decisions.
CAP Theorem
The CAP theorem states that a distributed system can achieve at most two of the following three guarantees: Consistency, Availability, and Partition Tolerance. Since partitions are inevitable in real-world distributed systems, the practical choice often boils down to prioritizing either Consistency (CP) or Availability (AP) during a network partition.
CP Systems (Consistency over Availability): Prioritize consistency. If a partition occurs, the system refuses requests rather than serve potentially inconsistent data. Examples include traditional banking systems (e.g., ensuring no overdrafts) and MongoDB's default behavior, where a primary that is isolated from a majority of its replica set steps down and stops accepting writes.
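The CP write path can be sketched as a coordinator that requires a majority of replica acknowledgments before confirming a write. This is a minimal illustrative sketch, not any database's actual protocol; the class and method names (`CPCoordinator`, `FakeReplica`, `is_reachable`) are hypothetical:

```python
class Unavailable(Exception):
    """Raised when a majority of replicas cannot be reached (the CP choice)."""

class FakeReplica:
    """Toy in-memory replica; `up=False` simulates a partitioned node."""
    def __init__(self, up=True):
        self.up = up
        self.data = {}
    def is_reachable(self):
        return self.up
    def store(self, key, value):
        self.data[key] = value

class CPCoordinator:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        majority = len(self.replicas) // 2 + 1
        acks = 0
        for replica in self.replicas:
            if replica.is_reachable():
                replica.store(key, value)
                acks += 1
        if acks < majority:
            # CP trade-off: fail the request instead of risking divergence.
            raise Unavailable(f"only {acks}/{len(self.replicas)} acks")
        return acks
```

With all three replicas reachable, a write succeeds; with two of three partitioned away, the coordinator raises `Unavailable` instead of accepting a write that a majority never saw.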
AP Systems (Availability over Consistency): Prioritize availability. During a partition, both sides of the system continue serving requests, reconciling conflicts later. Social media feeds (e.g., Instagram posts appearing eventually) and Amazon DynamoDB for shopping carts are examples, where momentary staleness is acceptable for continuous service.
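The shopping-cart example can be sketched as a union merge after the partition heals: both sides keep accepting updates, and reconciliation preserves additions from each. This is an assumed, simplified illustration (not DynamoDB's actual replication protocol), and the function name is hypothetical:

```python
def merge_carts(*replica_carts):
    """Union merge: items added on either side of a partition survive.

    Note the classic trade-off: a naive union can also resurrect items
    deleted on only one side, which is why real systems track causality
    (e.g., with vector clocks) rather than merging raw sets.
    """
    merged = set()
    for cart in replica_carts:
        merged |= cart
    return merged

# During a partition, each side serves and mutates its local copy:
side_a = {"book", "laptop"}       # user added "laptop" on side A
side_b = {"book", "headphones"}   # user added "headphones" on side B

# After the partition heals, the replicas reconcile:
reconciled = merge_carts(side_a, side_b)
```

Both additions survive in `reconciled`, illustrating why momentary staleness is acceptable here: no write is rejected, and conflicts are resolved after the fact.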
Many modern systems offer tunable consistency, allowing architects to select the appropriate consistency level per operation (e.g., Cassandra's `QUORUM` for stronger consistency or `ONE` for higher availability). This highlights the importance of understanding specific application requirements to make informed consistency trade-offs.
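The arithmetic behind tunable consistency is the quorum overlap rule: with N replicas, a read of R replicas and a write of W replicas are guaranteed to intersect whenever R + W > N. This sketch expresses the general rule (it is not Cassandra-specific code, and the function name is hypothetical):

```python
def is_strongly_consistent(n, r, w):
    """True if read and write quorums must overlap on >= 1 replica.

    Overlap guarantees every read sees at least one replica holding the
    latest acknowledged write; without overlap, stale reads are possible.
    """
    return r + w > n

N = 3  # replication factor
assert is_strongly_consistent(N, r=2, w=2)      # QUORUM reads + QUORUM writes overlap
assert not is_strongly_consistent(N, r=1, w=1)  # ONE + ONE may read stale data
```

Lowering R or W below the overlap threshold buys latency and availability at the cost of possible staleness, which is exactly the per-operation trade-off tunable consistency exposes.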