This podcast explores how real-world incidents offer invaluable insights into software system behavior and resilience, often surpassing the utility of synthetic fault injection tools like Chaos Monkey. It emphasizes the importance of learning from unexpected failure modes and of uncovering systemic flaws rather than blaming individuals. The discussion highlights that adding reliability can inadvertently increase system complexity, leading to new types of failures.
The conversation begins by distinguishing between automated fault injection tools and the knowledge gained from mitigating complex, real-world failures. While tools like Chaos Monkey are effective for introducing basic robustness and for regression testing, they are limited: they typically focus on single failure modes (e.g., instance termination, RPC failures) and cannot replicate the complex confluence of events that characterizes most real incidents.
Chaos Engineering: Purpose and Pitfalls
Chaos Monkey and similar tools are excellent for forcing architects to design for known failure modes (e.g., ensuring statelessness, cluster redundancy) and for regression testing. However, they are generally insufficient for discovering unknown failure modes or the intricate interactions that cause large-scale outages. Real incidents are often too messy and multi-faceted to be easily reproducible synthetically.
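To make the "single failure mode" limitation concrete, here is a minimal sketch of the kind of fault injection such tools perform, applied to a flaky RPC. All names (`inject_faults`, `get_user`, `get_user_with_fallback`) are hypothetical, for illustration only, and this is not the actual Chaos Monkey API:

```python
import random

def inject_faults(failure_rate, exc=ConnectionError):
    """Wrap a callable so it fails at random, simulating a single
    known failure mode (an RPC that intermittently errors out).
    This is an illustrative sketch, not a real chaos-engineering tool."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"injected fault calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.3)
def get_user(user_id):
    # Stand-in for a remote call; hypothetical example service.
    return {"id": user_id, "name": "example"}

def get_user_with_fallback(user_id):
    # The caller's resilience logic is what the injected fault exercises.
    try:
        return get_user(user_id)
    except ConnectionError:
        return {"id": user_id, "name": "unknown"}  # degraded response
```

Note what this tests and what it does not: the wrapper exercises one known, pre-declared failure mode in isolation, which is useful for regression testing a fallback path, but it cannot surface the unknown, multi-factor interactions described above.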
The most effective way for architects to learn and improve system design is by engaging directly with real incident reviews and postmortems. These events provide deep insights into how systems operate under stress, how they are actually used (often in unexpected ways), and how evolving architectures can introduce new vulnerabilities. Blaming individuals is counterproductive; instead, the focus should be on uncovering systemic flaws and understanding the rational decisions people made with limited information.
A critical insight is that increasing system reliability can paradoxically lead to increased complexity, which in turn can introduce new, unknown failure modes. While engineers are proficient at protecting against known failure patterns, building systems resilient to *unknown* failure modes and changes in the external environment or evolving design remains a significant challenge. This requires a holistic view, contrasting with traditional software engineering's focus on subsystems.
The discussion touches on the underappreciated role of organizational complexity in understanding software failures and the build vs. buy decision. It also highlights the need to better disseminate the principles of software reliability engineering, suggesting storytelling as a potential method for sharing knowledge and fostering a culture of learning from failures across an organization.