This article critically examines traditional change management processes, arguing that they often fail to mitigate risk in modern, high-velocity IT environments. It uses the Swedbank outage as a case study to highlight how compliance-driven change controls can be ineffective and even detrimental to system stability. The piece advocates for a shift towards technical risk reduction through smaller, more frequent changes, automation, and enhanced observability, aligning with DevOps principles.
Read original on High Scalability
The Swedbank outage, caused by an unapproved change, is a stark example of how traditional change management processes can fail. Despite regulation and significant fines, relying on manual approvals and Change Advisory Boards (CABs) does not inherently reduce operational risk. Research indicates that CABs frequently approve over 90% of changes without significantly improving system stability. In some cases, external approvals even correlate negatively with lead time, deployment frequency, and time to restore service, without improving change fail rates.
The Illusion of Control
Traditional change management often creates an illusion of control by focusing on process adherence and documentation, rather than actual risk mitigation. This can lead to a false sense of security where documented but risky changes pass through unnoticed, while undocumented changes pose unaddressed threats.
Instead of relying on bureaucratic change controls, effective risk management in system design and operations makes the changes themselves less risky. Key strategies include smaller, more frequent changes, automation, and enhanced observability.
These practices are central to a DevOps culture, where the speed of software delivery is harmonized with robust cybersecurity, audit, and compliance demands. The goal is to embed quality and safety into the delivery pipeline rather than gatekeeping at the end.
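As a rough illustration of building safety into the pipeline instead of gatekeeping at the end, the sketch below evaluates each change automatically. Everything in it is hypothetical and not taken from the article: the `Change` record, the `evaluate_change` function, and the 200-line size limit are assumptions chosen only to show the shape of such a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a proposed change, produced by the pipeline."""
    change_id: str
    lines_changed: int
    tests_passed: bool


def evaluate_change(change: Change, max_lines: int = 200) -> list[str]:
    """Return the reasons a change is blocked; an empty list means it may ship.

    The 200-line default is an arbitrary illustrative threshold, not a
    recommendation from the article.
    """
    problems = []
    if not change.tests_passed:
        problems.append("automated tests failed")
    if change.lines_changed > max_lines:
        problems.append(
            f"change too large ({change.lines_changed} > {max_lines} lines); split it"
        )
    return problems


if __name__ == "__main__":
    for change in (Change("c-1", 40, True), Change("c-2", 900, False)):
        reasons = evaluate_change(change)
        status = "ship" if not reasons else "blocked: " + "; ".join(reasons)
        print(change.change_id, status)
```

The point of the design is that the check runs on every change and leaves a machine-readable result, so small, well-tested changes flow through without waiting for a meeting while oversized or failing ones are stopped with a stated reason.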
The article draws parallels between the Swedbank incident and the Knight Capital incident, both highlighting how insufficient observability and traceability of changes in production systems prolonged outages. Runtime monitoring is crucial not just for detecting incidents, but also for identifying unauthorized or problematic changes that bypassed traditional controls. Without it, the full scope of changes and their potential risks remains unknown.
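One way to picture runtime monitoring for unrecorded changes is to compare what is actually running against what the deployment record says should be running. The sketch below is a minimal illustration under stated assumptions: the `approved` and `running` mappings, the `Finding` type, and the `detect_unrecorded_changes` function are hypothetical, and a real system would gather digests from its own runtime and deployment tooling.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical result describing one unexplained runtime artifact."""
    service: str
    reason: str


def fingerprint(artifact: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(artifact).hexdigest()


def detect_unrecorded_changes(
    approved: dict[str, set[str]], running: dict[str, str]
) -> list[Finding]:
    """Flag running artifacts whose digest has no matching deployment record.

    approved maps a service name to the digests recorded for it;
    running maps a service name to the digest observed in production.
    """
    findings = []
    for service, digest in sorted(running.items()):
        known = approved.get(service)
        if known is None:
            findings.append(Finding(service, "no recorded deployment"))
        elif digest not in known:
            findings.append(Finding(service, "running digest not in deployment record"))
    return findings


if __name__ == "__main__":
    approved = {"payments": {fingerprint(b"payments-v1")}}
    running = {
        "payments": fingerprint(b"payments-v2"),
        "ledger": fingerprint(b"ledger-v1"),
    }
    for finding in detect_unrecorded_changes(approved, running):
        print(finding.service, "->", finding.reason)
```

A check like this does not depend on anyone having filed a change request, which is why it can surface changes that bypassed the approval process entirely.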