This article details Slack's Deploy Safety Program, an initiative focused on systematically reducing customer impact from software deployments across hundreds of internal services. It highlights the architectural and process changes implemented to achieve significant reductions in incident severity and duration, emphasizing the shift from individual system fixes to a holistic, metric-driven approach for enhanced reliability and continued development velocity.
Read original on Slack Engineering
As Slack became increasingly mission-critical for its customers, it faced a significant challenge: 73% of customer-facing incidents were triggered by Slack-induced changes, particularly code deploys. The problem was exacerbated by a diverse ecosystem of hundreds of microservices with varying deployment systems and practices. Previous reliability efforts often targeted individual deploy systems, which led to slow, manual processes that hindered innovation and hurt engineering morale. Slack needed a programmatic, systematic approach to improving deployment safety across the entire engineering organization without sacrificing development velocity.
To address these issues, Slack defined ambitious "North Star" goals for its highest-importance services, which later evolved into a comprehensive Deploy Safety Manifesto applicable to all systems. These goals focused on reducing the duration and severity of customer impact from deploys while maintaining development velocity.
The Deploy Safety Metric
A crucial aspect was defining a metric to measure success: "Hours of customer impact from high severity and selected medium severity change-triggered incidents." This metric aimed to be an analog for customer sentiment, even if imperfect, and required careful filtering and ongoing validation to ensure it accurately reflected real customer experience and program effectiveness.
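To make the metric definition concrete, here is a minimal Python sketch of how such a number could be computed from incident records. The `Incident` fields, the severity labels, and the `selected` flag (standing in for the manual choice of which medium-severity incidents count) are all hypothetical; the article does not describe Slack's actual data model or filtering rules.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    severity: str           # hypothetical labels: "high", "medium", "low"
    change_triggered: bool  # True if a change such as a deploy caused it
    selected: bool          # hypothetical flag for hand-picked medium incidents
    impact_start: datetime
    impact_end: datetime


def counts_toward_metric(incident: Incident) -> bool:
    """Keep change-triggered incidents that are high severity,
    or medium severity and explicitly selected."""
    if not incident.change_triggered:
        return False
    if incident.severity == "high":
        return True
    return incident.severity == "medium" and incident.selected


def customer_impact_hours(incidents: list[Incident]) -> float:
    """Sum the impact duration, in hours, of all qualifying incidents."""
    total_seconds = sum(
        (i.impact_end - i.impact_start).total_seconds()
        for i in incidents
        if counts_toward_metric(i)
    )
    return total_seconds / 3600.0
```

The filtering step is where the "careful filtering and ongoing validation" would live in practice: which incidents qualify is a judgment call that has to be revisited as the program evolves.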
Slack adopted a flexible investment strategy, biasing for action and focusing initially on areas of known pain, particularly the webapp backend. Projects aimed at earlier detection, improved automatic and manual remediation, and reduced issue severity through isolation boundaries. A key architectural shift was investment in automatic, metrics-based deployments and rollbacks, along with unifying the diverse deployment systems under centralized orchestration, inspired by patterns like AWS Pipelines.
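The idea of a metrics-based deployment with automatic rollback can be sketched as a staged rollout loop. This is an illustration under stated assumptions, not Slack's implementation: the function name `staged_deploy`, the callbacks `deploy_to`, `error_rate`, and `rollback`, the fleet-percentage stages, and the 1% error threshold are all hypothetical.

```python
from typing import Callable, Sequence


def staged_deploy(
    stages: Sequence[int],
    deploy_to: Callable[[int], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,  # hypothetical health threshold
) -> bool:
    """Roll a release out through increasing fleet percentages.

    After each stage the health metric is sampled once. If it exceeds
    the threshold, the release is rolled back and the rollout stops,
    which limits how much of the fleet a bad change can reach.
    Returns True if every stage completed, False if rolled back.
    """
    for percent in stages:
        deploy_to(percent)
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True
```

A caller might pass `stages=[1, 10, 50, 100]` along with callbacks that talk to its own deploy tooling and metrics store. A real orchestrator would also wait for a bake period and sample several metrics at each stage; the single reading here only keeps the sketch short.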