Slack Engineering · October 7, 2025

Improving Deployment Safety at Slack: Reducing Customer Impact from Change

This article details Slack's Deploy Safety Program, an initiative to systematically reduce customer impact from software deployments across hundreds of internal services. It highlights the architectural and process changes that significantly reduced incident severity and duration, and emphasizes the shift from fixing individual systems to a holistic, metric-driven approach that improves reliability while preserving development velocity.

Read original on Slack Engineering

The Challenge of Change-Triggered Incidents

As Slack became increasingly mission-critical for its customers, it faced a significant challenge: 73% of customer-facing incidents were triggered by Slack-induced changes, particularly code deploys. A diverse ecosystem of hundreds of microservices, along with varying deployment systems and practices, made the problem worse. Earlier reliability efforts often targeted individual deploy systems and led to manual, slow processes that hindered innovation and hurt engineering morale. Slack needed a programmatic, systematic approach to improving deployment safety across the entire engineering organization without sacrificing development velocity.

North Star Goals and the Deploy Safety Manifesto

To address these issues, Slack defined ambitious "North Star" goals for its highest-importance services, which later evolved into a comprehensive Deploy Safety Manifesto applicable to all systems. These goals focused on:

  • Reducing impact time: automated detection and remediation within 10 minutes; manual detection and remediation within 20 minutes.
  • Reducing severity of impact: detecting problematic deployments before they affect 10% of the fleet (blast radius control).
  • Maintaining development velocity: Ensuring safety improvements do not slow down the pace of innovation.
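
The first two goals are numeric thresholds, so they can be checked mechanically. The following is a minimal sketch, not Slack's implementation: the `DeployIssue` fields and the `meets_targets` function are hypothetical names, and it assumes remediation time is measured from the start of customer impact, which the summary does not specify.

```python
from dataclasses import dataclass

# Thresholds taken from the goals listed above.
AUTOMATED_REMEDIATION_MINUTES = 10
MANUAL_REMEDIATION_MINUTES = 20
MAX_FLEET_FRACTION_AT_DETECTION = 0.10


@dataclass
class DeployIssue:
    """Hypothetical record of one problematic deployment."""
    automated: bool                     # True if detection and remediation were automated
    minutes_to_remediate: float         # assumed: measured from start of customer impact
    fleet_fraction_at_detection: float  # share of the fleet on the bad build when detected


def meets_targets(issue: DeployIssue) -> bool:
    """Return True if the issue stayed within the impact-time and blast-radius targets."""
    limit = (
        AUTOMATED_REMEDIATION_MINUTES
        if issue.automated
        else MANUAL_REMEDIATION_MINUTES
    )
    within_time = issue.minutes_to_remediate <= limit
    # "Before they affect 10% of the fleet" is read here as strictly below 10%.
    within_blast_radius = (
        issue.fleet_fraction_at_detection < MAX_FLEET_FRACTION_AT_DETECTION
    )
    return within_time and within_blast_radius
```

The third goal, maintaining development velocity, has no stated threshold, so the sketch leaves it out.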

The Deploy Safety Metric

A crucial step was defining a metric to measure success: "Hours of customer impact from high severity and selected medium severity change-triggered incidents." The metric was meant to serve as a proxy for customer sentiment, even if an imperfect one, and it required careful filtering and ongoing validation to ensure that it accurately reflected real customer experience and program effectiveness.
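
As a rough illustration of how such a metric could be computed from incident records, here is a minimal sketch. The `Incident` fields, the severity labels, and the `selected` flag (standing in for whatever filtering picks out the "selected" medium-severity incidents) are all assumptions; the summary does not describe Slack's actual data model or selection criteria.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record."""
    severity: str           # assumed labels: "high", "medium", "low"
    change_triggered: bool  # True if a Slack-induced change caused the incident
    selected: bool          # assumed flag marking medium-severity incidents that count
    impact_hours: float     # hours of customer impact


def deploy_safety_hours(incidents: list[Incident]) -> float:
    """Sum impact hours over high and selected medium change-triggered incidents."""
    total = 0.0
    for incident in incidents:
        if not incident.change_triggered:
            continue  # only change-triggered incidents count toward the metric
        counts = incident.severity == "high" or (
            incident.severity == "medium" and incident.selected
        )
        if counts:
            total += incident.impact_hours
    return total
```

Keeping the filter explicit makes it easy to revisit which medium-severity incidents are included as the metric is validated against customer experience.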

Investment Strategy and Architectural Interventions

Slack adopted a flexible investment strategy that biased for action and focused initially on areas of known pain, particularly the webapp backend. Projects aimed at earlier detection, improved automatic and manual remediation, and reduced issue severity through isolation boundaries. A key architectural shift was the investment in automatic, metrics-based deployments and rollbacks, together with unifying diverse deployment systems under centralized orchestration, inspired by patterns like AWS Pipelines.
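
To make the idea of metrics-based deployment and rollback concrete, the sketch below shows one generic way a staged rollout with a health gate might work. It is not Slack's system: the `staged_deploy` function, the stage fractions, the `error_rate_at` callback, and the error-rate threshold are all hypothetical, and a real orchestrator would also wait between stages and perform the rollback itself.

```python
from typing import Callable, Sequence


def staged_deploy(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> tuple[str, float]:
    """Advance through fleet fractions and stop at the first unhealthy stage.

    stages: increasing fleet fractions to deploy to, e.g. (0.01, 0.05, 1.0).
    error_rate_at: hypothetical callback returning the observed error rate
        once the build is running on the given fraction of the fleet.
    Returns (outcome, fleet fraction reached).
    """
    reached = 0.0
    for fraction in stages:
        reached = fraction
        if error_rate_at(fraction) > max_error_rate:
            # A real system would trigger the rollback here.
            return ("rolled_back", reached)
    return ("completed", reached)
```

With stages such as `(0.01, 0.05, 0.25, 1.0)`, a regression that shows up in the metrics at 5% of the fleet stops the rollout there, which is how early, small stages keep a bad deploy below the 10% blast-radius target.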

deployment safety, continuous delivery, reliability engineering, incident management, rollback, metrics, microservices, blast radius
