Slack Engineering·October 7, 2025

Improving Deployment Safety at Slack: Reducing Customer Impact from Change

This article details Slack's Deploy Safety Program, an initiative focused on systematically reducing customer impact from software deployments across hundreds of internal services. It highlights the architectural and process changes implemented to achieve significant reductions in incident severity and duration, emphasizing the shift from individual system fixes to a holistic, metric-driven approach for enhanced reliability and continued development velocity.

DevOps & SRE · Performance & Scaling · Distributed Systems


The Challenge of Change-Triggered Incidents

As Slack became increasingly mission-critical for its customers, it faced a significant challenge: 73% of customer-facing incidents were triggered by changes Slack itself made, particularly code deploys. The problem was compounded by a diverse ecosystem of hundreds of microservices with varying deployment systems and practices. Earlier reliability efforts often targeted individual deploy systems, which led to manual, slow processes that hindered innovation and hurt engineering morale. Slack needed a programmatic, systematic approach that would improve deployment safety across the entire engineering organization without sacrificing development velocity.

North Star Goals and the Deploy Safety Manifesto

To address these issues, Slack defined ambitious "North Star" goals for its highest-importance services, which later evolved into a comprehensive Deploy Safety Manifesto applicable to all systems. These goals focused on:

Reducing impact time: Automated detection and remediation within 10 minutes, and manual detection and remediation within 20 minutes.
Reducing severity of impact: Detecting problematic deployments before they affect 10% of the fleet (blast radius control).
Maintaining development velocity: Ensuring safety improvements do not slow down the pace of innovation.
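
The two measurable goals above can be expressed as simple checks. The following is a minimal sketch, not Slack's implementation: the `DeployIncident` record and its field names are hypothetical, and it assumes the 10 and 20 minute limits are measured from the start of customer impact to full remediation, which the summary does not specify.

```python
from dataclasses import dataclass
from datetime import timedelta

# Thresholds taken from the goals listed above.
AUTOMATED_LIMIT = timedelta(minutes=10)
MANUAL_LIMIT = timedelta(minutes=20)
MAX_FLEET_FRACTION = 0.10


@dataclass
class DeployIncident:
    """Hypothetical record of one problematic deployment."""
    impact_duration: timedelta          # assumed: impact start to full remediation
    automated: bool                     # True if detection and remediation were automated
    fleet_fraction_at_detection: float  # share of the fleet running the bad change, 0.0 to 1.0


def meets_impact_time_goal(incident: DeployIncident) -> bool:
    """Check the impact-time goal against the automated or manual limit."""
    limit = AUTOMATED_LIMIT if incident.automated else MANUAL_LIMIT
    return incident.impact_duration <= limit


def meets_blast_radius_goal(incident: DeployIncident) -> bool:
    """Check that the problem was detected before it reached 10% of the fleet."""
    return incident.fleet_fraction_at_detection < MAX_FLEET_FRACTION


if __name__ == "__main__":
    incident = DeployIncident(timedelta(minutes=8), automated=True,
                              fleet_fraction_at_detection=0.05)
    print(meets_impact_time_goal(incident), meets_blast_radius_goal(incident))
```

The third goal, development velocity, has no threshold in the text, so it is left out of the sketch.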


The Deploy Safety Metric

A crucial step was defining a metric to measure success: "Hours of customer impact from high severity and selected medium severity change-triggered incidents." The metric was intended as a proxy for customer sentiment, even if an imperfect one, and it required careful filtering and ongoing validation to ensure it accurately reflected real customer experience and the program's effectiveness.
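
As an illustration of how such a metric could be computed, here is a minimal sketch under stated assumptions. The `Incident` record, the severity labels, and the `selected` flag (standing in for whatever filtering picks the "selected" medium-severity incidents) are all hypothetical; the summary does not describe Slack's actual data model or selection criteria.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    severity: str           # "high", "medium", or "low"
    change_triggered: bool  # True if a Slack-made change caused the incident
    selected: bool          # True if a medium-severity incident was chosen for the metric
    impact_start: datetime
    impact_end: datetime


def counts_toward_metric(incident: Incident) -> bool:
    """Keep high-severity and selected medium-severity change-triggered incidents."""
    if not incident.change_triggered:
        return False
    if incident.severity == "high":
        return True
    return incident.severity == "medium" and incident.selected


def customer_impact_hours(incidents: Iterable[Incident]) -> float:
    """Sum the impact duration, in hours, of incidents that count toward the metric."""
    total_seconds = sum(
        (i.impact_end - i.impact_start).total_seconds()
        for i in incidents
        if counts_toward_metric(i)
    )
    return total_seconds / 3600.0
```

Keeping the filter in its own function makes it easy to revisit which incidents count, which matters given that the metric needs ongoing validation.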

Investment Strategy and Architectural Interventions

Slack adopted a flexible investment strategy, biasing for action and focusing first on areas of known pain, particularly the webapp backend. Projects aimed at earlier detection, better automatic and manual remediation, and reduced issue severity through isolation boundaries. A key architectural shift was investment in automatic, metrics-based deployments and rollbacks, with diverse deployment systems unified under centralized orchestration, inspired by patterns like AWS Pipelines.
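
The idea of a metrics-based deployment with automatic rollback can be sketched as a staged rollout loop. This is an illustrative sketch only, not Slack's orchestrator: the function name `staged_rollout` and the `deploy_to`, `is_healthy`, and `rollback` callbacks are hypothetical stand-ins for whatever a real deploy system and monitoring stack provide.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy_to: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to increasing fleet fractions; roll back on the first failed health check.

    Returns True if every stage passed, False if a rollback was triggered.
    """
    for fraction in stages:
        deploy_to(fraction)
        # A real system would wait for a bake period here so metrics can accumulate.
        if not is_healthy():
            rollback()
            return False
    return True


if __name__ == "__main__":
    events = []

    def deploy_to(fraction: float) -> None:
        events.append(f"deploy {fraction:.0%}")

    def is_healthy() -> bool:
        # Simulated metric check: the change misbehaves once it reaches 5% of the fleet.
        return events[-1] != "deploy 5%"

    def rollback() -> None:
        events.append("rollback")

    ok = staged_rollout([0.01, 0.05, 0.25, 1.0], deploy_to, is_healthy, rollback)
    print(ok, events)
```

Choosing early stages that stay below 10% of the fleet is what lets a loop like this catch a bad change within the blast-radius goal described earlier.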

deployment safety, continuous delivery, reliability engineering, incident management, rollback, metrics, microservices, blast radius


Architecture Design

Design this yourself

Design a deployment system for a large-scale, mission-critical SaaS platform (like Slack, with hundreds of microservices and diverse deployment methods) that prioritizes "deploy safety." Focus on architectural elements and practices that enable rapid detection and automatic remediation of change-triggered incidents within 10 minutes, limit blast radius to less than 10% of the fleet, and maintain high development velocity. Describe the monitoring, rollback mechanisms, and organizational alignment needed.


Other design angles

· Design a robust incident management system for a complex distributed environment, emphasizing rapid detection, classification, and automated remediation workflows for deployment-related issues.
· Architect a comprehensive CI/CD pipeline that integrates advanced deploy safety mechanisms, including canary deployments, automated rollbacks based on real-time metrics, and pre-deployment verification for a microservices architecture.
· Develop a strategy for migrating an existing legacy deployment system with manual processes to a unified, automated, metrics-driven deployment platform across multiple teams and service types, focusing on change management and adoption.