Incident response: our postmortem process that actually prevents recurrence
Sofia Zhang
Over the last couple of years we've put real work into our incident response and postmortem process, and it has paid off. Our postmortems used to be little more than a rehash of events, often with finger-pointing, and the action items would languish. That's changed.
Our current process is strictly blameless. The focus is entirely on system failures and process improvements, never on individual mistakes. Every postmortem includes clear action items, each with a designated owner and a realistic deadline. These aren't just one-off tasks; they're tracked in a dedicated system and reviewed monthly during our engineering leadership syncs. We also classify incidents by root cause and service, which helps us spot systemic issues.
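To make the tracking and classification concrete, here is a minimal Python sketch of the kind of record this process implies. It is an illustration, not our actual tracking system: the class names, field names, helper functions, and sample incidents (`ActionItem`, `Postmortem`, `repeat_causes`, `overdue_items`, `INC-1` and so on) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    # Hypothetical shape: every action item has an owner and a deadline.
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class Postmortem:
    # Hypothetical shape: each incident is classified by service and root cause.
    incident_id: str
    service: str
    root_cause: str
    action_items: list[ActionItem] = field(default_factory=list)


def repeat_causes(postmortems):
    """Return {(service, root_cause): count} for pairs seen in more than one incident."""
    counts = Counter((p.service, p.root_cause) for p in postmortems)
    return {key: n for key, n in counts.items() if n > 1}


def overdue_items(postmortems, today):
    """Return (incident_id, item) pairs for open action items past their deadline."""
    return [
        (p.incident_id, item)
        for p in postmortems
        for item in p.action_items
        if not item.done and item.due < today
    ]


# Made-up sample data, for illustration only.
postmortems = [
    Postmortem("INC-1", "checkout", "config drift",
               [ActionItem("Add config validation to deploys", "alice", date(2024, 3, 1))]),
    Postmortem("INC-2", "checkout", "config drift",
               [ActionItem("Alert on config divergence", "bob", date(2024, 5, 1), done=True)]),
    Postmortem("INC-3", "search", "capacity"),
]

print(repeat_causes(postmortems))
print(overdue_items(postmortems, today=date(2024, 4, 1)))
```

A monthly review could start from these two views: which service and root-cause pairs keep coming back, and which open action items have slipped past their deadline.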
The result? Repeat incidents with the same underlying cause in our critical services are down roughly 60%. It takes a lot of discipline, but knowing that an incident will lead to concrete, tracked improvements, and not just another fire drill, has boosted team morale, and our systems are more stable for it.