On-call burnout: strategies for sustainable incident management

On-call burnout is a serious problem, and our small team has felt it hard. With only 8 engineers sharing the rotation, each person was getting 5-10 pages a week, at least 2-3 of them outside working hours. Two engineers left last quarter, both citing burnout as a major factor, which was a huge wake-up call.

Since then we've made a concerted effort to reduce the noise. We found that about 85% of our alerts were non-actionable or redundant, so we spent time tuning thresholds, consolidating similar alerts, and improving alert routing. Critical alerts now go directly to Slack, and less urgent issues follow a clearer escalation path before PagerDuty gets involved.

We're also investing heavily in better runbooks, automating common fixes, and building more self-healing capabilities into our services. The goal is to shift from reactive incident management to proactive prevention and automation. It's still a work in progress, but page volume has already dropped by half, and team morale is noticeably better. It's a marathon, not a sprint.
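
To make the consolidation and routing idea above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `Alert` fields, the severity names, and the destination labels (`"slack"`, `"escalation"`, `"drop"`) are hypothetical and not taken from any real PagerDuty or Slack configuration.

```python
from dataclasses import dataclass

# Hypothetical severity ordering, used only for this illustration.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    service: str      # which service fired the alert
    check: str        # which check or metric triggered it
    severity: str     # one of the keys in SEVERITY_RANK
    actionable: bool  # can a human actually do something about it?


def consolidate(alerts):
    """Collapse alerts sharing a (service, check) pair, keeping the most severe."""
    merged = {}
    for alert in alerts:
        key = (alert.service, alert.check)
        current = merged.get(key)
        if current is None or SEVERITY_RANK[alert.severity] > SEVERITY_RANK[current.severity]:
            merged[key] = alert
    return list(merged.values())


def route(alert):
    """Pick a destination label for one alert (labels are assumptions)."""
    if not alert.actionable:
        return "drop"        # non-actionable noise never reaches a person
    if alert.severity == "critical":
        return "slack"       # critical alerts go straight to the channel
    return "escalation"      # less urgent: escalation path before paging


if __name__ == "__main__":
    incoming = [
        Alert("api", "latency", "warning", True),
        Alert("api", "latency", "critical", True),
        Alert("db", "disk", "info", False),
    ]
    for alert in consolidate(incoming):
        print(alert.service, alert.check, "->", route(alert))
```

The point of the design is that deduplication happens before routing, so a burst of similar alerts produces at most one notification per service and check.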