SLOs/SLAs in practice: how to set realistic targets
Ingrid Ahmed
Setting realistic SLOs and SLAs is crucial, but it's often more art than science. A 99.9% uptime SLO allows about 8.8 hours of downtime per year. Bump that to 99.99%, and you're down to under 53 minutes. Each extra 'nine' cuts the downtime budget tenfold, and the engineering effort and cost to get there usually rise far faster than linearly. We're trying to refine our SLO definitions.
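For anyone who wants to check the downtime numbers, here is a minimal sketch of the arithmetic. It assumes a 365-day year and treats the SLO as a simple fraction of wall-clock time; the function name `allowed_downtime_minutes` is just illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget (in minutes) implied by an uptime SLO."""
    return period_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # Print the yearly budget for a few common targets.
    for slo in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Passing a different `period_minutes` (for example 30 * 24 * 60) gives the budget for a rolling 30-day window, which is often the more practical number to track.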
The challenge is often defining what 'downtime' or 'unavailability' actually means from a user's perspective. Is it any API error? Slow response times? A partial outage? And how do you factor in maintenance windows? We're also working on setting targets that are ambitious but achievable, avoiding the trap of aiming for four or five nines when the business doesn't require it and the engineering cost is prohibitive.
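To make those questions concrete, here is one possible way to turn a definition of 'unavailable' into a number. This is only a sketch under assumptions: the `Request` record, the 500 ms latency threshold, and the choice to count only 5xx responses as errors are all hypothetical, and whether maintenance traffic should be excluded is exactly the kind of decision that needs agreeing with stakeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    status: int                   # HTTP status code
    latency_ms: float             # time to serve the request
    in_maintenance: bool = False  # served during an announced window


def availability_sli(requests: Iterable[Request],
                     latency_threshold_ms: float = 500.0,
                     exclude_maintenance: bool = True) -> float:
    """Fraction of counted requests that were 'good'.

    A request is 'good' if it did not return a 5xx status and was served
    within the latency threshold. Both rules are assumptions to adapt.
    """
    good = 0
    total = 0
    for r in requests:
        if exclude_maintenance and r.in_maintenance:
            continue  # not counted for or against the SLO
        total += 1
        if r.status < 500 and r.latency_ms <= latency_threshold_ms:
            good += 1
    # With no counted traffic, report 1.0 (assumption: no traffic, no violation).
    return good / total if total else 1.0
```

Counting requests rather than minutes means a partial outage only costs budget in proportion to the users it actually affected, and a slow response counts as a failure just like an error does.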
What are your best practices for defining meaningful SLOs and SLAs? How do you involve product and business stakeholders in setting these targets? And what's your approach to handling SLO violations in a constructive way, beyond just wagging fingers?