20 discussions in the community
We've really focused on improving our incident response and postmortem process over the last couple of years, and it's paid off significantly. We used...
On-call burnout is a serious problem on our team of eight engineers. We're getting 5-10 pages a week, and at least 2-3 of those are outside of busines...
We've been rolling out OpenTelemetry across 25 of our core services for the last six months, and the results are pretty interesting. On the good side,...
We're slowly but surely trying to introduce chaos engineering principles into our production environment. Getting approval to intentionally break thin...
On-call burnout is a serious issue in our team. We're a small team of 8 engineers, and we average about 5-10 pages per week, with 2-3 of those requiri...
The 'three pillars' of observability (logs, metrics, traces) are a great foundation, but I've increasingly felt they're not enough, especially when yo...
Kafka consumer lag is one of those metrics that can quickly spiral out of control if not properly monitored and alerted on. We had an incident where a...
We had a situation last month where our Kafka consumer lag for a critical service grew to millions of messages, completely unnoticed for several hours...
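For anyone wiring up lag monitoring from scratch after reading the two posts above: a minimal sketch of computing per-partition and total consumer lag with the kafka-python client, comparing each partition's committed offset against the log-end offset. The broker address and group name are placeholders, and the choice of client library is an assumption; neither post says what they use.

```python
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"     # placeholder broker address
GROUP = "orders-consumers"       # placeholder consumer group

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
# Committed offsets for the group: {TopicPartition: OffsetAndMetadata}
committed = admin.list_consumer_group_offsets(GROUP)

# A group-less consumer, used only to read log-end offsets
# for the same partitions the group is committed on.
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
end_offsets = consumer.end_offsets(list(committed.keys()))

total = 0
for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    total += lag
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
print(f"total lag: {total}")
```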
Setting realistic SLOs and SLAs is incredibly tricky, and getting it wrong can have huge engineering implications. I've seen teams aim for 99.999% ava...
Setting realistic SLOs and SLAs is proving to be a challenge on our team. Everyone wants 'five nines' availability, but when you break down what 99.99...
On-call burnout is a serious problem, and we've felt it hard with our small team. With only 8 engineers sharing the rotation, we were getting 5-10 pag...
Chaos engineering sounds cool on paper, but getting approval to intentionally break things in production, even controlled, can feel like you're losing...
We've spent the last six months rolling out OpenTelemetry across roughly 25 core microservices, and it's been a mixed bag, honestly. On the one hand, ...
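For concreteness on what a rollout like the two described above involves per service: a minimal sketch of manual tracing setup with the OpenTelemetry Python SDK and an OTLP exporter. The service name, collector endpoint, and span/attribute names are placeholders, not anything from the posts.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service; "checkout" is a placeholder name.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each unit of work becomes a span; attributes carry request context.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "abc-123")  # example attribute
    # ... business logic ...
```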
Our team recently revamped our incident postmortem process, and it's made a huge difference in reducing repeat incidents. Historically, postmortems we...
We've had a few incidents where Kafka consumer lag grew into the millions of messages without us noticing until it was too late. Our initial alerting ...
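One common refinement when a plain absolute-lag threshold proves noisy or slow to fire (not necessarily what this team landed on) is to page only when lag is both high and still growing, i.e. the consumers are not catching up. A toy sketch, assuming a `read_total_lag()` callable like the one in the earlier snippet:

```python
import time

LAG_THRESHOLD = 100_000   # placeholder absolute threshold
WINDOW_S = 300            # compare lag readings five minutes apart

def should_page(read_total_lag):
    """Page only if lag exceeds the threshold AND grew over the window.

    A high-but-shrinking backlog usually means consumers are recovering
    on their own; a high-and-growing one means they are falling behind.
    """
    before = read_total_lag()
    time.sleep(WINDOW_S)
    after = read_total_lag()
    return after > LAG_THRESHOLD and after > before
```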
Getting approval to introduce chaos engineering into our production environment felt like pulling teeth initially. The idea of intentionally breaking ...
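One way teams make the approval conversation in the chaos posts above less scary is to start with tiny, explicitly opt-in fault injection rather than killing hosts. A hedged sketch of a latency injector gated behind an environment flag; everything here (flag name, probabilities, the `fetch_user` function) is illustrative:

```python
import functools
import os
import random
import time

def inject_latency(p=0.01, delay_s=2.0):
    """With probability p, sleep before calling the wrapped function.

    A no-op unless CHAOS_ENABLED=1 is set, so the blast radius is
    controlled by an explicit operator decision, not the deploy itself.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if os.environ.get("CHAOS_ENABLED") == "1" and random.random() < p:
                time.sleep(delay_s)
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_latency(p=0.05, delay_s=1.5)
def fetch_user(user_id):
    ...  # normal request path; occasionally 1.5s slower when enabled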
The 'three pillars of observability' (logs, metrics, and traces) are foundational, but I increasingly feel like they're not enough, especially when debu...
We recently revamped our incident postmortem process, and it's been incredibly effective at preventing recurrence. Historically, our postmortems felt ...
Our team is looking to standardize our Infrastructure as Code (IaC) tooling, and we're weighing the pros and cons of Terraform, Pulumi, and AWS CDK. T...
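For concreteness, the "general-purpose language" argument that usually comes up for Pulumi and CDK over Terraform's HCL looks like this: a minimal sketch of a versioned S3 bucket in Pulumi's Python SDK. The bucket's logical name is a placeholder, and nothing here is from the post itself.

```python
import pulumi
import pulumi_aws as aws

# Loops, conditionals, and unit tests come for free because this is
# ordinary Python; that is the main draw of Pulumi/CDK over HCL.
bucket = aws.s3.Bucket(
    "team-artifacts",  # placeholder logical name
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
)

pulumi.export("bucket_name", bucket.id)
```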
Setting realistic SLOs and SLAs is crucial, but it's often more art than science. A 99.9% uptime SLO means about 8.7 hours of downtime per year. Bump ...
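The arithmetic behind the budgets discussed in the SLO posts above is worth keeping handy, since each extra nine cuts the allowed downtime by 10x. A quick sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

for nines in ("99%", "99.9%", "99.99%", "99.999%"):
    availability = float(nines.rstrip("%")) / 100
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    print(f"{nines}: {downtime_h:.2f} h/year ({downtime_h * 60:.1f} min)")

# 99%     -> 87.60 h/year
# 99.9%   -> 8.76 h/year   (the ~8.7 hours mentioned above)
# 99.99%  -> 0.88 h/year   (~53 minutes)
# 99.999% -> 0.09 h/year   (~5.3 minutes)
```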