
Chaos engineering: introducing failure in production without losing your job

Rina Williams
·351 views
We're slowly but surely introducing chaos engineering principles into our production environment. Getting approval to intentionally break things, even in a controlled manner, has been a significant hurdle.

We started small, running "game days" in staging to build confidence within the team and with leadership. We then progressed to simple experiments in production, like randomly killing non-critical pods or injecting network latency into specific services during off-peak hours. The key was starting with very low-blast-radius experiments, with clear rollback plans and extensive monitoring in place. Along the way we've found many hidden assumptions and single points of failure that we never would have discovered otherwise.

What's been your journey with chaos engineering? How did you get organizational buy-in for running experiments in production? What were some of the most impactful discoveries you made, and what advice would you give to a team just starting out with this practice without, you know, losing their jobs?
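For anyone who wants something concrete, the guards described above (non-critical pods only, off-peak hours, a small blast radius, and a check on monitoring before starting) might be sketched roughly like this. It is only an illustration, not our actual tooling: the `Pod` class, the 01:00 to 05:00 window, the 10% cap, and the 1% error-rate threshold are all assumptions made up for the example, and the function only plans which pods to target rather than calling any real cluster API.

```python
import random
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Pod:
    name: str
    critical: bool  # hypothetical flag; in practice this might come from a label


def in_off_peak(now: time, start: time = time(1, 0), end: time = time(5, 0)) -> bool:
    """Return True if `now` falls inside the assumed off-peak window."""
    return start <= now < end


def pick_targets(pods, max_fraction=0.1, rng=None):
    """Choose a small random subset of non-critical pods to terminate."""
    rng = rng or random.Random()
    candidates = [p for p in pods if not p.critical]
    # Cap the blast radius: at most max_fraction of the candidates,
    # and never every candidate at once.
    limit = min(int(len(candidates) * max_fraction), len(candidates) - 1)
    limit = max(limit, 0)
    return rng.sample(candidates, limit)


def plan_experiment(pods, now, error_rate, max_error_rate=0.01):
    """Return the pods to kill, or an empty list if any guard fails."""
    if not in_off_peak(now):
        return []  # outside the off-peak window: skip the experiment
    if error_rate > max_error_rate:
        return []  # system already unhealthy: do not add more failure
    return pick_targets(pods)
```

The point of returning a plan instead of acting directly is that the target list can be reviewed or logged before anything is touched, which also makes the rollback conversation with leadership easier.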