How the Incident Retrospective Helps Indeed Deliver Constant Change Safely

Indeed Engineering
Sep 17, 2019

This talk was held on Wednesday, August 21, 2019.

Outages happen. Products break. Every time a failure occurs, it’s an opportunity to learn and improve. Web-based products are incredibly complex. By understanding and managing their complexity, carefully investigating incidents, and improving responses, we can build more reliable products and more resilient systems.

Site reliability engineering manager Alex Elman uses a recent incident at Indeed to demonstrate the benefits of the incident retrospective. Many high-profile events are associated with a seemingly innocuous change. A single change, however, rarely causes an incident on its own. By conducting thorough reviews, organizations can learn a lot about how their systems respond to failure. Applying these lessons helps organizations increase the capacity of their systems to adapt and absorb change.

Speaker

Alex Elman has studied and practiced resilience engineering at Indeed for seven years. His goal: reduce failure within distributed systems to a boring nonevent. Even after moving into a leadership role, Alex continues to carry a pager because he believes that incident response is always a valuable learning opportunity.

Cross-posted on Indeed Engineering Blog.
