How the Incident Retrospective Helps Indeed Deliver Constant Change Safely
This talk was held on Wednesday, August 21, 2019.
Outages happen. Products break. Every time a failure occurs, it’s an opportunity to learn and improve. Web-based products are incredibly complex. By understanding and managing their complexity, carefully investigating incidents, and improving responses, we can build more reliable products and more resilient systems.
Site reliability engineering manager Alex Elman uses a recent incident at Indeed to demonstrate the benefits of the incident retrospective. Many high-profile events are associated with a seemingly innocuous change. A single change, however, rarely causes an incident on its own. By conducting thorough reviews, organizations can learn a lot about how their systems respond to failure. Applying these lessons helps organizations increase the capacity of their systems to adapt and absorb change.
Alex Elman has studied and practiced resilience engineering at Indeed for seven years. His goal: reduce failure within distributed systems to a boring nonevent. Even after moving into a leadership role, Alex continues to carry a pager because he believes that incident response is always a valuable learning opportunity.
Cross-posted on the Indeed Engineering Blog.