How the Incident Retrospective Helps Indeed Deliver Constant Change Safely
This talk was held on Wednesday, August 21, 2019.
Outages happen. Products break. Every time a failure occurs, it’s an opportunity to learn and improve. Web-based products are incredibly complex. By understanding and managing their complexity, carefully investigating incidents, and improving responses, we can build more reliable products and more resilient systems.
Site reliability engineering manager Alex Elman uses a recent incident at Indeed to demonstrate the benefits of the incident retrospective. Many high-profile events are associated with a seemingly innocuous change. A single change, however, rarely causes an incident on its own. By conducting thorough reviews, organizations can learn a lot about how their systems respond to failure. Applying these lessons helps organizations increase the capacity of their systems to adapt to and absorb change.
Audio Description
The following video includes a descriptive audio track for this talk.
Transcripts
- Basic transcript (includes audio information only)
- Descriptive transcript (includes audio and visual information)
Speaker
Alex Elman has studied and practiced resilience engineering at Indeed for seven years. His goal: reduce failure within distributed systems to a boring nonevent. Even after moving into a leadership role, Alex continues to carry a pager because he believes that incident response is always a valuable learning opportunity.
Cross-posted on the Indeed Engineering Blog.