How the Incident Retrospective Helps Indeed Deliver Constant Change Safely

Indeed Engineering
Sep 17, 2019

This talk was held on Wednesday, August 21, 2019.

Outages happen. Products break. Every time a failure occurs, it’s an opportunity to learn and improve. Web-based products are incredibly complex. By understanding and managing their complexity, carefully investigating incidents, and improving responses, we can build more reliable products and more resilient systems.

Alex Elman at the podium giving his presentation.

Site reliability engineering manager Alex Elman uses a recent incident at Indeed to demonstrate the benefits of the incident retrospective. Many high-profile events are associated with a seemingly innocuous change, but a single change rarely causes an incident on its own. By conducting thorough reviews, organizations can learn how their systems respond to failure. Applying these lessons helps organizations increase the capacity of their systems to adapt and absorb change.

Audio Description

The following video includes a descriptive audio track for this talk.

Speaker

Alex Elman has studied and practiced resilience engineering at Indeed for seven years. His goal: reduce failure within distributed systems to a boring nonevent. Even after moving into a leadership role, Alex continues to carry a pager because he believes that incident response is always a valuable learning opportunity.

Cross-posted on Indeed Engineering Blog.
