Postmortem / Correction of Error (CoE) template
Below is a template I’ve developed over years of building and operating online services. It draws on various public postmortem templates, as well as on shortcomings and lessons I’ve learned through trial and error.
In an ideal world, failures don’t happen. Until we live in that world, the postmortem template is intended to help teams structure an incident review and facilitate conversation about what can be done better. It’s an opportunity to look introspectively at the team, process, and technology, and to discuss with an open mind how the customer can be better served. These postmortems provide a durable artifact that can be used to share information, train other ops responders, and provide context for future work items that come out of the incident.
A markdown version of the template can be found on GitHub here: https://github.com/JDHarris007/coe
<Incident Name>
OPS Issue: <Link to OPS issue(s)>
Authors: <Name of CoE authors>
Pages/escalations before first accept: <Number of team members the escalation went through before somebody accepted>
Time to first response: <How long from first page until a team member accepted and responded>
Number of team participants: <Number of people who took part in the response>
Incident Description
Provide a high-level description of what happened.
How was the incident detected?
Did we find out about this from an existing alarm, from user complaints, or through some other method?
What were the symptoms / impact?
What behavior did we see? Who was impacted? What services were impacted? What changes did we see on our dashboards/graphs?
What discovery or investigation was done?
What process was used? What tools were used? What did we find in each tool? What were our hunches or assumptions? What did we rule out? Was there an existing runbook covering confirmation and mitigation steps?
Timeline
Lay out the incident timeline, from the time the issue started, through customer impact, to incident resolution. Which tasks had a positive impact on the outcome? Which had a negative impact? Which had no impact on restoring service?
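For illustration, a hypothetical timeline for an invented latency incident might look like the following (times, services, and events are all made up):

14:02 UTC - Deployment of a configuration change to the checkout service completes.
14:10 UTC - p99 latency alarm fires and the on-call engineer is paged. (Positive: fast detection.)
14:25 UTC - On-call accepts the page and starts reviewing dashboards.
14:40 UTC - Initial hunch points at the database; twenty minutes are spent ruling it out. (No impact on restoring service.)
15:05 UTC - The configuration change is identified as the trigger; rollback begins. (Positive.)
15:20 UTC - Rollback completes, latency returns to normal, and customer impact ends.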
5-whys and Root Cause
Start by asking why the failure happened, then keep asking why of each answer until you reach the root cause. An illustrative example is shown at the end of this section.
Q.
A.
Q.
A.
Q.
A.
Q.
A.
Q.
A.
Root Cause:
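For illustration, a hypothetical 5-whys chain for an invented outage might read as follows (all details are made up):

Q. Why did checkout requests fail?
A. The checkout service exhausted its database connection pool.
Q. Why was the pool exhausted?
A. A recently shipped feature opened a new connection per request and never released it.
Q. Why wasn’t the leak caught before release?
A. Our load tests don’t exercise that code path.
Q. Why don’t the load tests cover it?
A. The load-test suite hasn’t been updated since the endpoint was added.
Q. Why wasn’t updating the load tests part of the feature work?
A. Our definition of done doesn’t require load-test coverage for new endpoints.

Root Cause: The release process does not require load-test coverage for new code paths, so the connection leak reached production undetected.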
How was the issue resolved?
How was the issue fixed, and what resources or teams were required to do so?
Were there existing backlog items for this issue? Was this a known failure mode?
Did we already know about this potential failure mode? Did we have scheduled work? Were there backlog items that would have prevented this issue?
Overall learnings and recommendations
How can we do better (in alarms, process, automation, response, etc.)? What could we have done to prevent this issue from occurring? How do we make sure this never happens again? What can we do to improve how the incident was handled? Think big, think outside the box.
What went well?
What went wrong?
Where did we get lucky?
Recommendations
What are the actionable tasks and follow-ups?
What follow-up actions are we taking? What scheduled work was created? What runbooks need to be updated or created? Include links to tickets, owners, and due dates.