I’d like to express my gratitude to my colleague and friend Arni Birgisson for his valuable feedback.
Since I published my blog series Towards Operational Excellence, I have received a fair amount of feedback. One question in particular stood out.
“Can you share an incident postmortem template?”
In this blog post, I will share an example incident postmortem template, which I hope will help you get started. I will also share some DOs and DON’Ts that I have seen work across a wide variety of customers — both internally in Amazon, and externally.
What is a postmortem?
A postmortem is a process where a team reflects on a problem — for example, an unexpected loss of redundancy, or perhaps a failed software deployment — and documents what the problem was and how to avoid it in the future.
“Postmortems are not about figuring out who to blame for an incident that happened. They are about figuring out, through data and analysis, what happened, why it happened, and how it can be stopped from happening again.” — Arni Birgisson
At Amazon, we call that process Correction-Of-Errors (COE), and we use it to learn from our mistakes, whether they’re flaws in tools, processes, or the organization.
We use the COE to identify contributing factors to failures and, more importantly, drive continuous improvement.
To learn more about our COE process, please check out my favorite re:Invent 2019 talk from Becky Weiss, a senior principal engineer at AWS.
Incident Postmortem Template
Below is an example of an incident postmortem template.
I do not claim that this template is perfect, just that it’s an example that can help you get started.
If you think something is missing, if you agree or disagree strongly about a particular part of that template, please share your feedback with me by leaving a comment below.
For all the let-me-get-straight-to-the-point champions out there, here is a bare-bones template.
In this extended-cut version, I will expand on each of the different parts of the template, suggesting what could belong to each section.
Descriptive title (Service XYZ failed, affecting customers in the EU region)
Date of the event.
Name of the owner of the postmortem process.
List of people who will verify the quality of the postmortem before publishing it.
List of tags or keywords to classify the event and facilitate future search and analysis.
Example: Configuration, Database, Dependency, Latent
A summary of the event.
Metric graphs, tables, or other data, that best illustrate the impact of this event.
Discuss customer impact during the event. Explicitly mention the number of impacted customers.
Incident Response Analysis:
Examples of questions you could address:
Was the event detected within the expected time?
How was it detected? (e.g., alarm, customer ticket)
How could time to detection be improved?
Did the escalation work appropriately?
Would earlier escalation have reduced or prevented the event?
How did you know how to mitigate the event?
How could time to mitigation be improved?
How did you confirm the event was entirely mitigated?
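To answer the detection and mitigation questions above with numbers rather than impressions, it can help to compute the durations directly from recorded timestamps. The following is a minimal sketch, not a prescribed method: the three timestamps, the fixed UTC+3 offset standing in for EEST, and the helper name `minutes_between` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+3 offset standing in for EEST (an assumption that keeps the sketch self-contained).
EEST = timezone(timedelta(hours=3), "EEST")

# Hypothetical timestamps for a single event.
started = datetime(2020, 6, 29, 9, 19, tzinfo=EEST)    # customer impact begins
detected = datetime(2020, 6, 29, 9, 27, tzinfo=EEST)   # alarm fires
mitigated = datetime(2020, 6, 29, 10, 2, tzinfo=EEST)  # customer impact ends


def minutes_between(earlier, later):
    """Return the whole minutes elapsed between two timezone-aware datetimes."""
    return int((later - earlier).total_seconds() // 60)


time_to_detection = minutes_between(started, detected)
time_to_mitigation = minutes_between(started, mitigated)

print(f"Time to detection: {time_to_detection} min")
print(f"Time to mitigation: {time_to_mitigation} min")
```

Comparing these numbers against whatever detection and mitigation targets your team has set gives the first two questions a concrete answer.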
Examples of questions you could address:
How were the contributing factors diagnosed?
How could time to diagnosis be improved?
Did you have an actual backlog item that could’ve prevented or reduced the impact of this event? If yes, why was this item not done?
Could a programmatic verification rule (e.g., AWS Config) be used to prevent this event?
Did a change trigger this event?
How was that change deployed — automatically or manually?
Could safeguards in the deployment have prevented or reduced the impact of this event?
Could this have been caught and rolled back during the deployment?
Was this tested in a staging environment? If yes, why did this pass through? Could more tests have prevented or reduced the impact of this event?
If this change was manual, was there a playbook? Was that playbook practiced, tested, and reviewed recently?
Did a specific tool/command trigger the event? Could safeguards have prevented or reduced the impact of this event? Was there any safeguard triggered? If not, why were none in place?
Was a production operation readiness or well-architected review performed on the system(s)? If not, why? When was the last evaluation done?
Could a review have prevented or reduced the impact of the event?
Detail all major event points with their time (including the time zone) and a short description.
Example: 09:19 EEST, database ran out of connections. Link to graph and log.
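If you keep the timeline as structured data, a small helper can enforce the time zone requirement and keep entries in chronological order. This is only a sketch under stated assumptions: `TimelineEntry`, `render_timeline`, the second timeline entry, and the `example.com` link are hypothetical, and the fixed UTC+3 offset stands in for EEST.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Fixed UTC+3 offset standing in for EEST (an assumption that keeps the sketch self-contained).
EEST = timezone(timedelta(hours=3), "EEST")


@dataclass
class TimelineEntry:
    when: datetime
    description: str
    evidence: str = ""  # hypothetical link to a graph or log

    def __post_init__(self):
        # Reject naive datetimes so every entry states its time zone.
        if self.when.tzinfo is None:
            raise ValueError("timeline entries must carry a time zone")


def render_timeline(entries):
    """Return the entries as text lines, sorted chronologically."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.when):
        line = f"{entry.when:%H:%M %Z}: {entry.description}"
        if entry.evidence:
            line += f" ({entry.evidence})"
        lines.append(line)
    return lines


timeline = render_timeline([
    TimelineEntry(datetime(2020, 6, 29, 9, 27, tzinfo=EEST),
                  "alarm fired on connection errors"),
    TimelineEntry(datetime(2020, 6, 29, 9, 19, tzinfo=EEST),
                  "database ran out of connections",
                  "https://example.com/graph"),
])
print("\n".join(timeline))
```

Rejecting entries without a time zone at creation time is a design choice: it surfaces the omission while the author still remembers the context.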
Diving deep on contributing factors:
Start with the problem.
Keep asking questions (e.g., why?) until you get to multiple contributing factors. There is no single cause for failure, so keep going!
Probe into different directions — tools, culture, and processes.
NEVER stop at human errors (e.g., if an operator enters a wrong command, ask why no safeguards were in place, why the action wasn’t peer-reviewed, and why the command had no rollback).
Define action items against all contributing factors.
Describe what your team is taking away from this event.
What did you learn that will help you in the future to prevent similar events?
What unexpected things happened?
What process broke down?
Lessons learned should correlate directly, if possible, with an action item.
List of action items with a title, an owner, due date, a priority, and a link to the backlog item created to follow up.
Example: Evaluate shorter timeout for GET API 123, adhorn, July 3rd, 2020, high priority, link to a backlog item.
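One way to keep action items from going out incomplete is to check each one for the fields listed above before the postmortem is published. The sketch below is illustrative only: the `ActionItem` class, the `missing_fields` helper, the three-level priority scale, and the `example.com` backlog URL are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed priority scale; adapt it to whatever your backlog tool uses.
ALLOWED_PRIORITIES = ("high", "medium", "low")


@dataclass
class ActionItem:
    title: str
    owner: str
    due: Optional[date]
    priority: str
    backlog_link: str


def missing_fields(item):
    """Return the names of the required fields that are empty or invalid."""
    problems = []
    if not item.title.strip():
        problems.append("title")
    if not item.owner.strip():
        problems.append("owner")
    if item.due is None:
        problems.append("due date")
    if item.priority not in ALLOWED_PRIORITIES:
        problems.append("priority")
    if not item.backlog_link.strip():
        problems.append("backlog link")
    return problems


complete = ActionItem(
    title="Evaluate shorter timeout for GET API 123",
    owner="adhorn",
    due=date(2020, 7, 3),
    priority="high",
    backlog_link="https://example.com/backlog/1",  # hypothetical URL
)
incomplete = ActionItem("Improve documentation", "", None, "urgent", "")

print(missing_fields(complete))
print(missing_fields(incomplete))
```

A check like this could run when reviewers open the document, so an action item without an owner or due date is caught before it reaches the backlog.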
Things to do when doing a postmortem
- Generally, select senior, experienced owners and reviewers to ensure the high-quality completion of the postmortem.
- Proper postmortems dive deep on the issues. Nothing is left unanswered unless it becomes an action item.
- Question your assumptions, be aware of heuristics, and fight biases** (see below).
- Reviewers should be fully empowered to reject a postmortem for not meeting a high-quality bar.
- Review recent postmortems in meetings with the broader organization.
- Be smart about what can be accomplished in the short term; don’t over-promise.
- Use existing postmortems and previous lessons learned to design new “best practice” patterns, and set mechanisms to share the knowledge with the rest of the organization (e.g., present postmortems in weekly operational reviews)
- Codify and automate lessons learned when possible.
- Don’t let postmortems drag on for a long time.
** Heuristics and biases to watch out for (in no particular order):
- The confirmation bias — “the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.”
- The sunk cost fallacy — “the tendency for people to believe that investments (i.e., sunk costs) justify further expenditures.”
- The common belief fallacy — “If many believe so, it is so.”
- The hindsight bias — “the tendency for people to perceive events that have already occurred as having been more predictable than they actually were before the events took place.”
- The fundamental attribution error — “the tendency to believe that what people do reflects who they are.”
Things to avoid when doing a postmortem
- Don’t blame individuals or teams. Similarly, don’t assign or imply blame to others, individuals, teams, or organizations. Instead, identify what happened and question why those things happened.
- Stopping at an operator error isn’t right. It is a sign that you haven’t gone deep enough. Think about the situation that led the operator to trigger the event. Why was the operator able to do such a thing? Was it a lack of proper tools, a problem in the culture, or a missing process?
- Don’t do postmortems punitively. Don’t do a postmortem if no one is going to get value and find improvements.
- Avoid open-ended questions or action items. Action items such as “create training” and “improve documentation” aren’t useful. Either you didn’t go deep enough, or you didn’t need a postmortem.
- Action items should focus on what can be done in the short term to mitigate the event.
- Don’t try to fix everything in your system in a single postmortem. “We need to change the overall architecture of our system now” or “we need to move to Fortran” aren’t the right action items.
- Do not spend an unreasonable amount of time writing postmortems. They should be done relatively fast and to a high quality bar.
- Do not write postmortems on weekends, or in a hurry. It can generally wait until the next Monday.
That’s all for now, folks. I hope you’ve enjoyed this post. I would love to hear what works and what doesn’t work for you, so please don’t hesitate to share your feedback and opinions. Thanks a lot for reading :-)