How to write a postmortem
At some point in their career, every engineer will have to deal with a system that has failed, creating an incident that forces people to drop what they are doing and address it. After the incident is over, we write a postmortem to help us figure out what went wrong and how we’re going to stop it from happening again.
Much has been written about what postmortems should be, but most of the existing material talks about the desired results. This article is an attempt at explaining the process of writing one, for people who need to do it. My background is in software engineering as an SRE, but I would expect this approach to be applicable in other fields.
The goal
This is the best statement I have come up with to define our goal in writing a postmortem:
A good postmortem is a blameless document about an incident, which can be read by any reasonably experienced engineer. It explains what happened and convinces the reader that the list of action items is a viable plan to prevent any similar incident from happening again.
Postmortems are intended to be read by engineers, and should include detailed technical content. They should not simplify or skip relevant details, although they can omit discussion of parts of the system which were not involved in the incident. Where possible, they should reference other documentation that the reader can explore for more detail, rather than reproducing it.
Postmortems need to be blameless. This is often misunderstood: a postmortem is blameless if and only if it identifies failures in systems, not people. There is no such thing as “blameless language”, as no amount of editing will change a statement from describing human error to describing the causes of that error.
A postmortem should explain the root causes of the incident, so that other engineers who were not involved can read it and understand what happened and what went wrong. It should have enough detail for the reader to understand why the action items are necessary and sufficient to address the root causes.
It is necessary for the action items to be realistic, but a postmortem should err on the side of setting ambitious goals that might not be completed, rather than limiting itself to goals that are certain to be achievable. If there is doubt about whether an action item is achievable, include it. The postmortem is complete when you have a plan that would prevent the incident from happening again; it does not need to include the decision-making process of exploring alternatives or prioritising the work.
If a problem cannot be solved, then the postmortem should include action items based on accepting that this problem will keep happening, and what will need to be done as a consequence of that. This is an extraordinary claim to make in a postmortem, so it will need a rigorous explanation.
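Pulling these goals together, the sections described in the rest of this article give a postmortem a natural skeleton. A minimal outline might look like the sketch below (the section names are drawn from this article, except “Summary”, which is a common addition, not a mandated template):

```
Title and date
Summary          – one paragraph: what broke, for how long, who was affected
Timeline         – exactly what happened, in order, with timestamps
Background       – how the system was supposed to work
What went wrong  – the regrettable outcomes and contributing failures
Root causes      – the systemic problems behind the incident
Action items     – the plan that prevents any similar incident
Open questions   – (optional) things not yet understood
```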
Assembling the timeline
The first thing to write in a postmortem is the timeline. This is a detailed record of exactly what happened, in the order it happened. Collect it as soon as possible after the incident, before information is lost or people involved in the incident forget important details.
You have enough detail to proceed when you can tell:
- What participants observed during the incident
- What participants believed was happening, based on their observations
- Every action that participants took
- How the system responded to those actions
- What events triggered the incident
- When the incident started
- When the incident was detected
- When the harmful effects of the incident were over
Expect to keep returning to the timeline and adding more detail as you work on the postmortem.
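As an illustration, a timeline entry can be as simple as a timestamped line recording an observation, belief, or action. A hypothetical fragment (all times, services, and details invented):

```
14:02  Alerting pages on-call: error rate on checkout service above 5%
14:05  On-call observes elevated latency on the database primary
14:09  Believing the primary is overloaded, on-call fails over to the replica
14:11  Error rate drops; on-call believes the failover was successful
14:40  Incident declared over after 30 minutes of normal error rates
```

Note how the entries capture beliefs (“believing the primary is overloaded”) as well as actions, which is exactly the detail needed later when reconstructing why decisions seemed correct at the time.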
While constructing this timeline, you will notice things that went poorly. Make a note of them in the “what went wrong” section of the postmortem, but don’t get distracted by them at this stage.
Assembling the background
The next section of the postmortem to write is the background material, where you collate relevant documentation and training materials. The intended audience of a postmortem is an engineer who is not intimately familiar with the systems which failed, and might need more context to understand what happened. The purpose of this section is to bring everybody up to the same baseline understanding of how the system was supposed to work.
When several people are collaborating on a postmortem, it is likely that they don’t all have the same understanding of these systems, and some of them would find the reference material useful. Once you have the timeline, you should have a reasonable idea which parts of the system are relevant, so this section can now be filled in.
If you are part of a mature team that has written multiple postmortems about the same system, this section should be a simple reference to existing material. If the system’s behaviour is mysterious, this section is where you document your discoveries about its behaviour.
A sidebar on blame
The concept of blamelessness was mentioned earlier. At points during the investigation of root causes, you are going to identify places where a person made a mistake, because the incident could not have happened if no mistakes had been made. Because you are human, you will instinctively blame the person. Recognise this instinct as a reminder that you haven’t found the root cause yet.
When you find a place where a person made a mistake, look closely at the circumstances. Use the timeline to remind yourself what was and wasn’t known at this point. Reconstruct the things which that person observed at this point in time. From this position, attempt to answer the question: “Why did this seem like the correct course of action?” — the person who made the mistake may be able to help you with this part.
Common reasons for mistakes are that the person couldn’t see a critical piece of information, or that they were given so much information that they couldn’t identify the important part. In both cases, the cause is a failure of the system to effectively provide information, which you can address with an action item.
One interesting class of mistakes is where the person made an assumption about how the system behaves, which later turned out to be incorrect. The cause you want to get to here is that the system’s behaviour is surprising to at least some of the people who interact with it, and that nothing about the system gave them reason to doubt their assumption. You can correct this by changing the system to be less surprising, or by adding safety checks that identify when somebody is about to make this mistake and warn them about it.
A particularly insidious version of this is when a person openly blames themselves, and says “I made a mistake, I shouldn’t do that again”. It is emotionally challenging for others to object to this demonstrated humility, but being humble does not make it correct. Be prepared to recognise this pattern and look for the root cause.
Keep this test in mind at all times: blame is when you identify the cause as people, and try to change the people. Blameless is when you identify the cause as problems with the system, and try to change the system.
Determine the root causes
Start with the “what went wrong” list that you have been accumulating. Add anything to it that is missing. Make sure you have covered the primary regrettable outcomes of the incident.
Go through every point on your list, and ask “how did that happen?” repeatedly, until you arrive at a systemic root cause. The usual estimate is that you’ll need to ask “how?” about five times, but it can take more. Recognising when you have reached a systemic root cause can be tricky, but here are a few guidelines:
- A systemic problem is an enduring defect in the system, rather than a single incident. If you are still talking about particular events during this incident, keep asking “how?”
- If you haven’t found some people making a mistake yet, then you probably haven’t looked deeply enough. Most root causes have people in them somewhere. Remember that a blameless postmortem can never stop there, and must look for the systemic problems that caused the mistakes. Answers like “there is this bug in the software” are in this category: keep going until you understand how the bug got there without being detected.
- If the answer has stopped being anything to do with this particular system, you’ve gone too far, and should look back to your earlier answers for the root cause. Answers like “we can’t afford to write more reliable software” are in this category.
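Putting these guidelines together, a hypothetical chain of “how?” questions might run like this (the incident details are invented for illustration):

```
The site returned errors for 30 minutes.
How? A bad configuration change was pushed to all servers at once.
How did a bad change get pushed? The validation step doesn’t check this field.
How did it reach every server at once? The deploy tool has no staged rollout.
How has this gone unnoticed until now? Nothing requires staged rollouts, and
no previous change to this field has ever failed.
→ Systemic root causes: unvalidated configuration fields, and a deploy
  system that allows unstaged global rollouts.
```

Both of the final answers describe enduring defects in the system rather than events in this incident, which is the signal that you can stop asking “how?”.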
Once you’ve found the root cause, work back in the other direction to write the postmortem: tell the story of the series of events that led to the regrettable outcome.
In particularly large or difficult postmortems that involve a lot of people, a useful technique can be to create an “open questions” section, with all the things that you don’t yet know the answers to. This helps to focus the collaborators on the next steps.
Action items
The list of action items is the set of things you propose doing to rectify the problems identified. For each problem that you found, there should be one or more action items: at least one that addresses the root cause, and possibly more to address earlier steps in the chain of causes. A common approach is to have one action item that mitigates the problem quickly by addressing the observable symptom, and another that prevents the problem from recurring by addressing the root cause.
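Continuing the hypothetical configuration-change example from earlier, the mapping from problem to action items might look like:

```
Problem: a bad configuration change reached every server at once.
- Mitigate: add validation for the specific field that caused this incident
- Prevent:  validate all configuration fields against a schema before deploy
- Prevent:  make the deploy tool roll out changes in stages, halting
            automatically if error rates rise
```

Each item traces back to a problem in the postmortem, and the “prevent” items address the systemic root causes rather than only this incident’s symptom.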
Every action item should be clearly related to the problems identified. At this stage, engineers are thinking about ways to change the system, so it is common for them to shift into a project planning mindset and start adding other things they would like to do. A postmortem is not a useful way to plan work unrelated to the incident, so recognise this as a failure mode and correct for it.
Double check every action item to see if it is an attempt to change people instead of systems. Some common ways to disguise these attempts are “train the people” or “write some documentation”. Both of these are ways to write “tell the people involved to act differently next time” without making it obvious that this is a form of blame. It can help to restate the problem as: the way people acted in this incident is empirical data and we need to change the design of the system so that it does not fail when people act this way.
Review
Have the postmortem reviewed by engineers who were not involved in the incident and are not directly responsible for this system. It is difficult to tell whether you have written the document in a way which explains everything coherently, and an independent perspective will find important problems.
My suggested approach is for reviewers to work from this list of questions:
- Are you confident that you understand the postmortem’s explanation of how the outage occurred?
- For each action item, is it clear how this addresses a problem identified in the postmortem?
- For each action item, is it clear how completing this will result in the problem not happening again?
- If all of the action items were completed, are you confident that no similar incident would happen again?
Further material
- The SRE workbook has some examples of postmortems, and discussion of what good and bad looks like.
- Etsy’s approach to blameless postmortems
- Another angle from Salesforce, where I learned to ask “how?” instead of “why?”