Post-Mortems

Engineering Insights

Talin · Published in Machine Words · 3 min read · Feb 3, 2019

Mortui Vivos Docent — “the dead teach the living”

Today’s topic is writing engineering post-mortems.

A post-mortem document records the details of a serious system failure: what went wrong, and what was done to recover from it. Companies differ on what qualifies as “serious”, but the rule of thumb is that if end users were significantly impacted by the failure, you’ll probably want to write a post-mortem. Conversely, you probably won’t write post-mortems for failures that occur during QA testing or during development.

What’s the purpose of this exercise? There are several.

The first is that companies need to learn from their mistakes. Having a record of exactly what happened means that painful lessons won’t be forgotten. Ideally, you should have a folder or page on your company’s internal wiki containing all of your post-mortem documents.

Another reason has to do with legal liability and corporate governance. If your company is seeking investment, records like this are one of the things that investors will look for. Although you might think that it’s better for an investor not to know about your mistakes, think again: investors are well aware of the difficulties involved in bringing an idea to market, and are likely to be suspicious if you try to whitewash your project’s history.

If you are in a highly regulated industry, such as medical device manufacturing, you will also have to deal with documentation requirements set by regulatory bodies. Those agencies will likely have specific standards for the kinds of reports they expect.

The information contained in a post-mortem document is gathered from all of the people who worked on detecting and solving the problem. In many cases, the team will meet to discuss what happened; the document is the outcome of this meeting. However, in some cases a single individual will already have all the facts and can author the document without consulting the other team members (although they should all review it afterwards).

Here’s a basic template that you can use for creating a post-mortem document:

[Date] (Title)

Description

(A one-line summary of the incident)

Timeline

  • (date/time) — (failure was first observed)
  • (date/time) — (underlying cause diagnosed)
  • (date/time) — (fix proposed / plan of action agreed on)
  • (date/time) — (fix implemented and deployed)
  • (date/time) — (fix verified)

Analysis

Nature of the problem

(Describe the problem in general terms.)

Description of the specific fault

(Describe the specific symptoms observed. May also discuss the underlying causes. If the problem was the result of a software bug, tell when / why the bug was introduced.)

Impact on end users

(Describe effects on end-users, if any.)

Why the fault wasn’t seen and prevented earlier

(Explain why this behavior was not observed previously.)

Description of the fix

(Describe what steps were taken to fix the problem.)

Recommendations for preventative measures to be taken in the future

(e.g. suggest a test or process change that would have detected the problem.)
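
If you would rather keep post-mortems as structured data than as free-form wiki pages, the template above can be sketched in a few lines of Python. This is only an illustration, not a standard format: the class names, field names, and the `render` method are all made up for this example, and the incident shown is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    when: datetime
    what: str


@dataclass
class PostMortem:
    # Field names are illustrative; each one maps to a heading in the template.
    date: str
    title: str
    description: str
    nature_of_problem: str
    specific_fault: str
    user_impact: str
    why_not_caught: str
    fix: str
    recommendations: str
    timeline: List[TimelineEntry] = field(default_factory=list)

    def render(self) -> str:
        """Return the document as plain text, in the template's section order."""
        lines = [f"[{self.date}] {self.title}", "",
                 "Description", "", self.description, "",
                 "Timeline", ""]
        # Sort entries so the timeline always reads oldest to newest.
        for entry in sorted(self.timeline, key=lambda e: e.when):
            lines.append(f"  • {entry.when:%Y-%m-%d %H:%M} - {entry.what}")
        lines += ["", "Analysis"]
        sections = [
            ("Nature of the problem", self.nature_of_problem),
            ("Description of the specific fault", self.specific_fault),
            ("Impact on end users", self.user_impact),
            ("Why the fault wasn’t seen and prevented earlier", self.why_not_caught),
            ("Description of the fix", self.fix),
            ("Recommendations for preventative measures to be taken in the future",
             self.recommendations),
        ]
        for heading, body in sections:
            lines += ["", heading, "", body]
        return "\n".join(lines)


# A hypothetical incident, invented purely to exercise the sketch.
example = PostMortem(
    date="2024-01-15",
    title="Example login outage (hypothetical)",
    description="Login requests failed for roughly ninety minutes.",
    nature_of_problem="An expired credential blocked the login service.",
    specific_fault="Login requests returned errors after the credential expired.",
    user_impact="Users could not sign in during the outage.",
    why_not_caught="Nothing monitored the credential's expiry date.",
    fix="The credential was renewed and redeployed.",
    recommendations="Add an alert that fires well before expiry.",
    timeline=[
        TimelineEntry(datetime(2024, 1, 15, 10, 30), "fix verified"),
        TimelineEntry(datetime(2024, 1, 15, 9, 0), "failure was first observed"),
    ],
)
text = example.render()
print(text)
```

Keeping the fields explicit makes it harder to skip a section, and the rendered text can still be pasted into whatever wiki page holds your other post-mortems.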
