Postmortem — Praise the incident
When the shit hits the fan, most of us simply want to bury ourselves deep in the ground and only come out when the storm is over. In this article I write about my experiences with postmortems and why I find it important to embrace every incident as a chance to get better.
When an incident happens, everyone of course concentrates on fixing it first. Once the systems are running again, many simply shrug it off and carry on as before, but the really good teams stop at this point and take the time to work through the incident thoroughly in a postmortem.
What’s a postmortem?
A postmortem is a process similar to a forensic team meticulously analyzing a crime scene. The goal is that everyone who is involved in running the service affected by the incident fully understands what happened. With this understanding, it should then be possible to think about ways to mitigate the incident should the initial cause occur again.
The answers to these questions are essential:
- What exactly happened?
- What did the chain of events look like?
- What was in line with expectations?
- What was unusual?
The goal is not to find a “guilty” individual but to document the steps that led to the incident, how the incident was discovered, and how it was mitigated. Finally, make all of that as transparent as possible so you can discuss how to get better and agree on the next steps for improvement.
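The questions above can be captured in a small record so that every postmortem answers them in the same shape. The following is only a minimal sketch: the class names, the field names, and the example incident are all made up for illustration, and there is deliberately no field for "who did it".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One step in the chain of events (hypothetical structure)."""
    at: datetime
    what: str
    expected: bool  # True if this step was in line with expectations


@dataclass
class Postmortem:
    """Blameless incident record: it stores what happened, not who did it."""
    title: str
    what_happened: str
    how_discovered: str
    how_mitigated: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def chain_of_events(self) -> list[TimelineEntry]:
        """Return the timeline sorted by time, oldest first."""
        return sorted(self.timeline, key=lambda entry: entry.at)

    def unusual_events(self) -> list[TimelineEntry]:
        """Return the steps that were not in line with expectations."""
        return [entry for entry in self.chain_of_events() if not entry.expected]


# Made-up example incident, for illustration only.
pm = Postmortem(
    title="Shop unavailable",
    what_happened="The shop returned errors for all requests.",
    how_discovered="Monitoring alert on the error rate.",
    how_mitigated="Rolled back the last configuration change.",
    timeline=[
        TimelineEntry(datetime(2024, 1, 1, 10, 5), "Error rate rises", expected=False),
        TimelineEntry(datetime(2024, 1, 1, 10, 0), "Config change deployed", expected=True),
        TimelineEntry(datetime(2024, 1, 1, 10, 7), "Alert fires", expected=True),
    ],
    next_steps=["Validate configuration before deployment"],
)

for entry in pm.unusual_events():
    print(entry.at.isoformat(), entry.what)
```

Keeping the unusual events separate from the full chain of events makes it easier to focus the discussion on where reality diverged from expectations.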
What if it was a human mistake?
My grandpa used to say: “Only the one who does nothing isn’t doing something wrong — and even that one isn’t doing it right.”
We all make mistakes every once in a while. It is important to ignore who made the mistake and to concentrate on why the error was made and, more importantly, why this error could lead to an incident.
Think about it: even if there is a red button, maybe it is not a good idea that the bombs go off immediately when someone pushes it by accident? Maybe it would be better if you had to push two buttons or enter a code or, even better, if two separate individuals had to act at the same time?
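The "two separate individuals" idea can be sketched in a few lines of code. This is a hedged illustration only, not a real access-control system: the class `TwoPersonGuard`, its method `confirm`, and the names used below are hypothetical, and the sketch relaxes "simultaneously" to "two distinct confirmations".

```python
from typing import Callable


class TwoPersonGuard:
    """Runs a dangerous action only after two different people confirmed it."""

    def __init__(self, action: Callable[[], None], required: int = 2) -> None:
        self._action = action
        self._required = required
        self._confirmed_by: set[str] = set()
        self._fired = False

    def confirm(self, person: str) -> bool:
        """Record a confirmation and return True once the action has run."""
        if self._fired:
            return True  # never run the action a second time
        self._confirmed_by.add(person)  # a set ignores repeated confirmations
        if len(self._confirmed_by) >= self._required:
            self._fired = True
            self._action()
        return self._fired


log: list[str] = []
guard = TwoPersonGuard(lambda: log.append("red button pressed"))

first = guard.confirm("alice")   # False: only one person so far
again = guard.confirm("alice")   # False: the same person does not count twice
second = guard.confirm("bob")    # True: a second person confirmed, action runs
print(first, again, second, log)
```

With such a guard, a single accidental push can no longer trigger the action, which moves the postmortem question from "who pushed it?" to "why was one push enough?".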
Do good and talk about it
So after you have documented the chain of events, discussed what could be done better, and created tickets for the next steps, don’t forget to talk about what you did. Write a short summary with the root cause and the steps you want to take to mitigate it in the future, and send it out to the whole company.
I can’t stress enough that this is an essential step in building trust across departmental boundaries. Above all, colleagues whose focus is not primarily on operations benefit enormously from this. It shows that you are not hiding anything and that you are not covering up unprofessional work.
Don’t think that it is embarrassing to send the postmortem report to the whole company. Even if you have had more incidents than usual lately, it only shows that you may benefit from more resources, and that is definitely something you want to have addressed.
Be transparent and give other departments the chance to talk to you. It builds trust and gives others the chance to help you, so the whole company gets better at what it does.
More information
If you want to dig deeper into how to do postmortems, visit https://www.atlassian.com/incident-management/postmortem