Photo by Jungwoo Hong on Unsplash

Problem Postmortems

We can learn and improve from more than just critical Incidents

Nick Gibbon
Published Dec 13, 2021

Learning and Improvement

One way to improve and learn in groups is to think and plan, do good things and then share those experiences.

But often things don’t go to plan. It’s common knowledge that we can learn a lot from mistakes and so it’s important to put systems in place to capture this value at a team and organisational level.

Retrospectives

One way to do this is simply to look back over a period of time (or a work cycle), consider what went well and what didn’t, and then use this information to do better in the future. In Agile© terminology this is a Retrospective[0,1].

Incident Postmortems

Another great way is to isolate and learn from specific negative events — Incidents — via Blameless Postmortems[3]. We should be continually putting a more positive or at least stoic spin on negative events. There’s no use crying over spilt milk. In fact, there’s no use wasting any negative energy on the past at all. Whatever has happened has happened and now is the time to invest energy into learning and producing a better future.

Excellent! Flailing around and shouting at each other became Post Incident Reviews and they have become cooler, calmer Blameless Postmortems. But I still feel that they are often treated with more pomp and circumstance than is actually helpful. And I get it! Larger incidents attract more senior stakeholders. Even when going at this with the best intentions and mindset this is a serious and important activity. This intensity can make people anxious about postmortems. Was this event bad enough to trigger a postmortem?

Problem Postmortems

I am making the case for triggering more postmortems. Problems occur all of the time inside technology organisations with teams providing and consuming things for each other. Just because there isn’t some massive outage or disaster doesn’t mean that meaningful value can’t be derived.

The trigger criterion is simple: create one whenever you see an opportunity to help others. I suggest doing so after mitigating or resolving a complicated, interesting or sticky problem, and also when you can identify a trend of similar problems recurring frequently.
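To make the two triggers concrete, here is a minimal sketch of how a team might scan a log of resolved problems for postmortem candidates. Everything in it is an assumption for illustration: the `Problem` record, its fields, the `postmortem_candidates` function, and the 90-day window and three-repeat threshold are hypothetical, not values from any real tool. In practice the judgement call of "can this help others?" still belongs to a person.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Problem:
    """A resolved problem. All fields are hypothetical, for illustration only."""
    summary: str
    category: str             # free-form label, e.g. "expired-certificate"
    resolved_on: date
    noteworthy: bool = False  # someone judged it complicated / interesting / sticky


def postmortem_candidates(problems, today, window_days=90, min_repeats=3):
    """Return (noteworthy summaries, recurring categories) for the recent window.

    window_days and min_repeats are arbitrary example thresholds.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [p for p in problems if p.resolved_on >= cutoff]

    # Trigger 1: individual problems flagged as worth sharing.
    noteworthy = [p.summary for p in recent if p.noteworthy]

    # Trigger 2: categories that keep recurring within the window.
    counts = Counter(p.category for p in recent)
    recurring = sorted(c for c, n in counts.items() if n >= min_repeats)

    return noteworthy, recurring
```

Calling `postmortem_candidates(log, today=date.today())` on such a log would give a short list of things worth writing up, which keeps the bar low without requiring anyone to declare an Incident.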

It doesn’t need to be a tome. Lightweight will do. The benefit simply needs to outweigh the cost, and I am arguing that, if you really think about it, there are lots of problems where the benefit would outweigh the cost and therefore lots of opportunities are currently being missed.

The cost is always the time to write up a problem plus the time to discuss it in a group.

The benefit is knowledge sharing, context for issues and idea generation for strategic solutions, which ultimately means increased productivity and fewer errors. It also creates resources that help new people learn in the future and reduces dependency on particular people over time.

These should be very relaxed affairs: coffee-and-cake-type stuff. That should also provide a good avenue for increasing bonding, collaboration and the feeling of product ownership within teams.

Hopefully, more frequent positive experiences in this domain can contribute to a better learning culture and lead to better, more productive postmortems for major incidents too. It’s good practice!

incident.io

I have found a really interesting organisation building Incident-related tooling, founded by technical operations experts. They have a good blog and their views on incidents align with my own. Check them out.

Resources

Nick Gibbon
Pareture

Software reliability engineer & manager in cloud infrastructure, platforms & tools.