Postmortem 2.0

Dolev Pomeranz
7 min read · Sep 22, 2022


What happens after an incident occurs? It could be system downtime, data loss, or even a security breach. These types of incidents can occur at every company. Do you feel your organization is efficiently learning from its experience? I specifically ask about the organizational level and not the personal level. What does it even mean to be efficiently learning at the organizational level? For starters, how many times have you had the feeling that the same mistakes are repeating themselves? It’s a clear sign that organizational learning isn’t efficient.

Some organizations have a formal postmortem process. It’s a reflection process that aims to help us learn from our experience. However, that doesn’t guarantee great outcomes. In this post, I will share battle-tested ideas on how to level up and set up a robust and efficient postmortem process. They will help you avoid major pitfalls that can cripple postmortem processes. They will also give you tools to squeeze out all the knowledge that can be gained from such incidents. This, in turn, will help you avoid repetitive production failures and improve system health.

Let’s start with the following questions. They will help you evaluate your current situation.

  • How many postmortems did your organization have in the last year?
  • Given a specific postmortem, how long will it take you to understand its current status?
  • What areas of the system yield the most problems?
  • What percentage of postmortem action items were completed on average across all postmortems?
  • Are we currently creating more postmortem action items than resolving them?

How easy is it for you to answer these questions? Implementing the approach below will enable you to answer all of them immediately.

Postmortem 2.0 principles

The ideas I discuss are manifested into a few principles that capture the essence of the Postmortem 2.0 approach.

(1) Mindset

A major prerequisite of successful learning is a positive mindset. If people refrain from formally performing postmortems, the entire process is put at risk.

The lava river and the gold river

The day-to-day life of a company provides us with a stream of incidents. Most people have a negative view of such incidents, as if they were a stream of lava: a lava river you should keep away from as much as possible. That’s mainly because these incidents are bundled with negative feelings of failure. This pushes people away from even doing postmortems, and the opportunity to gain practical knowledge is lost.

A lava river. (I created this image using the Stable Diffusion Artificial Intelligence model.)

We all have a theory on how things work: a mental model that represents our understanding. But there is a gap between that theory and practice. As a saying often attributed to Albert Einstein goes:

“In theory, theory and practice are the same. In practice, they are not.”

Practical knowledge is more important than theoretical knowledge. It’s a prime source of learning that can even adjust our understanding, narrowing the gap between theory and practice. That’s why it’s so important to change the mindset and try to gain as much wisdom as possible from each incident, as if it were a stream of golden opportunities to embrace. Remember, the incident’s tuition was already paid. You should now harvest what knowledge you can from it.

A river with golden opportunities. Don’t miss them! (I created this image using the Stable Diffusion Artificial Intelligence model.)

Blame games

The first and second rules of any postmortem process are:

  1. Postmortems are not blame games.
  2. Postmortems are not blame games.

Postmortems must be a safe and constructive process. Otherwise, you’ll find yourself in a “fight club”. It’s very easy to fall into an “us vs. them” pattern, so you must align all participants with the expected mindset, which is key to success.

(2) Single Source of Work

What is the postmortem process deliverable? In many cases, it will be a report that includes a list of action items. This report is manifested as an email or as a document (e.g., PDF). There are some downsides to doing that:

  • An email reaches only its distribution list, so not everyone can access it.
  • You can’t tell the status of the different action items.
  • You can’t query the data from many reports and present it in any BI tool.

Postmortems in your ticketing system

A way to overcome these downsides is to produce what I define as Single Source of Work (SSoW). Postmortems should be seen as a task like any other task.

Let’s take Jira as an example. A practical piece of advice is to open a reflection project. This project will be home to all reflections. One ticket type could be postmortems, another could be retrospectives.

Having the action items as tickets too is an important part of an SSoW approach. Think about it: how likely is an action item in an email to be completed? How does that change if it’s part of the ticketing system?

The icing on the cake comes from linking action items to the postmortem ticket. It enables a quick and up-to-date view of the action items’ statuses:

A reflection ticket’s list of linked issues. How easy is it to understand the reflection status?

With a glance, you can see that all action items for this postmortem are done. You don’t need to bother anyone, or read any follow-up email, to understand the current status.
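The same at-a-glance view can be computed from the linked tickets. Below is a minimal sketch in plain Python, not a call into any real ticketing API: the `ActionItem` class, the ticket keys, and the status names are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActionItem:
    key: str     # hypothetical ticket key, e.g. "REF-101"
    status: str  # hypothetical workflow status: "To Do", "In Progress", "Done"


def completion_summary(items: List[ActionItem]) -> Dict[str, object]:
    """Summarize the action items linked to a single postmortem ticket."""
    total = len(items)
    done = sum(1 for item in items if item.status == "Done")
    return {
        "total": total,
        "done": done,
        # A postmortem with no linked items is not treated as complete.
        "all_done": total > 0 and done == total,
        "percent_done": round(100 * done / total) if total else 0,
    }


# Invented sample data: three action items linked to one postmortem.
linked = [
    ActionItem("REF-101", "Done"),
    ActionItem("REF-102", "Done"),
    ActionItem("REF-103", "In Progress"),
]
summary = completion_summary(linked)
print(summary)
```

Averaging `percent_done` across all postmortems would give one possible answer to the opening question about the percentage of completed action items.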

Structuring the data

In addition to being able to link action items, we should structure the postmortem using ticket fields. Here is a small list of fields you can consider using:

  • Assignee
  • Reporter
  • Created
  • Resolution
  • Priority
  • Components
  • Incident Summary
  • Chain of Events

Structured data is easier to query, and thus easier to use in BI tools for monitoring. Notice that some fields could still contain unstructured data (e.g., incident summary and chain of events).
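To illustrate why structured fields are easier to query, here is a minimal sketch that models the fields listed above as a Python dataclass. The class, the field types, and the sample tickets are assumptions for illustration only; a real ticketing system will have its own schema and query language.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Postmortem:
    key: str                 # hypothetical ticket key
    assignee: str
    reporter: str
    created: date
    priority: str
    components: List[str]    # structured: easy to count and filter
    incident_summary: str    # unstructured free text
    chain_of_events: str     # unstructured free text
    resolution: Optional[str] = None  # None while the postmortem is unresolved


def postmortems_per_component(tickets: List[Postmortem]) -> Counter:
    """Count how many postmortems touched each component."""
    counts: Counter = Counter()
    for ticket in tickets:
        counts.update(ticket.components)
    return counts


# Invented sample data.
tickets = [
    Postmortem("REF-1", "dana", "lee", date(2022, 3, 1), "High",
               ["billing"], "Invoices delayed", "Queue backed up", "Done"),
    Postmortem("REF-2", "sam", "lee", date(2022, 5, 9), "Critical",
               ["billing", "auth"], "Login outage", "Expired certificate"),
    Postmortem("REF-3", "dana", "kim", date(2022, 7, 20), "Low",
               ["search"], "Slow search", "Index rebuild", "Done"),
]

component_counts = postmortems_per_component(tickets)
unresolved = [t.key for t in tickets if t.resolution is None]
print(component_counts.most_common(1), unresolved)
```

The structured fields (components, resolution) answer questions like "what areas of the system yield the most problems?" with a one-line query, while the free-text fields remain available for human readers.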

(3) Monitoring

Now that the data is structured it’s time to build some dashboards! I recommend two:

Reflections dashboard — showing widgets like:

  • Reflections by type
  • Resolved vs unresolved reflections
  • Count of reflections per component
  • Count of reflections per assignee

Reflections action items dashboard — showing widgets related only to the action items. For example:

  • Action items by type (bugs / tasks / epics / etc.)
  • Resolved vs unresolved action items
  • Count of unresolved action items by assignee

Here’s an example of a graph that shows the number of action items created vs the ones that are resolved. Hopefully, the trend downwards shows improvement over time, as the system becomes more stable.

Action items originating from reflection processes: created (red) vs resolved (green) over a period of 900 days.
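A created-vs-resolved view like this one can be derived from two dates per action item. The sketch below is an assumption about how such a widget might be computed, not how any particular BI tool does it; the dates are invented sample data.

```python
from collections import Counter
from datetime import date
from typing import List, Optional, Tuple

# Each action item is a (created, resolved) pair; resolved is None while open.
# Invented sample data.
items: List[Tuple[date, Optional[date]]] = [
    (date(2022, 1, 10), date(2022, 1, 25)),
    (date(2022, 1, 20), date(2022, 3, 2)),
    (date(2022, 2, 5), None),
    (date(2022, 2, 14), date(2022, 2, 28)),
    (date(2022, 3, 8), date(2022, 3, 30)),
]


def monthly_flow(items):
    """Return (month, created, resolved, open_backlog) rows in month order."""
    created = Counter(c.strftime("%Y-%m") for c, _ in items)
    resolved = Counter(r.strftime("%Y-%m") for _, r in items if r is not None)
    rows = []
    open_backlog = 0
    for month in sorted(set(created) | set(resolved)):
        # The backlog grows with created items and shrinks with resolved ones.
        open_backlog += created[month] - resolved[month]
        rows.append((month, created[month], resolved[month], open_backlog))
    return rows


flow = monthly_flow(items)
for row in flow:
    print(row)
```

If the last column keeps growing, you are creating more postmortem action items than you are resolving.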

(4) Reflection meeting

Any postmortem process should have a reflection meeting. This climax is an important ceremony that has several goals:

  1. Sharing awareness — Having all stakeholders on the same page.
  2. Harvesting more action items — Brainstorming more ideas on what needs to be improved. This ensures higher efficiency per postmortem process.
  3. Strengthening the team — Sometimes, stakeholders might need to vent. It’s an excellent time to rebuild any broken trust between parties and reinforce a one-team mentality.

The meeting should be moderated by the postmortem owner. A simple structure for the meeting could be:

  • Going over the ticket. Most people won’t invest the time reading the ticket, so the meeting allows us to reach a bigger audience. Share the factual findings and then add your interpretation as the postmortem owner.
  • Doing a round of discussion. It’s important that each participant has the opportunity to share their thoughts. Often, this discussion will lead the team into identifying more action items.

(5) Priority

Now we have action items as tickets, and they are easily tracked. It’s time for them to become first-class citizens in the development process.

I believe it is best to have a fast lane for action items originating from postmortem processes. Whenever we plan the next cycle of work, we make sure to prioritize those tickets.

We implemented this framework at the R&D level. Our program manager pushed teams to allocate 20% of their time to quality-based tasks, mostly postmortem action items.
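One way to picture such a fast lane during planning is sketched below. This is only an illustration under stated assumptions, not the process we actually used: the `Ticket` class, the estimates in days, and the greedy fill order are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    key: str               # hypothetical ticket key
    estimate_days: float   # hypothetical effort estimate
    from_postmortem: bool  # True for postmortem action items


def plan_cycle(backlog: List[Ticket], capacity_days: float,
               quality_share: float = 0.20) -> List[Ticket]:
    """Greedy sketch: fill the reserved share first, then the rest of the cycle."""
    quality_budget = capacity_days * quality_share
    planned: List[Ticket] = []
    planned_keys = set()
    used = 0.0
    # Fast lane: postmortem action items go in first, up to the reserved budget.
    for ticket in backlog:
        if ticket.from_postmortem and used + ticket.estimate_days <= quality_budget:
            planned.append(ticket)
            planned_keys.add(ticket.key)
            used += ticket.estimate_days
    # Everything else, in backlog order, while capacity remains.
    for ticket in backlog:
        if ticket.key not in planned_keys and used + ticket.estimate_days <= capacity_days:
            planned.append(ticket)
            planned_keys.add(ticket.key)
            used += ticket.estimate_days
    return planned


# Invented sample backlog for a 50-day cycle (10 days reserved at 20%).
backlog = [
    Ticket("FEAT-1", 20, False),
    Ticket("REF-101", 4, True),
    Ticket("REF-102", 5, True),
    Ticket("FEAT-2", 15, False),
    Ticket("REF-103", 3, True),
    Ticket("FEAT-3", 10, False),
]
planned_keys_in_order = [t.key for t in plan_cycle(backlog, capacity_days=50)]
print(planned_keys_in_order)
```

The point of the sketch is the ordering: postmortem action items claim their reserved share before feature work is considered, so they can’t be crowded out of the cycle.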

(6) Champions

Assigning some people as postmortem champions is key to the successful adoption of this approach. Their responsibility is to promote high-quality postmortems. They can do it via:

  • Training people — Focusing on postmortem owners that are about to conduct their first postmortem.
  • Joining reflection meetings — It’s an excellent way to improve the outcome of any given postmortem. Doing so by ensuring a positive mindset, and setting an example for others as active participants.
  • Improving the process — It is important to continuously improve the process. Tailoring the approach to your specific organizational culture and needs.

I recommend having staff engineers as champions. Their set of skills, experience, and knowledge usually fits very well with the requirements of such a role.

Summary

The principles above promote 3 key values:

  1. Management — Managing postmortems in a systematic way increases their efficiency.
  2. Transparency — An open process that is easily tracked and accessed. This allows everyone to learn from others’ experience.
  3. Safe environment — Only when people truly believe they can share their real thoughts can constructive learning happen. A positive mindset and a safe environment are building blocks of a healthy working culture.

Closing thoughts

Humans are great at learning, but organizations learn differently. It’s not just the collective learning of the organization’s people. It’s also the changes that happen to its systems and processes. An approach like the one I just described can help increase the number, speed, and quality of such changes.

However, there is a mental barrier preventing the adoption of such an approach. It’s the classic barrier to change, and changing is not easy. It requires not only an investment of time and effort from the organization but also a leap of faith, mostly because the approach is not yet established as a best practice.

It won’t be easy to change your organization, but after some pressure and time, change follows. When people start experiencing the intrinsic value of this approach, they become believers. They understand it’s an investment that eventually pays off.

So, what do you say? Care to invest?
