Triumphing Postmortem Killers

Published in Wix Engineering · 9 min read · Oct 23, 2019

One of my (Jewish) new year’s resolutions is that every moment of revelation I experience, I will try and turn into a blog post, to share this knowledge or perspective with the world.

It’s been almost three weeks since the new year and I have 2 blog posts under my belt, so I’ve been keeping my word. So far :)

If you want to catch up on my previous blog post on Wix engineering culture and growth, read here.

The revelations I’d like to share this time are about postmortems. Less about what they are and how important they are, since I assume this is common knowledge by now, or at least highly accessible knowledge. Instead I’d like to focus on what kills postmortems (pun intended) and how we at Wix battled these silent killers to make postmortems great and effective again!

“You can’t go back and change the past, so look to the future and don’t make the same mistake twice”

At Wix we deal with thousands of microservices and hundreds of deployments a day, a multi-cloud environment, and numerous internal and external dependencies, all used by hundreds of millions of users.

Things are bound to go wrong, and they do. Breaking production is part of our daily routine and is almost inevitable.

What is a must, though, is learning from these issues, not repeating similar mistakes and continually improving our resilience. Otherwise issues and instabilities could repeat and grow, bringing us to a halt.

This learning is achieved mainly through postmortems.

Postmortems allow us to objectively review the trail of events, find pitfalls and gaps in our procedures, knowledge, products and tools, and map ways to improve going forward, either by avoiding similar issues or by improving detection and resolution time for future issues.

With this being common knowledge, you would expect everyone to simply conduct constructive postmortems and move on.

Right?

Well, not exactly.

When postmortems die

With good intentions (as well as a smaller scale and fewer production issues) in mind, we used to have a fairly rigid, comprehensive, long template for conducting postmortems. We expected every postmortem to be performed in a meeting with all stakeholders, people involved in the incident and other representatives.

What we found was that these full-fledged postmortems rarely happened in reality. The effort required to gather all the relevant people and fill in this long template pushed people away from actually conducting them.

I call postmortem killers ‘silent’ because they are one of those things you figure out without anyone actually telling you about them. Postmortems dwindle, and then they die. And that’s bad for your production resiliency.

And the silent killers are…🥁

Blame: when postmortems become an exercise in finding the person or team to blame, they quickly turn into an ugly spectacle of defensiveness and finger pointing. That rules out the trusting, open atmosphere needed to review the events objectively and map (together) what can be learned and improved.
Red tape: when postmortems are perceived as a box-ticking exercise, done without any intent of actually learning, people will avoid them or do the bare minimum. By the way, it makes no difference whether this is perception or reality; both lead to the same negative outcome.
Ineffective postmortems: when a postmortem stays on the surface of the issue and its immediate fix, without drilling into the root cause, how the issue could have been avoided or how its handling could have been improved, and no action items are taken or pursued, it’s as if the postmortem did not happen at all. Or even worse, since it wasted the participants’ time…
Lack of visibility: when a postmortem falls in a forest and no one hears it, did it actually occur? Not really. If it stays only in the minds of those who participated, neither documented nor shared anywhere, the knowledge is lost, and the issue or root cause is likely to occur again (and again).

Keep Calm and …

Looking at all these killers, I guess most of them can develop in the multi-participant meeting we had in mind for filling out the template. When we reviewed the process with the teams, we understood that they had been conducting postmortems all along, but these could take the shape of a team sync at the next daily meeting, a corridor chat, or even happen in the mind of the person who fixed the issue.

Then came the “Aha!” moment: we realized that any of these postmortem forms is good and achieves the original goal, as long as it’s:

Visible — for others to read and comment
Effective — i.e. maps the actual root cause(s) and action items that can help avoid recurring incidents or shorten the time to fix

So rather than pushing more enforcement of the original procedure, which would probably have had a short-lived impact and caused long-lived frustration, we decided to make postmortems easier, more accessible and better adapted to the actual day-to-day of our developers.

Focusing on the “what” rather than the “how”

We decided to focus on making postmortems more visible and effective, and to care less about the format or form in which they were actually conducted.

Automation Automation Automation

Whenever a production issue occurs (e.g., rollback, downtime, user complaint), an incident Jira ticket is automatically opened with all known incident details. The incident can only be closed, though, after a postmortem has been conducted and the ticket updated with it.

Conduct them however you choose to

Here was the main change: how you conduct the postmortem is entirely your choice. You can decide if you’d like to hold a big meeting, have a few 1:1 chats, or put your own thoughts on paper.

You have to fill in basic fields on the incident Jira ticket, such as:

Description: high-level trail of events
Detected by: testing / monitoring / Wix employees / user complaints
Resolved by: rollback / restart / fix / miracle ;) / other
Duration: time to detection, time to resolution
Impact
Root cause
Postmortem: lessons learned & action items
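To make the "fill the fields, then close" rule a bit more concrete, here is a minimal Python sketch of how such a gate could look. It is not our actual Jira automation: the `IncidentTicket` class, its field names and the "every field must be filled" rule are assumptions made up for illustration, loosely following the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentTicket:
    """Hypothetical model of the incident ticket; field names are illustrative."""
    description: str = ""                         # high-level trail of events
    detected_by: str = ""                         # e.g. "monitoring"
    resolved_by: str = ""                         # e.g. "rollback"
    time_to_detection_min: Optional[int] = None   # duration, in minutes
    time_to_resolution_min: Optional[int] = None  # duration, in minutes
    impact: str = ""
    root_cause: str = ""
    postmortem: str = ""                          # lessons learned & action items

    def missing_fields(self) -> List[str]:
        """Names of the fields that are still empty."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in ("", None)]

    def can_close(self) -> bool:
        """Assumed rule: the ticket may be closed only once every field is filled."""
        return not self.missing_fields()
```

With a half-filled ticket, `can_close()` returns `False` and `missing_fields()` tells the owner what is still needed, which keeps the gate helpful rather than merely blocking.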

Once the ticket is closed, the postmortem is automatically published, notifying all subscribers and the postmortem angel on call (more on angels in a bit).

The only exceptions are high-impact incidents, for which we expect the full procedure to take place. We’ve set thresholds on impact and on the number of teams involved to identify these “high-impact incidents”, and they are a fairly small fraction of all incidents.
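As a rough illustration of such a threshold check, here is a small Python sketch. The threshold values, the choice to measure impact in impacted users, and the "either threshold is enough" rule are all assumptions for the example, not our real numbers or logic.

```python
from dataclasses import dataclass

# Placeholder thresholds for illustration only; not real values.
IMPACTED_USERS_THRESHOLD = 10_000
TEAMS_INVOLVED_THRESHOLD = 3


@dataclass
class IncidentSummary:
    impacted_users: int   # assumed way of measuring impact
    teams_involved: int


def needs_full_procedure(incident: IncidentSummary) -> bool:
    """Assumed rule: crossing either threshold makes the incident high impact."""
    return (incident.impacted_users >= IMPACTED_USERS_THRESHOLD
            or incident.teams_involved >= TEAMS_INVOLVED_THRESHOLD)
```

Everything below both thresholds stays on the lightweight path, where the team picks its own postmortem format.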

Visibility Visibility Visibility

All postmortems are published on a central online board accessible to everyone. Everyone can read and comment.

Interesting outages and high-impact postmortems are featured in our weekly company newsletter.

In addition, stats on the number of postmortems conducted vs. open incidents are tracked and published weekly.
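For a feel of what such weekly stats could look like, here is a minimal Python sketch. The `IncidentRecord` shape and the "coverage" ratio are assumptions for illustration; the real numbers would come from the incident tickets themselves.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class IncidentRecord:
    """Hypothetical weekly export row for one incident."""
    key: str
    is_open: bool
    has_postmortem: bool


def weekly_stats(incidents: Iterable[IncidentRecord]) -> Dict[str, float]:
    """Count incidents, postmortems and still-open incidents for the week."""
    records = list(incidents)
    total = len(records)
    with_postmortem = sum(1 for r in records if r.has_postmortem)
    still_open = sum(1 for r in records if r.is_open)
    # Share of incidents that already have a postmortem; 0.0 for an empty week.
    coverage = with_postmortem / total if total else 0.0
    return {
        "incidents": total,
        "postmortems": with_postmortem,
        "open_incidents": still_open,
        "postmortem_coverage": coverage,
    }
```

Publishing the coverage ratio alongside the raw counts makes the trend easy to follow from week to week.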

Blameless

Creating a blameless culture, just like building trust, is an evolution rather than a revolution, and takes time and care.

What helped make rather than break blameless postmortems was reiterating the following assumption:

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” — Norm Kerth, Project Retrospectives: A Handbook for Team Review

By reiterating, I mean we repeated it in our company-wide training, as well as in the postmortems themselves.

To make sure this mantra is not just lip service and becomes the reality, we appointed a postmortem facilitator (mainly for large meetings), who runs the meeting and makes sure it does not drift into casting blame or finger pointing.

Another useful practice came from teams who agreed on a common team signal (e.g. waving or shouting “hot potato”): when anyone feels the conversation is taking on a blaming tone, they can use it to help steer the conversation back to its blame-free, constructive mode.

But most importantly, in my mind, this spirit has to be applied top-down in addition to bottom-up, i.e. leadership has to walk the walk and openly talk about their own mistakes, and congratulate teams for taking ownership of a mistake or issue and its mitigation rather than condemn teams or individuals for mistakes.

After a while of encouraging blamelessness and discouraging blame, trust started to build and people became more willing and happy to share and discuss issues.

Postmortem Angels

Although postmortems are published on a central board, the number of postmortems was becoming too large for one or a few people to review.

Therefore we set up a special squad of postmortem angels. These are developers who are enthusiastic about production and who, in addition to their day job, take part in an on-call rotation for reviewing newly published postmortems, commenting or having a chat with the relevant team when they have insights, questions or comments on a specific postmortem.

In many cases, postmortem angels share info across teams on similar issues found and how to tackle them effectively. You can also invite the postmortem angel to your postmortem — so they can assist in real time.

In addition, all postmortem angels meet on a bi-weekly basis to share interesting postmortems and map common issues and company-wide mitigations.

Common issues could be:

Lack of tools for troubleshooting / monitoring specific areas
Recurring issues with external or common internal dependencies
Lack of knowledge of specific infra setup / behavior

And the mitigation will typically be either a push for a fix or enhancement of our tools, documentation and infrastructure, or training and sharing sessions we conduct across the relevant guilds.
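Spotting common issues across many postmortems is mostly a matter of people reading and talking, but a simple count can help point at candidates. The Python sketch below assumes each postmortem carries a few free-form root-cause tags; the tags and the `recurring_issues` helper are hypothetical, not part of our actual tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_issues(tags_per_postmortem: Iterable[Iterable[str]],
                     min_count: int = 2) -> List[Tuple[str, int]]:
    """Return tags that appear in at least `min_count` postmortems, most common first."""
    counts: Counter = Counter()
    for tags in tags_per_postmortem:
        counts.update(set(tags))  # count each tag at most once per postmortem
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Hypothetical tags from four postmortems.
postmortems = [
    ["db-config", "monitoring-gap"],
    ["db-config"],
    ["third-party-memory-leak"],
    ["db-config", "db-config"],
]
print(recurring_issues(postmortems))  # [('db-config', 3)]
```

A tag that keeps coming back is a hint for the bi-weekly meeting, not a verdict; the angels still decide what deserves a company-wide mitigation.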

Here are a few interesting recent examples mapped by postmortem angels:

A memory leak in one of our 3rd-party dependencies
Recurring misconfiguration of our DB clusters
Lack of knowledge / documentation on taking memory dumps / thread dumps on k8s

The angels are not officers; they are facilitators who help you continually improve the quality and efficiency of your postmortems and help spread common lessons learned across the company. They do not enforce the implementation of the derived action items; that is the responsibility of each team.

We found that teams do tend to follow up on their own postmortem action items. The reasoning behind it is quite simple: when lessons learned are not applied, incidents and downtime recur and come back to hit the team and its quality of sleep :), which tends to be the best motivation for improving resilience.

What Happened as a Result

It took a couple of months for the new automation to kick in and for the postmortem angels squad to take shape and start operating efficiently and collaboratively, after which we soon started seeing very positive results:

Significantly more postmortems were conducted, and the postmortem-to-incident ratio keeps rising.
Postmortem angels’ activity commenting on postmortems initially grew, and then (positively) lessened as teams learned to perform more effective postmortems.
Common issues are highlighted and addressed more effectively.

And most importantly:

Teams report a decrease in recurring issues, and hence higher resilience, which is what we were all looking to achieve in the first place.

What we’ve learned

For me, this was another reassurance of a leadership-life lesson I truly believe in: in order to achieve a common goal across the organization we (as leaders) need to enable rather than dictate. Hence:

We need to focus on the “what” rather than the “how” and leave the “how” to the responsibility of each of our talented teams.
We need to provide facilitation in the form of tools, training, people and automation that make the process effective, yet seamless and easy, so that teams can focus on growing their knowledge and improving product resilience rather than on following a resilience procedure.

What’s next

Although the results have been very promising, we keep scaling (in people and products) and the challenge grows. We need to remain on our toes to make sure we don’t stagnate and that postmortems keep happening, and happening effectively.

Together with our postmortem angels we’re currently drilling into our troubleshooting practices, and looking for ways to improve knowledge and automate more of our troubleshooting & recovery (rollbacks, auto-suggested mitigation and automatic mitigation).

So I don’t know what’s coming next yet, but following my new year’s resolution, I will definitely keep you all posted.

Want to share thoughts on postmortems, learn more about Wix’s unique culture, or join in on the fun, challenge & growth?

Feel free to reach out to me on avivap@Linkedin