Enterprise DevOps & SRE: Blameless Postmortems and Blameless Culture

Sonny Dewfall
Published in The Pinch
Nov 26, 2021

The reason we need a blameless culture in our technology organizations is the same reason we need testing, monitoring, and a support organization — failure.

Failure is inevitable in any technology context. As John Allspaw writes, “Failure cares not about the architecture designs you labor over, the code you write and review, or the alerts and metrics you meticulously pore through.” Failure is a natural part of building and operating systems, especially complex systems.

Our tools and applications are built with this in mind (if they are built well). We have performance, functional and security testing tools. We have resilient architectural patterns. We have self-healing infrastructure. We have tools to scan for issues in Production. Technologically we assume failure will happen at every turn.

A blameless culture is a way of bringing this approach to the human side of the equation. One of the key tenets of SRE is accepting failure, not just in your code but in your processes and mindset. Blameless culture is the implementation of this principle.

Blaming Culture

It is human nature to look for an individual to blame when something goes wrong. Blaming human error is a way to preserve one’s assumptions about how a system works: it puts culpability on an individual rather than dealing with group or systemic faults that are potentially more complex. Once it becomes systemic, blaming has a negative effect on an organization.

Once blaming individuals becomes the norm (a culture of blaming), the group can respond with a set of negative behaviors:

Hiding mistakes — in an environment where mistakes are punished there is no incentive to draw attention to issues. Instead, individuals may try to hide their mistakes by fixing things themselves or simply hoping no one notices — often making things worse in the process.

No experimentation — if the cost of failure is being held personally responsible for your mistakes, then innovation becomes significantly less attractive. The risk is simply too high. Blaming individuals stifles creativity and can mean you are not getting the full value from your team’s inventive problem solvers.

Less collaboration — a culture of blaming will create a sense of paranoia when dealing with inter-team or intra-team dependencies. The enmity built up through scapegoating and playing the blame game stifles important bonds that increase the effectiveness of teams.

Poor motivation — above all, working in a team with a blaming culture is unpleasant. Fear of failure, or of the consequences of failure, creates a bad working environment. With poor motivation, the quality of work will suffer and ultimately you will lose talent to attrition.

Let us know in the comments if you see the impact of a culture of blaming in your squad. Once these behaviors become entrenched it can become hard to get rid of them.

So, what’s the alternative? And what benefit is there in it for your team?

Blameless Culture

Etsy were one of the first proponents of a blameless culture in their IT organization (see the John Allspaw article I quoted above). The way this culture is realized in practice is through blameless postmortems, something that also gets a chapter in the Google SRE book. Incident postmortems are the most common place for blame to be assigned, so they are a natural starting point for changing the whole culture of blame.

But what culture are we trying to bring in? Let’s look at some of the key principles of a blameless culture:

Assume Good Intentions

The first thing to bear in mind is that people are doing their best. It’s tempting to assume that someone has made an error because of a personal failing such as laziness or inattentiveness. This is known as the fundamental attribution error, and it is probably preventing you from seeing the true failing, either in the circumstances or in the system. This leads us on to…

See the Second Story

Seeing the second story is sometimes known as “solving the problem twice”. Once you have overcome your instinct to blame “human error” you can look for less obvious causes within the process or technology, areas where it is much easier to mitigate recurrences of the issue.

No Hiding Mistakes

Removing the stigma of making an error turns mistakes into learning experiences. In the context of postmortems this can mean reviewing an incident that had a serious cause but, through luck or timely response, did not have a serious impact.

Celebrate Good Practice

It is in our nature to focus on negative events but reviewing incident responses that went well is just as important as looking at those that went badly. Celebrating the times when your team’s response prevented an incident from impacting customers drives a culture of positivity and allows you to entrench correct behavior.

Accept Failure

The concept that underpins all of the above points is accepting failure. Once failure is seen as an expected part of operations, the process of responding to failure can be perfected rather than ignored.

Hopefully the above gives an outline of a blameless culture. There are certainly elements of this culture evident in teams we have dealt with. Does this culture resonate with your team? Let us know in the comments.

Cultural change is notoriously difficult to effect in an organization, however. The most important factor is leadership buy-in: change is driven from the top. Another important factor is building habits that entrench cultural change, and postmortems are a great place to build those habits.

Blameless Postmortems

An incident postmortem is the formal review of an incident response, normally in the form of both a meeting and a write-up.

The SRE coaching team have put together a set of coaching resources on introducing and running a blameless postmortem. We are currently testing this material with some early adopter teams. As part of our coaching, we run two sessions — an introduction that gives an overview of blameless culture and a dry run postmortem to practice those ideas.

Here are some tips for how to bring blameless culture into your postmortems:

Structure — when to have a postmortem should be something you define as policy rather than decide situationally. Make it clear what will trigger a postmortem, who will be present and what the structure of the session will be. Making postmortems a defined part of your ways of working takes the stigma away from the event.

Start with a positive — a good way to get things off on the right foot is to begin by celebrating something that the team did well. Even if this is just praising the hard work and long hours that the team put in, starting with a positive will set the tone of the meeting.

Blameless language — the language you and your team use in a postmortem sets the tone, so avoid using names and try to stick to “I” statements, e.g. “I didn’t get the data I needed” rather than “Dave didn’t deliver the correct data”.

Publish and promote — once you’ve completed and written up your postmortem try to get it reviewed widely and then publish it in a public forum. Getting the output of your postmortems in front of a wide audience is the best way to embed blameless culture across the organization.
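One way to make the "Structure" tip concrete is to write the trigger policy down as code, so that whether a postmortem happens is decided by agreed rules rather than in the heat of the moment. The Python sketch below is only an illustration: the `Incident` fields, the severity scale and every threshold are assumptions made up for this example, not part of our coaching material, and your team should agree its own values.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    severity: int                 # assumed scale: 1 = most severe, 4 = least severe
    customer_impact_minutes: int  # how long customers were affected
    data_loss: bool = False
    near_miss: bool = False       # serious cause, but little or no impact


@dataclass(frozen=True)
class PostmortemPolicy:
    """Assumed thresholds; agree real values with your own team."""
    max_severity: int = 2         # severity 1 or 2 always triggers a review
    max_impact_minutes: int = 30  # longer customer impact triggers a review
    review_near_misses: bool = True


def postmortem_triggers(incident: Incident, policy: PostmortemPolicy) -> list[str]:
    """Return the policy reasons that apply; an empty list means no postmortem."""
    reasons = []
    if incident.severity <= policy.max_severity:
        reasons.append("severity")
    if incident.customer_impact_minutes > policy.max_impact_minutes:
        reasons.append("customer impact")
    if incident.data_loss:
        reasons.append("data loss")
    if policy.review_near_misses and incident.near_miss:
        reasons.append("near miss")
    return reasons


policy = PostmortemPolicy()
outage = Incident(severity=1, customer_impact_minutes=45)
lucky_escape = Incident(severity=4, customer_impact_minutes=0, near_miss=True)
minor = Incident(severity=4, customer_impact_minutes=5)

print(postmortem_triggers(outage, policy))        # ['severity', 'customer impact']
print(postmortem_triggers(lucky_escape, policy))  # ['near miss']
print(postmortem_triggers(minor, policy))         # []
```

Note that the rules only look at facts about the incident, never at who was involved, and that near misses are included by default so that incidents with a serious cause but no serious impact still get reviewed.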

Our coaching goes into a lot more detail on how and why you should run a blameless postmortem but hopefully this post has given you some food for thought. Implementing blameless postmortems gives you a foothold in blameless culture, a key part of the Site Reliability Engineering mindset.

We are still looking for teams to run these sessions with so please reach out if you feel this would be useful for your team.

Articles and comments are my own views and do not represent the views of my employer, Accenture.

Sonny Dewfall is an SRE, DevOps and Quality Engineering specialist at Accenture.