We had a minor outage recently: we made a change to our systems and started serving errors to customers. We rolled back the change within minutes, and all was well again. This kind of thing happens at every company, but not every company is able to improve and learn from these situations.
After every outage, we write a blameless post-mortem to try to learn from our mistakes. It would be easy to slap a band-aid on whatever broke and move on, but we want to be more thorough. What exactly happened? Why did things go wrong? How do we learn from this and prevent the problem from happening again?
In this instance, I also took the opportunity to do a refresher on what “blameless post-mortem” means. Here’s a lightly edited version of what I told the team.
People are rarely the cause
As a rule, people are not the cause of an outage. The fault lies in the systems and software that should have done something reasonable but didn’t. Most outages are triggered by a change in the system, so there’s usually going to be a human pushing a button that sets things in motion. That makes the person the trigger, not the cause.
We could just say “the reason this happened is that Dave pushed the button.” Instead, try asking:
- Why did Dave push that button? Presumably it seemed like a good idea at the time. Why was that?
- Why did pushing the button do something unsafe? The software could have said no, or done something reasonable autonomously.
- Given that the software did something unsafe, how could we have detected that more quickly? How could we have recovered more quickly?
There are a number of ways to get at the underlying systemic causes of outages, for instance the Five Whys method pioneered by Toyota, or the fault tree analysis popular in traditional engineering fields like aerospace.
Three key questions to ask
Personally, I’m a fan of a simple three-question prompt that Google’s post-mortem template used:
- What went right? Document processes that worked as designed, safety systems that did their job, and so on. In post-mortems, this section is usually short, but it’s a chance to document the software and processes that are giving you good value during incident response.
- What went wrong? Why are we writing this post-mortem? Each bullet here should typically translate to an action item that’ll get prioritized against the other things the company’s doing.
- Where did we get lucky? This section is to get people thinking about what didn’t happen during the outage, but by luck rather than by design. Being lucky is great, but by definition, you can’t rely on being lucky every time. Did your domain expert happen to be around, and that’s how you mitigated quickly? Great, lucky us! In the future, how do we spread expertise around more to avoid that single point of failure?
All these things are tools, not algorithms to follow blindly. Think of these as ways to get the conversation started, just like brainstorming is a tool to get people in a creative frame of mind. And regardless of the specific tools you use, aim to push past “a person did a thing” and get at what in their environment made that action seem reasonable.
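If it helps to make the three-question prompt concrete, here is a minimal sketch of it as a data structure. Everything in it is hypothetical: the names `Finding`, `PostMortem`, and `unactioned`, the `action_items` field, and the example entries are made up for illustration and are not Google's template or any real tool. The one rule it encodes is the one above, that each "what went wrong" bullet should typically end up with an action item.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One bullet in a post-mortem section (hypothetical structure)."""
    summary: str
    # Assumed to be ticket IDs from whatever tracker the team uses.
    action_items: list[str] = field(default_factory=list)


@dataclass
class PostMortem:
    """The three-question prompt: went right, went wrong, got lucky."""
    title: str
    went_right: list[Finding] = field(default_factory=list)
    went_wrong: list[Finding] = field(default_factory=list)
    got_lucky: list[Finding] = field(default_factory=list)

    def unactioned(self) -> list[Finding]:
        """Return the 'went wrong' findings that have no action item yet."""
        return [f for f in self.went_wrong if not f.action_items]


# Illustrative entries only; they do not describe a real incident.
pm = PostMortem(
    title="Example outage",
    went_right=[Finding("Rollback completed within minutes")],
    went_wrong=[
        Finding("The change was applied without a safety check", ["TICKET-1"]),
        Finding("Errors were noticed later than they could have been"),
    ],
    got_lucky=[Finding("The domain expert happened to be around")],
)

for finding in pm.unactioned():
    print("Needs an action item:", finding.summary)
```

Note that there is deliberately no "who" field anywhere in the sketch: findings describe systems and processes, not people.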
Why does all this matter?
Blameless post-mortems matter for a couple of reasons. The first is obvious: psychological safety. Being blamed for outages creates a crappy working environment, and people are going to look for another job.
More self-interestedly for the company, assigning blame for outages leads people to cover their asses. When that happens, the post-mortem ends up incomplete or outright incorrect, because people are not volunteering all the information about what happened, and that leads to drawing the wrong conclusions. Blaming people for outages makes the company worse at doing what it does, in addition to making it a bad place to be for employees.
And on a more personal note: regardless of these words about it not being a person’s fault that outages happen, I know firsthand that it feels really bad when you’re the one who pushes the button and sets things off. If you’re anything like me, you get a fight-or-flight adrenaline dump and generally have a very bad time.
So, when outages do happen, please look out for each other and, if you think it’s necessary, reinforce in that moment that it’s not the fault of whoever pushed the button. It may feel silly to do, because “obviously we all know we don’t blame people for this sort of thing”, but humans aren’t linear creatures that you can program once with information and forget about. A timely reminder will do absolute wonders for that person’s mood and feeling of safety.