Blameless Incident Reviews at Udemy

Joan O'Callaghan
Published in Udemy Tech Blog
9 min read · Mar 23, 2021


Here at Udemy I run the Incident Review process.

This basically means that whenever something breaks in engineering and annoys people enough, we write an incident report about what happened. We then discuss the report’s findings at a meeting. The report follows a standard format to ensure we’re consistent in the data we gather about the breakage.

The goal of this exercise is not to assign blame.

The goal is to find out what happened.

You might think they sound like the same thing, but that is not the case.

Take the following three example approaches:

Approach 1 — Very Blamey

“Lee did a dumb thing to the config file, and that’s why the site went down. We’re going to fire them and now nothing will break any more. This totally won’t frighten other people from coming forward next time.”

Approach 2 — Not Blamey, Not Useful

“The config file was mysteriously changed in a suboptimal way, and that’s why the site went down. We don’t really know what happened. We also have no way to prevent this from happening again as we don’t know why/how it got changed. We’re not going to take any follow up actions.”

Approach 3 — Blameless and Useful

“The SRE team updated the config file in order to facilitate project X but didn’t have a test to prove that the change was safe. Previous changes like this had been done without problems but this event didn’t follow that pattern. We’re adding extra logging to find other weird scenarios like this. In addition, we’ll have automated tests for any changes to this configuration file.”

This last approach — the Blameless way — is what we practise at Udemy.
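Approach 3 ends with a concrete promise: automated tests for any change to that configuration file. As a minimal sketch of what such a check might look like in CI, here is a validator for a hypothetical JSON config (the file shape, key names, and rules are all invented for illustration, not Udemy’s actual setup):

```python
import json

# Hypothetical sanity check, run in CI before any config change merges.
# The required keys and rules are invented for illustration.
REQUIRED_KEYS = {"max_connections", "timeout_seconds", "feature_flags"}

def validate_config(raw: str) -> list:
    """Return a list of problems; an empty list means the change looks safe."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["config is not valid JSON: %s" % exc]
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append("missing required keys: %s" % sorted(missing))
    if not isinstance(config.get("max_connections"), int) or config["max_connections"] <= 0:
        problems.append("max_connections must be a positive integer")
    return problems

# A safe change passes; a broken one is caught before deploy.
print(validate_config('{"max_connections": 100, "timeout_seconds": 30, "feature_flags": {}}'))  # []
```

The point is not the specific checks — it’s that “the change was safe” becomes something a machine asserts before deploy, instead of something an engineer hopes.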

Why are we doing it this way?

At a micro level, in order to solve the problem, we need to know what happened. Then we can make changes and prepare better for the next incident. Otherwise, the problem may happen again.

To know what happened, people have to feel safe telling you what happened.

They should not fear they will lose their jobs.

Let me say that again.

If you make a mistake at work, you should not be so worried about being fired that you do not tell anyone what you did. You should not be so afraid that you lie about what happened during an incident.

A mistake is an accidental error. A mistake is not made on purpose. Bad intent was not present. If you made that mistake, someone else could too. This information is vital for improvement and correction.

At a macro level, Engineering will be plagued by mysteries and repeat issues if staff are not able to investigate problems properly. Imagine if I said:

“I’m giving you some top class engineers. Half of them will lie 5% of the time because they are scared, and 5% of them will spend 10% of their time trying to find the liars and fire them.”

This is not a recipe for success and growth.

It is negligent of a company’s Engineering Management if they do not foster an environment where people can share mistakes and vulnerabilities. All you end up with is secrets, lies, more problems, and a very toxic workplace.

There is no such thing as “Human Error”

We cannot create a successful incident report process without understanding the following and accepting it as a truth.

In the context of engineering incidents — there is no such thing as human error.

One of the best ways I’ve heard it described is as follows:

“‘Human Error’ is a label which causes us to stop investigating at precisely the moment we’re about to discover something interesting about our system.” — Nick Stenning

Let’s take the example of Lee the Engineer again and some buttons to help explain these points more.

We have a hypothetical system.

There’s a Button A, and a Button B.

If you press Button A, you must press Button B afterwards.

If you don’t do it in that order, then the website will explode.

If you only press Button B, the website will explode.

If you only press Button A, the website will explode.

Got it? Right. Back to the concept of human error.

Scenario 1

Lee presses Button A or Button B only. Lee didn’t actually know what they were doing and pressed them by mistake. The website explodes.

Blameful diagnosis: Human error.

Follow up action: Reprimand Lee.

VS

Blameless diagnosis: Lee shouldn’t have been allowed to get near the buttons. It’s obviously not their area of expertise or knowledge. The issue was a lack of appropriate access control at the system level. This could have happened to any engineer that was in that area while working.

Follow up action: Put access rules in place to restrict the ability to press the buttons to only qualified staff.

Scenario 2

Lee presses Button B, then Button A only. Lee knew you had to press the two of them, but didn’t know that the order mattered. The website explodes.

Blameful diagnosis: Human error.

Follow up action: Reprimand Lee.

VS

Blameless diagnosis: The engineer had not been trained properly and didn’t have appropriate documentation to help. The issue was a lack of appropriate resources for this engineer. We have not set them up to succeed. If all the other more experienced engineers knew what to do, and had never made a mistake, then that’s not expertise, that’s Tribal Knowledge, and without an improvement in documentation and training this engineering org will never scale properly.

Follow up action: Improve training and documentation.

Scenario 3

Button A must be pressed, then Button B, then Button A for 3 seconds, then Button B for 5 seconds, then both buttons for 3 seconds. If this exact sequence is not followed, the website will explode. Timing sequence for the first two button presses is unknown. We don’t think it matters, but this isn’t something we can easily test.

Lee mistimes one of the button presses. The website explodes.

Blameful diagnosis: Human error.

Follow up action: More detailed documentation and training.

VS

Blameless diagnosis: The required process is terrible and difficult for humans to do. Consequently it is error prone. This is not the first time this problem has happened. Tooling should be added to help with this. Otherwise, again, we are not setting up our staff for success. Anyone who can currently follow this process repeatedly without causing problems has probably just been doing it this awful way for a long time. This doesn’t mean they’re a better engineer, it just means they’re conditioned. It’s also a bad reason to leave it as it is, or to shame the person who has just been introduced to this nightmare process.

Follow up action: Improve the process, remove the ambiguity, automate/add tooling where possible.
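The “automate/add tooling” remediation is the decisive one: once the fragile sequence lives in code, nobody has to get the order or the timing right by hand. A minimal sketch, with an invented press callback standing in for the real buttons:

```python
# Hypothetical tooling that encodes the fragile sequence from Scenario 3,
# so no human has to remember the order or the hold timings.
SEQUENCE = [
    ("A", 0),   # press Button A
    ("B", 0),   # then press Button B
    ("A", 3),   # then hold Button A for 3 seconds
    ("B", 5),   # then hold Button B for 5 seconds
    ("AB", 3),  # then hold both buttons for 3 seconds
]

def run_sequence(press, hold_scale=1.0):
    """Drive press(buttons, seconds) through the whole sequence.

    hold_scale lets tests run instantly instead of sleeping for real.
    """
    for buttons, seconds in SEQUENCE:
        press(buttons, seconds * hold_scale)

# In a test, record the calls instead of touching real hardware.
calls = []
run_sequence(lambda buttons, seconds: calls.append((buttons, seconds)), hold_scale=0)
assert [b for b, _ in calls] == ["A", "B", "A", "B", "AB"]
```

The order and timings now live in one reviewed, testable place — and the unknown timing between the first two presses is explicitly visible, ready to be corrected once someone figures it out.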

Hold on to that notion that there is no such thing as human error during engineering incidents.

Incidents are full of mystery and stress.

Next — remember that during an incident, all hell is breaking loose. Decisions are hard, visibility is poor and stress is very high. Red herrings and distractions are everywhere.

It’s ridiculously easy to be judgemental in hindsight while looking at the report and the outcome, because now you know all the answers. You can see the mistakes, the distractions, the missed opportunities. (It’s like watching a horror film and you already know the phone call is coming from inside the house…)

We only know those answers because that team went through hell to get them and wrote it all down for you. The information you are casually using to second-guess them probably took a week to gather after the incident. During the incident itself, no one had that knowledge or context. Just because facts exist, doesn’t mean they’re obviously relevant. During incidents, there are lots of facts. Often the team is drowning in the blasted things. They’ll only know which ones they were “ignoring” afterwards.

Do not judge them on the basis of the outcome. Remember — that was the one piece of information that wasn’t available to the person making the decision.

How to Implement Blameless Incident Reviews

This is the process/framework I recommend for propagating Blameless Incident Reviews.

1. Exec and Management Buy-in

Get buy-in from Engineering Management and Execs. Explain the benefits. Explain that this doesn’t mean there is no accountability, rather there is fear-free accountability combined with actual improvements. They’ll nod and agree but then you have to get them to commit to it in their actions. You may have to coach them if they are unfamiliar with the principles. This means during incidents they do not start shouting at people or chucking blame around from a great height. They must lead by example or at least keep the ranting to themselves in a nice soundproofed meeting room. Happily, they’ll see blameless incident reviews quickly provide accountability + improvements + engineering harmony. There is literally no downside to the process for them.

2. Keep everyone calm during Incidents

During incidents/mistakes/errors — lead by example and divert/mute anyone who’s becoming antagonistic. Get help from management if necessary. If it’s management that’s being antagonistic, get help from the execs.

3. Preview and vet the Incident Report

The next step of the Incident is the written report. Generally it will be written by the team involved in the incident, and then shared or presented to a panel of peers. Prior to being shared, the report must be vetted by someone¹ for blameful language, which, if found, must be removed. In fact, there’s rarely a need for individual names to ever be in an incident report. The team name alone often suffices. Remove blameful adverbs — e.g., “carelessly, hastily, inadequately.” Ask “How?” instead of “Why?” Focus on what was done well during the incident, or new information/blind spots that were revealed as a result of the incident. Value this new information as a hard-won prize and use it to make improvements².

¹ That “someone” should be unbiased and trained in what to look for.

² The improvements should follow the SMART criteria.
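Part of that vetting can even be automated as a first pass before the human reviewer reads the draft. A rough sketch — the word list is a seed for illustration, not Udemy’s actual tooling, and the human vetter still makes the final call:

```python
import re

# Adverbs that tend to smuggle blame into a report. A seed list only;
# a trained human reviewer still reads the whole draft afterwards.
BLAMEFUL_WORDS = {"carelessly", "hastily", "inadequately", "negligently"}

def flag_blameful_language(report):
    """Return (line_number, word) pairs for the vetter to rewrite."""
    flags = []
    for lineno, line in enumerate(report.splitlines(), start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word in BLAMEFUL_WORDS:
                flags.append((lineno, word))
    return flags

draft = """The SRE team updated the config file.
The change was hastily deployed without a test."""
print(flag_blameful_language(draft))  # [(2, 'hastily')]
```

A flag doesn’t mean the sentence is forbidden — it means the sentence deserves a second look before the report goes in front of the panel.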

4. An Unbiased and Empathic Peer Review

The next step of the Incident Review is the presentation of the report to a panel of peers. The meeting should be chaired by a person without bias. (Even if they have bias, they should be able to fake neutrality like a champ. We are human beings after all.) Empathy for the team involved in the incident is crucial. Attendees should leave the meeting feeling sympathetic for the teams involved and satisfied that we’re doing enough remediation items to ensure it’ll be better next time. The Chair of the meeting should either present the report for the involved parties, or else referee the meeting. They should not allow a blame fest or attacks on individuals or teams to occur. This does not mean they are hiding facts or glossing over issues. They are trying to get us to a better future state and preserve good working relationships between engineering teams at the same time.

5. Share it

Even when you have a well established Blameless Incident Review culture, you will still have engineers that are terrified to come forward. They might be new, or they might simply never have been involved in an incident before. They might have worked at a blameful and toxic company beforehand. The thought of writing a report and attending a review scares the life out of them. For those engineers, I explain the process to them in advance, I reassure them that they’re not going to be on trial, and show them the many incident reports that I myself caused. This generally eases their fears enough to complete the process and realise that they have survived the ordeal and still have their job and possibly this process isn’t so bad after all.

However, if we only discuss this during incidents and only with people involved in incidents, the culture will never propagate sufficiently. This is why I also send a monthly summary of all the incidents to every engineer in the company (aka the CliffsNotes for our incidents). This gives them the core information without them having to attend the review meetings, or wade through the sometimes lengthy reports.

This should inform them of our issues, our areas of vulnerability and how we’re using all of this information to grow and improve. (It should also reinforce the fact that despite me sending out details of 11 incidents in 1 month, there weren’t any correlating witch-hunts or firings).

Conclusion

As long as there is growth, and change, we will never be “Incident free.” Things break. This is normal. We find out what happened, and then we try to make it better for the next time.

Using your Incident review process to attack people that may have triggered the incident is the most short sighted approach possible. No one plans to make mistakes at work. If it happened to them, it could happen to anyone. That problem was always there, this person just happened to find it.

Treat every incident as an empathetic fact-finding mission instead, and the resulting blameless incident reports will allow us to grow and actually improve.

We all want the places we work to be better. More resilient, more automated, more scalable, more everything. We cannot get better without examining our flaws, and that can only be done by being honest with ourselves and making it a safe place to do so.
