How to Survive Production Incidents in a Company with No-blame Culture

Olga Beregnaya
Published in Wix Engineering
6 min read · Jan 13, 2021

Every so often we all face a truly terrifying creature: a production incident. I am not referring, of course, to industries like pharma or aircraft manufacturing, where a software bug can cost human lives. Yet even for web services, minutes of downtime can hurt users’ businesses, ruin a company’s reputation, or damage people’s careers.

A Production Incident is a massive system failure accompanied by significant damage to the consumers of the product.

Examples of production incidents: system downtime, payment transactions not going through, user data corruption, etc.

Does this mean that once a production incident takes place, it’s always someone’s fault? Was there a QA engineer who didn’t test thoroughly enough? Was there a developer who implemented something at the last moment, or a product manager who decided to release a particular feature too early?

Certain companies have a policy of looking for someone to blame when bad things happen and, once they find them, punishing them as a warning to others. Can an approach like that motivate people to work better? I don’t think so. Fear of making a mistake and of being publicly shamed only kills creativity and trains people to avoid hard decisions and thoughtlessly follow the rules.

If the penalty tactic doesn’t work, should we just turn a blind eye to production incidents? Nobody can avoid making mistakes, so why not treat them as part of the working process? That might sound logical, except for one small detail: a production incident is not a trivial bug. Having a production incident means there is a major issue with the system, the processes, or the communication. To ignore such an issue would be to consciously let it happen again in the future, which, in my opinion, should be unacceptable for any successful company.

What might help here is promoting a company culture that encourages people to have open discussions about any type of failure while also teaching them about personal responsibility. This also helps create a safe environment for employees, one where they have room for error.

This is referred to as a ‘no-blame culture’. I’m lucky to work in a company with exactly this culture, Wix, and I want to share with you how it looks from the inside.

Let’s consider the phases of an incident (a small code sketch of how they might be tracked follows the list):

  1. Incident discovery (system alert, user complaints, self-identification)
  2. Data collection (find what doesn’t work, who is affected, steps to reproduce, etc.)
  3. Criticality assessment (which part of the business is affected, which types of users, what the possible aftermath is)
  4. Actionable steps to recover the system (fix, rollback, switching traffic, etc.)
  5. Minimization of damage done to users (run a script to restore user data, contact users if needed)
  6. Root cause analysis & discussion
  7. Postmortem write-up
  8. Work on action items to improve the system
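
To make these phases a bit more concrete, here is a minimal sketch of how an incident record might be tracked as it moves through them. The phase names and fields are illustrative assumptions only, not a description of any tooling Wix actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum, auto


class Phase(Enum):
    DISCOVERY = auto()           # system alert, user complaint, self-identification
    DATA_COLLECTION = auto()     # what is broken, who is affected, how to reproduce
    ASSESSMENT = auto()          # business impact, user types, possible aftermath
    RECOVERY = auto()            # fix, rollback, traffic switch
    DAMAGE_CONTROL = auto()      # restore user data, contact affected users
    ROOT_CAUSE_ANALYSIS = auto()
    POSTMORTEM = auto()
    ACTION_ITEMS = auto()


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DISCOVERY
    # Every transition is logged; the timeline later feeds the postmortem.
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def advance(self, next_phase: Phase, note: str = "") -> None:
        self.timeline.append(
            (datetime.utcnow(), f"{self.phase.name} -> {next_phase.name} {note}".strip())
        )
        self.phase = next_phase
```

An on-call engineer could create such a record at discovery and advance it step by step, so that by the Postmortem phase the timeline has largely written itself.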

Once an incident has been detected, the first thing to do is collect as much data as possible to localize the buggy scenarios, understand the user impact, and work out a plan to recover the system. It is impossible to do all of this alone: it requires involving customer care agents, business analysts, operations managers, tech leads, and so on. Not to mention that the more coordination there is within a team like that, the faster and more seamlessly the incident will be extinguished.

Don’t let emotions take over, whether it’s anger, irritation, or disappointment. It’s better to put all your effort into minimizing the damage done to users, and once that is taken care of, with a cool head, to research the root cause and the underlying factors of the issue. That is the time for discussion; not an individual journey, but a collective brainstorm. The trigger of the incident may well be a specific action by an employee, but in a complex system the real cause is, in the majority of cases, a combination of many different factors.

Let’s take a look at a very common example: a service ran out of memory and stopped responding. The trigger for the incident was a commit by a developer. The fix is simple — roll back to the last production version.
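
For illustration only, the triggering commit in a case like this can look completely harmless. The hypothetical sketch below (made-up names, not the actual code involved) shows a typical pattern: an in-memory cache added as an optimization, with no size limit or eviction, so under real production traffic the process keeps growing until it runs out of memory.

```python
# Hypothetical "optimization": cache rendered responses in memory.
# The cache is keyed by every distinct request and is never evicted,
# so memory usage grows without bound under real traffic.
_response_cache: dict[str, bytes] = {}


def expensive_render(key: str) -> bytes:
    # Stand-in for the real work (templating, DB calls, etc.).
    return (key * 1000).encode()


def handle_request(key: str) -> bytes:
    if key not in _response_cache:
        _response_cache[key] = expensive_render(key)
    return _response_cache[key]
```

The rollback buys time; understanding why a change like this slipped through is the job of the analysis that follows.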

In this case, the incident analysis might be:

- check with devs why that service ran out of memory

- check the deployment process with DevOps

- check with QA why it wasn’t noticed during testing

- check test coverage with the automation team

- check the production monitoring with support agents

- etc.

I hope you agree with me that a complex analysis like that can hardly be done by just one person. It’s like going to a doctor with a persistent headache: a neurologist would look for the problem in the nervous system, an endocrinologist in the endocrine system, and an ophthalmologist would check your eyes first. What are the chances that any single diagnosis is correct? Often a symptom that looks neurological is actually triggered by an endocrine disease, and vice versa.

What if, instead of visiting them all one by one, you could have them in one place? There is a good term for this: synergy, when the result of group work significantly exceeds what each of the participants could achieve separately. In medicine, synergy takes the form of a council of doctors, which allows a comprehensive examination of the patient’s problem and usually helps diagnose the actual disease properly.

The same applies when investigating a production incident. Use the power of synergy and a variety of people with expertise in different fields to build an all-round picture that covers much more ground.

All of the above comes down to the main points behind a no-blame culture:

  • Work together on fixing the issue
  • Have a group discussion to learn about the cause of the issue
  • Perform action items to improve the system
  • Move on

You may say that a no-blame culture looks good, but does it provide enough awareness and a sense of responsibility for major failures? My answer is yes, and the tool for that ‘educational practice’ is the Postmortem. A Postmortem is a document where we publish all the information about the incident: the timeline, the detailed reasons, the user impact, the people who were involved, how we fixed it, and the action items that have to be carried out so that it never happens again. The document is published for the whole company to see, so anyone can learn from the incident.
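
As a rough sketch, and assuming nothing about the actual template Wix uses, the fields described above could be modeled like this:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActionItem:
    description: str
    owner: str         # every improvement has a named owner
    done: bool = False


@dataclass
class Postmortem:
    title: str
    timeline: list[tuple[datetime, str]]  # what happened and when
    root_causes: list[str]                # detailed reasons, not just the trigger
    user_impact: str                      # who was affected and how badly
    participants: list[str]               # people involved in handling the incident
    resolution: str                       # how the system was recovered
    action_items: list[ActionItem] = field(default_factory=list)
```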

There is always a responsible person or a group of people who own a Postmortem. I have participated in writing one a few times, and I can tell you that once you gather the information about user impact, you will do anything to keep it from happening again. When you see the effect a system failure has had on users, it really stays with you and becomes a huge factor in reflecting on the situation, drawing conclusions, and learning lessons from it.

Finally, I want to tell you what a no-blame culture means to me personally:

  • for me as an R&D engineer, a culture like that provides a safe environment where I can learn, experiment, and play with my product, knowing that if I make a mistake, there will be people to help;
  • as a QA leader, I can rely on it to encourage my team to try new things, to be open and supportive of others when needed, to let them be responsible for their decisions, and to make sure they have room to grow;
  • as a manager, I learn how to deal with crises and how to communicate with different departments and provide feedback.

To sum it up

There is no way to avoid production incidents completely. Using a penalty system or simply ignoring incidents is bad practice: it kills people’s creativity and doesn’t let them learn from what happened.

An alternative way to handle production incidents is based on a no-blame company culture. In a nutshell:

  • leave emotions aside and focus on fixing the production issue;
  • don’t blame the person who triggered the incident; instead, hold a group discussion to build an all-round picture and reveal all aspects of the incident;
  • work out and execute action items to improve the system;
  • assign an owner for the Postmortem write-up and let people take responsibility for their actions;
  • move on.

This culture means a lot to me personally. I can safely play and experiment with my product, teach my team to be supportive and responsible, and grow as a manager by learning how to deal with crises.

I hope you now have a few ideas of your own on how to survive a hard time like the one we just discussed. Stay safe and don’t be afraid to make a mistake!
