Management is not about sorting apples

The use of blameless post-mortems is one of the most notable, and perhaps most misunderstood, aspects of Etsy’s engineering culture. John Allspaw wrote about them on Code as Craft back in May 2012. In that post, he talks about human error and the “Bad Apple theory,” which asserts that the best way to eliminate error is to eliminate the “bad apples” (also known as humans) who introduce it.

Blameless post-mortems are usually thought of in the context of outages and outage investigation. I believe that once you accept the reasoning behind building a culture of learning rather than a culture of blame around outages, it should change how you think about management in general.

Etsy’s practices around post-mortems are drawn largely from the field of accident investigation. The concept of local rationality is one of the key pillars of understanding the human role in accidents and outages. You can read about it in this rather dry paper, Perspectives on Human Error: Hindsight Biases and Local Rationality, by David Woods and Richard Cook. To oversimplify, at any given time, people take actions that seem sensible to them in their current context. Even when people take what seem to be negligent shortcuts, they do so confident that what they’re doing is going to work — they just happen to be wrong.

The challenge we face is to build resilient systems that enable the humans interacting with them to exercise local rationality safely. Disasters occur when the expected outcomes of actions differ from the actual outcomes. Maybe I push a code change that is supposed to make error messages more readable, but instead prevents the application from connecting to the database. The systems thinker asks what gave me the confidence to make that change, given the results. Did differences between the development and production environments make it impossible to test? Did a long string of successful changes give me the confidence to push the change without testing? Did I successfully test the change, only to find that the results differed in production? A poor investigation would conclude that I am a bad apple who didn’t test his code properly and stop before asking any of those questions, but that conclusion is unlikely to lead to a safer system in the long run. Only in an organization where I feel safe from reprisal will I answer questions like these honestly enough to create the opportunity to learn.

I mention all of this to provide the background for the real point I want to make: once you start looking at accidents this way, it changes the way you think about managing other people in general. When it comes to accident investigation, the case on the bad apple theory is closed; it’s a failure. Internalizing this insight has led me to also reject the bad apple theory when it comes to managing people in general.

Poor individual performance is almost always the result of a systems failure that causes local rationality to break down. All too often, the employee who is ostensibly performing poorly doesn’t even know that they’re not meeting their manager’s expectations. In the meantime, they may be working on projects that don’t have clear goals, or that they don’t see as important. They may be confronted with obstacles that are difficult to surmount, often as a result of conflicting incentives.

There are a million things that can lead to poor outcomes, only a few of which are due to the personal failings of any given person working on the project. If you accept that local rationality exists, then you accept that people believe that their work performance meets or exceeds what’s expected of them. If they knew better, they would do better.

All this is not to say that there are never cases where an employment relationship should end. Sometimes people are on the wrong team, or at the wrong company. What I would say though is that the humane manager works to construct a system in which people can thrive, rather than getting rid of people who aren’t succeeding within a system that could quite possibly be unfit for humans. Even in the case where a person simply lacks the skills to succeed at the task at hand, someone else almost certainly assigned them the task or allowed them to take on that responsibility. Their being in the position to fail reflects poorly on the system as well as on the individual.

These principles are easier to apply within the limited context of investigating an incident than in the general context of managing an organization, or the highly personal relationship between a manager and the person who reports to them. Focusing on creating a system that works well for the people who participate in it is the bedrock of building a just culture. As managers, it’s up to us to create a safe place for employees to explain the choices they make, and then to use what we learn from those explanations to improve the system overall. Simply tossing out the bad apples leads to a team that is unable to look back honestly and improve.


This post originally appeared on my blog. I also adapted it into a talk that I presented at Monktoberfest 2015, which is available on YouTube.