Over the coming months, we’ll be publishing a series highlighting the lessons we learned in the room where it happened at Code for America Summit 2018.
John Allspaw, Founder of Adaptive Capacity Labs and former CTO of Etsy, spearheaded a practice of “blameless postmortems” for technology teams, but the philosophy behind it carries immense value for people working in any profession.
Every successful technology begins with a narrow scope, and develops greater capacity as it becomes more popular and efficient… but it never gets simpler. The software teams that support it are experts coping with complexity and uncertainty, often in high-tempo environments with high consequences. When dealing with complex systems, mistakes and accidents are bound to occur — how we respond to them makes all the difference in whether we learn from them.
The goal is to understand how an accident happened in order to better equip ourselves to prevent it from happening in the future.
The people who were there in the middle of an incident are experts in what went wrong and all the messy details that can prevent it from happening again. Blameless postmortems are part of an approach that emphasizes forward-looking accountability and enables us to turn accidents into real investments in the future.
To hear John’s speech in full, watch the video or read the transcript below.
John Allspaw: Thanks a lot. I’m super excited to be here. So my goal this morning is to convince you that in order to actually learn from incidents or outages or whatever sort of interruptions you have in digital services that there are a few necessary conditions that have to be in place for that to actually happen and not all of those conditions are necessarily intuitive. I also think that as I’m giving this talk I suspect there might be some really good connections with what Rodney just talked about. So here’s a little bit about me. Here are some things that I’ve written, some places I’ve worked and relevant to this talk some places that I’ve studied. Some years ago when I was CTO at Etsy I also managed to get a master’s degree in human factors and systems safety in Sweden so I’ve had my foot, well multiple feet, in multiple buckets of both software engineering as well as fields like cognitive systems engineering and resilience engineering and that sort of thing.
So I want to set the stage here with something I find interesting about technology: what happens when it becomes successful in helping people do their work and solve problems. Just about every successful technology you can think of, even non-digital, starts out with a narrow scope. It’s bespoke; it’s there to solve a particular problem. But of course, as the use of these tools becomes more popular, more time and effort is put into extending them, making them handle new types of functionality, and making them more efficient. So over time these products end up not only with new functionality but also with new risks that weren’t present when they were first created. You end up pushing these boundaries. By the way, this is what’s known in cognitive systems engineering as the law of stretched systems: every system is stretched to operate at its capacity, and as soon as there is some improvement, that improvement will be exploited to achieve a new intensity, a new tempo of activity. Because of this, success in software systems means that they grow in complexity alongside that success.
Another way of saying this is that successful software never gets simpler. I think that’s something we can all agree on. That growing success, and therefore that growing complexity, makes it more and more difficult for people to understand how it all works. No single person, or even a single team in many cases, has an accurate and comprehensive mental model of the system, and so operating and working on these systems can result in surprising situations. We generally like to call these incidents. Now, most issues that come up in software services, as I’m sure those who have built this sort of thing are familiar with, aren’t really remarkable, and you mostly don’t notice them. Except the people who created them might notice: alerts and bits of modern boring stuff. But sometimes these problems can show up in really unexpected and consequential ways. And for the people in this room, that has a particular flavor to it.
So for example, the outage of an entire state’s Department of Motor Vehicles. By the way, I’m not picking on California; as it turns out, failure in software doesn’t really care where you are, and there are plenty of stories like this. Let me use another example: airline reservations. With airline reservation systems, we might think the biggest issue is that people can’t buy tickets. That’s only a very small part of it. Even for a very small outage there can be downstream cascading effects. Even an hour-long outage can cause a really huge disruption to things like flight crew scheduling. Remember, these are folks who can only work a certain number of hours per day. Route logistics, refunds, all kinds of stuff. It’s basically a combinatorial Tetris of problems that can sometimes take weeks to recover from. Events like these, incidents, especially those that get media attention, are almost always what are known as fundamental surprises. That is to say, they are not foreseen; the ones that are anticipated are generally taken care of.
So we can think of these incidents … Some colleagues of mine have thought of incidents as encoded signals that our systems are trying to send us about how and where our understanding, our mental models, are either incomplete or flawed. So let me summarize where we are now. I’ve described software as really teams of experts coping with complexity, quite often under competitive or political or production pressures. This means that the tempo and the consequences for getting it wrong are generally quite high. And there’s always an element of uncertainty and/or ambiguity. Anybody in this room who’s ever responded to an outage, gotten paged, or heard “the site’s down” knows that at the very beginning of any outage it’s quite unclear, ambiguous, and uncertain: is this just a rough Monday afternoon, or is this the one? Is this the big one?
From a cognitive perspective, the elements I describe have a lot in common with other domains, and those domains have had a lot of time, a lot of research, and unfortunately a lot of accidents and incidents to be studied by researchers: researchers who understand how people make decisions and what expertise looks like. Now, there are many parallels between software and these domains, and there’s lots I could talk about but don’t have enough time for. The one thing I will say is that safety comes from people. It comes not from technology but from people who are continually adapting, adjusting, anticipating, inferring, observing; they’re adapting and changing their work to fit the situations they find themselves in. Much in the way that Rodney mentioned somatic awareness, there’s a lot of tacit knowledge in software engineers, and this is where expertise sits. We shouldn’t be interested only in what makes systems fail and have outages. We should turn our attention to the fact that those failures are very rare. What are the reasons for your systems being up? It’s because people are doing things, and they’re doing things because of what’s known in this field, in the study of healthcare, as the messy details.
These are all the details of what people do: the small adjustments, the things that make them experts. So in the wake of an incident, when you have a bunch of people who have a lot of familiarity and expertise, you want to support them in telling you about those messy details. That’s the thing about experts: experts are not necessarily expert at telling you and describing to you what makes them experts. You can help them by asking and trying to understand: What actions did they take as the incident unfolded? What effects did they observe? What expectations did they have? Had they ever seen this type of thing before? What assumptions did they make? And what was their understanding as it all unfolded? The goal is to understand how an accident happened in order to better equip ourselves to prevent it from happening in the future. Getting these messy details is critical.
So in 2012, when I was working at Etsy, I wrote a blog post called Blameless Postmortems, and the idea of being aware of and making an effort to avoid blame in post-incident briefings was based on some of this research I’d seen elsewhere. The idea is simple: people need to be able to give that story without fear of punishment or retribution. The difficult part is that the simplest answer is to pin it on somebody who made a mistake. But when you do that, what you’ve done is tell the rest of the organization that that’s how you roll, and those details will go underground in the future.
So this is a question I get; it’s what most people ask when they think that avoiding blame is some sort of hippie “be nice” approach: if we shouldn’t punish people for making mistakes, then how do you get accountability? Accountability in this sense, certainly in the frame of that question, is backwards-looking. What you’re looking for is forward-looking accountability. Forward-looking accountability means that you have the experts of the incident; they are the experts in the mistakes they’ve made. They were there. If you support them and turn them into teachers, turn them into eyewitnesses, this is hugely important data. Now, I know what you might be saying: “Well, this is fine, but this guy comes from the internet, out in the private sector or whatever; this could never work in government.” I’ve got two examples that I want to throw out there as possible counterexamples.
I’m going to read something from the United States Forest Service Learning Review guide. This is rather new, from the last couple of years: “If people are punished for being honest about what transpired, employees will soon learn that the personal costs of speaking up far outweigh the personal benefits. Improving the safety of a system is rooted in information. Anything that makes information more available is desirable and anything that blocks information should be avoided. It is for this reason that the learning review seeks to identify influences and never blame.” This is a guiding document for U.S. Forest Service accident investigation, including fatalities.
Second. I’m about to put a whole bunch of text up there, so I’m just going to read a little bit of it and then show you later. Department of Defense Instruction 6055.7, Section 4: privileged safety information. That includes things like debriefings (if I interview you), diagrams, all of your reflections in a debriefing if you’re involved in an accident or an incident; all of it is deemed privileged. It may not be used to support … The goal is to use it for prevention. It’s not used to support disciplinary or adverse administrative action, to determine the misconduct or line-of-duty status of any personnel, or as evidence before any evaluation board. It also cannot be used to determine liability in administrative claims or litigation, whether for or against the government. You can read more about it. The last bullet’s really interesting: that privileged safety information may only be released as provided elsewhere or upon specific authorization by the Secretary of Defense. They take this seriously. It can happen in government.
So I’m about out of time. The point I want to make is that this is not intuitive, and it takes a lot to really wrap your head around. It is not the way traditional accident investigation works, the old way of thinking. The rest of these domains are moving forward. We have an opportunity. Not all parts of software and organizations are set up in a place where they can make progress on this. If you want to hear more, there are a couple of books that I really, really suggest; you’ve certainly spent more money on worse books. Later today there’s a breakout session, and I hope you’ll come to challenge me and ask me questions about this. Thanks for listening.