Learning from Postmortems and Outages

Many of my discussions recently have involved the words Availability and Resiliency. As the worldwide leader in technology solutions for automotive retail, CDK Global has an obligation to the entire industry, automotive dealers, manufacturers, and partners alike, to have highly available systems ready to transact business at the pace of the market. When our systems are unavailable, business is disrupted across the entire industry. I am responsible for what we call the Common Services, a set of core services on which our products are built. These services include authentication, so an outage means that customers can’t even log in, even if the rest of the system is humming along perfectly. Recently we have increased our focus on improving the reliability of these services that the industry relies on. To create the right sense of priority and culture, we introduced our teams to a slightly modified version of Mikey Dickerson’s Hierarchy of Reliability:

Common Services Hierarchy of Reliability

First and foremost, we must have monitoring to alert us to anomalies or disruptions of our service. Once we have monitoring, we must be able to respond to an incident and follow up with an analysis of that incident so we can learn and improve. Filling out the hierarchy: our services need to be secure; we need to be able to test and release them efficiently; they must be adoptable, with utilization increasing; we must understand our capacity needs so we can scale to meet demand; and finally we can improve our user experience and add new features. For most of these items, we have a clear trajectory and a vision of what success looks like, as they are topics that are well understood both inside and outside of our organization.
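
To make the ordering concrete, here is a minimal sketch (hypothetical, not our actual tooling) of how the hierarchy could be encoded as an ordered checklist that always points a team at the lowest level still needing attention. The level names paraphrase the description above; the assessment values and function names are purely illustrative.

```python
# A minimal sketch (hypothetical, not CDK's actual tooling) of the modified
# Hierarchy of Reliability as an ordered checklist. The idea: always invest
# in the lowest level that is not yet satisfied before moving up the stack.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Level:
    name: str
    satisfied: bool  # illustrative self-assessment for a given service


# Ordered bottom-up, paraphrasing the hierarchy described above.
HIERARCHY: List[Level] = [
    Level("Monitoring", True),
    Level("Incident response", True),
    Level("Incident analysis (postmortems)", False),
    Level("Security", True),
    Level("Testing and release", True),
    Level("Adoption and utilization", False),
    Level("Capacity planning", False),
    Level("User experience and new features", False),
]


def next_focus(levels: List[Level]) -> Optional[Level]:
    """Return the lowest unsatisfied level, i.e. where to invest next."""
    for level in levels:
        if not level.satisfied:
            return level
    return None


if __name__ == "__main__":
    focus = next_focus(HIERARCHY)
    print("Next focus:", focus.name if focus else "all levels satisfied")
```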

However, one of those items stands out because it’s something I’ve not seen done well: Incident Analysis, more commonly known as the postmortem. After we recover from a service disruption, a meeting is held to determine the “root cause” and publish a report detailing what happened, why it occurred, and what we will do to keep it from happening again. On the surface, it seems like an easy box to check: after the incident, you get together and talk about what happened, often using an approach like the “5 Whys”. But how often does the postmortem end up seeking “who” was responsible rather than “what” conditions existed to allow the event to occur? This is a topic I set out to investigate through the concepts of complex systems, human error, safety, and learning.

Complex Systems

The first stop on my journey was acknowledging that our solutions are complex systems, and to reason further about this topic, it’s important to take a look at how complex systems fail.

A complex system is any system featuring a large number of interacting components (agents, processes, etc.) whose aggregate activity is nonlinear (not derivable from the summations of the activity of individual components) and typically exhibits hierarchical self-organization under selective pressures.

This article isn’t meant to be a deep dive into complex systems, so I’d recommend additional reading from Wikipedia, the Santa Fe Institute, or the New England Complex Systems Institute. In 2000, Richard I. Cook, MD, wrote a paper called How Complex Systems Fail. Dr. Cook’s research was in the context of patient safety in a medical setting, but many of his findings translate to software and IT in general. Here is a selection of his 18 observations:

  • Catastrophe requires multiple failures — single point failures are not enough.
    The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents. Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners.
  • Complex systems contain changing mixtures of failures latent within them.
    The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.
  • Hindsight biases post-accident assessments of human performance.
    Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. This means that ex post facto accident analysis of human performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors. It seems that practitioners “should have known” that the factors would “inevitably” lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.
  • Post-accident attribution to a ‘root cause’ is fundamentally wrong.
    Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.

“…no isolation of the ‘root cause’ of an accident is possible.” That is a powerful statement, and it contradicts most approaches to postmortems. If we can’t determine the “root cause”, then what is the goal of a postmortem? Learning. In order to learn, we need as much context and as many perspectives as we can get. As stated earlier, asking “why?” frequently leads us to “who” and not “what”. A better question to ask is “how?”. “How?” gets us to describe the conditions that existed during the event and provides a much better opportunity to learn from it.
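
One way to make the shift away from a single “root cause” tangible is in the shape of the postmortem document itself. The following is a rough, hypothetical sketch of a record that captures multiple contributing factors and “how”-style prompts instead of a root-cause field; none of the field names come from an existing template.

```python
# A rough, hypothetical postmortem record that deliberately has no single
# "root cause" field: it captures multiple contributing factors and the
# "how"-style prompts used to surface them. Field names are illustrative.

from dataclasses import dataclass, field
from typing import List

HOW_PROMPTS = [
    "How did the system behave in the lead-up to the event?",
    "How did responders become aware that something was wrong?",
    "How did conditions at the time make the actions taken seem reasonable?",
    "How did our defenses block, or fail to block, the failure trajectory?",
]


@dataclass
class ContributingFactor:
    description: str         # a condition that existed, not a person to blame
    how_it_contributed: str  # the answer to one of the "how" prompts


@dataclass
class Postmortem:
    incident_id: str
    summary: str
    timeline: List[str] = field(default_factory=list)
    contributing_factors: List[ContributingFactor] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    # No root_cause field: overt failure requires multiple, jointly sufficient faults.
```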

Second Stories

In “The Black Swan”, Nassim Taleb stated, “we are explanation-seeking animals who tend to think that everything has an identifiable cause and grab the most apparent one as the explanation.” Exploring this idea further led me to the book “Behind Human Error”, written by a group of world-renowned thought leaders on the subject of human error, including Sidney Dekker. This book introduces the idea of first stories and second stories. In most postmortems, we settle for the “first story”: we identify human error as the cause and stop searching once we find the person or group closest to the accident who could have acted differently in a way that would have led to a different outcome. The paradigm shift we need to make is toward the second, deeper story, in which the normal, predictable actions and assessments that we call “human error” after the fact are the product of the systematic processes of the environment in which people are embedded.

So how do we get to these “second stories” and avoid asking “why”? In “The Field Guide to Understanding Human Error”, Dekker provides an approach to debriefing the participants of an event (a rough sketch of how these steps might be captured follows the list):

  • First have participants tell the story from their point of view without any replays that will supposedly “refresh their memory”;
  • Then tell the story back to them as an investigator to check whether you understand the story as the participants understood it;
  • Identify the critical junctures in the sequence of events;
  • Progressively probe and rebuild how the world looked to the people on the inside of the situation at each juncture.
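
As a thought experiment, here is a minimal sketch, not drawn from Dekker’s own materials, of how a facilitator might capture those steps: one record per participant, then a per-juncture reconstruction of how the situation looked from the inside. All names are illustrative.

```python
# A minimal sketch (not from Dekker's materials) of how a facilitator might
# record a debrief: each participant's story in their own words, the story
# as the investigator heard it, and a reconstruction of each critical
# juncture from the participant's point of view. All names are illustrative.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Juncture:
    moment: str                   # a critical point in the sequence of events
    what_they_saw: str            # cues and data available at the time
    what_they_understood: str     # their assessment in that moment
    options_considered: List[str] = field(default_factory=list)


@dataclass
class ParticipantDebrief:
    participant: str
    story_in_their_words: str     # step 1: told without replays or logs
    story_as_heard: str           # step 2: investigator retells it to confirm
    junctures: List[Juncture] = field(default_factory=list)  # steps 3 and 4
```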

Conclusion

Operating complex systems requires acknowledging that these systems will fail in ways we could never have expected. Building a generative culture that is capable of learning from failure requires us to shift our mindset from one focused on blame to one focused on learning. The brief journey I’ve taken into the topics of human error, robustness, fragility, and learning has been eye-opening, to say the least. The immense depth to which these topics can be explored will provide me with many ideas for building stronger teams, cultures, and systems. We will be testing the concept of asking “how?” in our upcoming postmortems. I hope you’ve enjoyed reading this and that it has opened your mind a bit as well.

References

Here is a list, in no particular order, of the authors, books, papers, and articles that were referenced in or otherwise influenced this article: