Mapping failure causes
Failure happens, and when it does it’s always a good idea to learn from it. The technology industry calls post hoc failure analyses “post-mortems”. For example, Amazon has a standard process known as Correction of Error (COE), which teams follow after customer-impacting or near-miss incidents. For some great insights into how AWS investigates operational events using this process, watch Becky Weiss’s wonderful talk from re:Invent 2019: Amazon’s approach to failing successfully. I highly recommend it if you haven’t seen it; Becky covers core AWS doctrine on failure detection and blast radius reduction in detail.
A commonly used technique for discovering and documenting the causal chain of events that led to a particular failure is the “Five Whys” method. We begin by asking what the proximate cause of some observable problem was, and keep digging until we hit a root cause. “Five” refers to the number of iterations suggested to uncover most root causes, but the question can be asked any number of times; you stop only when you are satisfied with the answer. Using this information we can then draw a straight line from cause to undesirable effect. The hope is that this will lead us to changes that help avoid the same failure mode in the future. Amazon COEs, like many other industry incident report templates, include a section capturing the incident causes in this format.
Related techniques such as Ishikawa (fishbone) diagrams and timelines might help collect more data on causes, but they are still fundamentally limited to a very linear representation.
Much has been written about how there really is no such thing as an actual “root cause”. Instead, failures in socio-technical systems are best explained by looking at the interactions of complex, interconnected factors that cannot be understood in isolation. John Allspaw has written an excellent critique of the Five Whys method, and a very informative post / short video by Dr. Johan Bergström covers the traps of incident investigation. I have a lot of sympathy for this view; as my favourite paper of all time, How Complex Systems Fail, observes:
Post-accident attribution to a ‘root cause’ is fundamentally wrong.
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localised forces or events for outcomes.
But that line of criticism is not what I want to talk about in this post. Nor is this post about advanced system failure modelling — if you are interested in that, Adrian Cockcroft has written a survey of resilience tools and techniques. A favourite of mine is Richard Cook’s Velocity conference talk, “Resilience In Complex Adaptive Systems”, in which he briefly covers the Rasmussen boundary model; Marc Brooker has also written about this in “Why Must Systems Be Operated?” Finally, Lorin Hochstein has collected a number of resources on resilience engineering. Instead, I want to propose a way to visualise the observable effect and proximate causes using a simple map to convey additional context.
As an aside, the Amazon COE culture is tremendous at protecting individuals from blame — even in those cases where there is a clear deviation from standard practice by an operator. Amazon even recognises incident writeup authors with a badge of honour on its internal user directory. The discussion in reviews typically revolves around questions like: if manual actions were contributing factors, why did the system fail the operator? Why were there no safeguards in place to prevent what happened? Why did the engineers believe their actions were reasonable and safe, yet they had unintended consequences?
Let’s stick with the linear cause reporting structure for now. It is widely practised and, from my personal experience, it is absolutely possible to derive useful corrective actions from applying it as part of an overall process. I found that Atlassian’s Incident postmortems article does a really good job of capturing the essential practices as I have seen them applied.
Every step of the way, Five Whys presents us with a choice: though there may be several contributing causes at each layer, we have to pick one question to answer as the setup for the next step. Most post-mortems I have seen do not examine these holistically; only in rare instances have I seen reports containing two separate chains of factors. An incident report template that calls for a linear list introduces a bias towards simplifying the tree of causes into a single chain.
I was recently involved in an incident review with my previous team at EC2. At one point I observed that the causes of the particular outage we were reviewing formed a graph, and not necessarily an acyclic one at that. That thought stayed in the back of my mind for a while and culminated in this post.
Wardley Mapping (longer video introduction, complete book) is a business strategy technique that visualises how components form value chains anchored in a customer need. Components are spatially arranged: the vertical axis represents relative visibility to a user, and the horizontal axis — relative maturity. We can use Wardley maps to convey interdependencies and anticipated evolutionary developments over time.
Could we leverage this technique in the failure cause analysis domain? Let’s use the classic “car won’t start” example from the Five Whys Wikipedia entry. The linear version of the analysis goes something like this:
- The vehicle will not start.
- Why? — The battery is dead. (First why)
- Why? — The alternator is not functioning. (Second why)
- Why? — The alternator belt has broken. (Third why)
- Why? — The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
- Why? — The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
The above scenario might help us formulate corrective actions such as immediately replacing the alternator belt with an uprated version, installing a battery voltage gauge on the dashboard, and adding recurring engine-bay visual inspections and service-schedule reminders to the calendar.
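To make the structure of the technique explicit, the chain above can be captured as a simple ordered list — each answer becomes the subject of the next “why?”. This is only a sketch of the worked example; the `render_chain` helper is mine, not part of any standard tooling:

```python
# A Five Whys analysis is structurally a linear chain: each answer
# becomes the subject of the next "why?". Entries follow the car
# example above.
five_whys = [
    "The vehicle will not start.",
    "The battery is dead.",
    "The alternator is not functioning.",
    "The alternator belt has broken.",
    "The belt was beyond its service life and not replaced.",
    "The vehicle was not maintained to the recommended schedule.",
]

def render_chain(chain):
    """Render the observable effect followed by each successive why."""
    lines = [chain[0]]
    lines += [f"Why? — {cause}" for cause in chain[1:]]
    return "\n".join(lines)

print(render_chain(five_whys))
```

The list shape is the point: the data structure itself cannot express more than one contributing cause per step, which is exactly the limitation discussed below.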
Map of the Problematique
For added context over the linear analysis produced by Five Whys, we could apply the concepts of mapping to visualise the contributing factors in context. I started with the car example from earlier and liberally sprinkled in additional made-up context to come up with the following map.
I used the Wardley mapping convention of placing the most visible element of the system at the top — in this case, the need to have transport to our place of work, our primary use of the vehicle. The horizontal axis usually represents evolution: from “genesis”, i.e. the brand new and uncharted, to “commodity/utility” on the far right. Here, I have used it to represent change from emergent or rare events, moving to established and recurring patterns of behaviour on the right.
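To make the two axes concrete, here is a minimal sketch of the map’s contents as data: each element gets a visibility coordinate (vertical axis) and an evolution coordinate (horizontal axis). The factor names come from the example, but the numeric positions are illustrative guesses of mine, just as placements on a real map are judgement calls:

```python
# Each map element gets a position: visibility (vertical axis, 1.0 = most
# visible to the user) and evolution (horizontal axis, 0.0 = emergent/rare,
# 1.0 = established/recurring). Coordinates are illustrative guesses.
factors = {
    "transport to work":      {"visibility": 1.0, "evolution": 0.9},
    "vehicle will not start": {"visibility": 0.9, "evolution": 0.2},
    "dead battery":           {"visibility": 0.7, "evolution": 0.3},
    "broken alternator belt": {"visibility": 0.5, "evolution": 0.3},
    "deferred servicing":     {"visibility": 0.3, "evolution": 0.7},
    "longer commute":         {"visibility": 0.2, "evolution": 0.8},
}

# List elements top-to-bottom, as they would appear on the map.
for name, pos in sorted(factors.items(), key=lambda kv: -kv[1]["visibility"]):
    print(f"{name:24s} visibility={pos['visibility']:.1f} "
          f"evolution={pos['evolution']:.1f}")
```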
This representation is a superset of the Five Whys report and could be presented in conjunction with the textual narrative. I have highlighted in green the original linear causal chain we started with. I think this approach has many advantages over a plain linear description, just as mapping has over storytelling.
The top-level visible failure is anchored in a user need. (Why do we even care that the car isn’t starting?) This is usually self-evident in most failures but reminding ourselves of the purpose that the failed component serves might inspire novel ways to route around the failure.
One immediate advantage is that the graphical representation allows us to more compactly represent multiple interrelated causes as a tree of dependencies. Using a map in an incident report template might remove some of the bias towards following only a single chain of whys. In this example, the alternator belt might have had accelerated wear because the car was used for more frequent long trips than in previous years, in addition to deferring our annual mechanic visit because of time or cash flow concerns. Sure, the belt might have broken because we skipped a service. But there are probably other contributing factors, such as having a longer commute round trip racking up mileage faster than before.
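The difference from the linear chain can be sketched as a directed graph: a mapping from each effect to its set of contributing causes. The extra factors here are the made-up context from the example above, and this is one possible encoding rather than a prescribed format:

```python
# Contributing causes as a directed graph: effect -> contributing causes.
# Unlike a Five Whys chain, an effect may have several contributors.
# Factors are the made-up context from the car example.
causes = {
    "vehicle will not start": ["dead battery"],
    "dead battery": ["alternator not functioning"],
    "alternator not functioning": ["broken alternator belt"],
    "broken alternator belt": ["deferred servicing", "accelerated belt wear"],
    "accelerated belt wear": ["more frequent long trips"],
    "deferred servicing": ["time pressure", "cash flow concerns"],
}

# Effects with more than one contributor are exactly the points where a
# single chain of whys forces an arbitrary choice.
branch_points = [effect for effect, cs in causes.items() if len(cs) > 1]
print(branch_points)  # → ['broken alternator belt', 'deferred servicing']
```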
The notion of an evolutionary axis also allows us to convey speculative impact such as the possibility of accelerated battery wear. Even though it is functioning fine after the belt was replaced, the battery might have experienced a deeper discharge than it was designed for, causing long-term damage. The map allows us to more easily convey known trends and spot emergent patterns. For example, we might flag increasing stress levels as not just a contributing factor but also something that’s been getting worse lately.
The key insight for me is that it is now possible to spot feedback loops as cycles in the graph. Patterns of self-reinforcing behaviour might cause the failure to reoccur, possibly in a different area with worse consequences. Maps could also help us to identify redundant fixes. By considering the failure in the broader context of user needs, we might decide to address the issue by pruning an entire tree of dependencies. As old school programmers are fond of saying, “deleted code is debugged code”. The commute-busyness-stress cycle highlighted in yellow will probably keep causing other issues if left unaddressed.
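Once the causes are a graph, feedback loops can be found mechanically with a standard depth-first search for back edges. The edges below are an illustrative encoding of the hypothetical commute–busyness–stress cycle, not taken from a real incident:

```python
# Detect feedback loops (cycles) in a cause graph using depth-first
# search: a "grey" node reached again while still on the current DFS
# path indicates a back edge, i.e. a cycle. Edges are illustrative.
graph = {
    "longer commute": ["less free time"],
    "less free time": ["higher stress"],
    "higher stress": ["deferred servicing", "less free time"],  # feedback edge
    "deferred servicing": ["broken alternator belt"],
    "broken alternator belt": [],
}

def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def dfs(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            if colour.get(nxt, WHITE) == GREY:  # back edge: cycle found
                return path[path.index(nxt):] + [nxt]
            if colour.get(nxt, WHITE) == WHITE:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

print(find_cycle(graph))  # → ['less free time', 'higher stress', 'less free time']
```

A dedicated mapping tool could run exactly this kind of check to highlight self-reinforcing loops automatically.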
Depicting failure in this way engages more of our visual processing faculties compared to purely textual reports. A map gives us a different lens with which to view failure; we want tools and techniques that help us arrive at better solutions at a more fundamental level than applying fixes to “the root cause”.
In our example, we might decide that a better way to commute to work is to sell the car and buy a bicycle. Or, at the very least, that the flashy car we chose also implies a costly maintenance commitment we hadn’t budgeted for — and we could be better off downgrading to something more economical. Recognising that the time pressure might be further exacerbated by an even longer and less pleasant commute, we might consider negotiating a part-time work-from-home arrangement, or even look for a different job altogether.
If there is one potential pitfall in applying Wardley mapping to failure causes, it’s the fact that these factors are not evolving technological components. I think we are still justified in placing them along an evolutionary timeline. Just like technology components, risk factors have a way of settling in for the long run, thanks to the normalisation of deviance.
Another potential source of difficulty is the complexity of the resulting maps. Dealing with multiple dependencies probably justifies better dedicated tools than we currently have. More advanced modelling software could even allow us to use more than two dimensions, or to visualise movement over time. This is not a concern specific to failure mapping, however.
Finally, I have not actually tried this in practice beyond the toy example I have used to illustrate this post. I would be really interested to do so and share the results. I would be very surprised if there aren’t useful climatic patterns and doctrine that can be identified by the community over time.
The value is in the conversation
A consideration for teams is that the process of constructing a map might be more amenable to collaboration than document writing. Think of a whiteboard-based activity such as the typical diverge-converge-discuss structure of an agile retrospective. We have to apply our judgement in choosing which map elements to include and where to place them, and there are probably many defensible positions. Consolidating and arranging the clusters of factors on a whiteboard (real or virtual) helps a team have better discussions about the context in which they operate. This is why I am excited about mapping in general, and about applying the technique in new places such as failure and risk analysis.
Thanks to Bhavani Morarje, Anton Tcholakov, and Fernand Sieber for reviewing a draft of this post.