I’ve noticed an interesting pattern when discussing incidents with engineers over the years.
One of the topics that invariably comes up is the concept of “root cause,” a notion that faithful followers of my Twitter stream know I have at least a few thoughts about. Many organizations base their entire process of understanding incidents on the concept, and many of the techniques they use to facilitate that understanding, such as “The Five Whys,” are firmly rooted in this concept of a “linearity of events.”
Challenging this idea, and suggesting that in complex systems this linearity is soothingly deceptive (but deceptive nonetheless), always prompts a fascinating discussion, and oftentimes results in impassioned arguments that the idea of a root cause is crucial to understanding how incidents unfold.
The interesting pattern I’ve noticed is the way developers react to this idea versus the way operations engineers react: in my experience, developers tend to argue more vehemently that root cause matters and that cause and effect can be concretely established. Operations engineers, on the other hand, tend to nod and engage with the idea that linear narratives of the complex world may be deceptive.
I’ve always wondered why this is: what is it about developers and their experience that tends to make them react to the idea of “root cause is a myth” like an immune system seeking out a foreign agent, while operations engineers tend to at least entertain the idea?
I’m not entirely sure, but I do have an idea, and it has to do with the different contexts in which the two roles go about their daily work.
Developers work with tools that tend to be deterministic: compilers, linkers, and operating systems are complex beasts, certainly, but if we give them the same inputs, we generally expect the same outputs. And if there is a problem with that output (a “bug”), then the way developers go about solving it is to analyze the inputs (either from the user, or to the suite of tools that make up the development process), find the “error,” and then change the inputs. This will fix the “bug.”
In fact, non-determinism itself is considered a bug: if the unexpected or errant output isn’t reproducible, then developers tend to extend their investigation into other parts of the stack (operating system, network, etc.) that we more or less assume should behave in the same way as long as we can reproduce the inputs… and if they don’t, then it’s still a bug. It’s just an operating system or networking bug.
Either way, determinism is a basic, almost unstated assumption of much of the work developers do.
But for any operations engineer who’s spent time racking and stacking hardware in a data center or arguing with a cloud API, this idea of a fully deterministic world (as long as we can map out all the inputs!) is a fleeting notion at best. Venerable Bastard Operator From Hell jokes about sunspots aside, seasoned operations engineers have seen all sorts of weirdness in the physical world and know that even a noisy neighbor can ruin your day.
So, poking holes in the notion that there exists a root cause of our incidents, and that tools like “The Five Whys” will faithfully (and repeatably!) lead us to that singular root cause, isn’t that much of a leap for operations engineers. In effect, it challenges an idea that, when many operations engineers look back on their own careers, never really matched their experience anyway. So the reaction is different.
I do not, of course, mean to imply that developers’ reactions are silly or stupid, or that they are incapable of understanding how linearity may be deceptive. Seasoned developers are likely to have seen their fair share of non-determinism in the world.
But I do think the reaction I tend to get from developers in these discussions has to do with the fact that the concept of determinism generally serves them well in the day-to-day execution of their work. And their run-ins with non-determinism aren’t as frequent as operations engineers’ fights with Schrödinger’s cat pawing at their infrastructure.
Ultimately, whether or not this fully explains the reactions I see, it is a potent reminder that the substance of our reactions is a complex amalgam of not only the topic at hand, but numerous other factors too.
And this is important to remember, whether we’re debriefing a single incident, collaborating across a software delivery pipeline, or making sense of our broader world.