Impermanence: The Single Root Cause

https://flic.kr/p/zcxPr

If you google for “impermanence”, you will learn that the word means “lack of permanence or continued duration”. This definition is admittedly very vague and very boring. If you look a bit further though, you will discover that impermanence is also the name of an essential doctrine of Buddhism. The doctrine says that “all temporal things, whether material or mental, are compounded objects in a continuous change of condition, subject to decline and destruction”. Well, I don’t believe in Buddha, or Jesus for that matter. But I do strive to understand complex systems (and how they fail). If religion might help me just a little bit in that regard, I’m more than willing to listen.

Let’s try to interpret the doctrine, starting with the core statement:

All things are compounded objects in a continuous change of condition.

Compounded objects are objects formed by combining two or more parts. In complex systems like web systems, it’s safe to assume that all things are compounded. Every hardware component and every piece of software consists of multiple parts.

Given this insight, allow me to make a few mental leaps:

All systems are in a continuous change of condition.

All systems are changeable by nature.

And finally, to close the loop:

All systems are impermanent.

In fact, constant change is a prerequisite for systems to function — to fulfill their purpose.

Despite their catchy name, even immutable servers aren’t truly immutable (sorry). For servers to do anything useful (processing HTTP requests, streaming log messages, rendering cat pictures), countless changes have to take place in both software and hardware. Change is indispensable.

On a related note, we practice Chaos Engineering to learn something new about our systems by deliberately imposing change on them. In a sense, we embrace the fact that all systems are impermanent.

But why is this understanding useful? Why should you care?

At this point, it’s time to admit that I first read about impermanence in Dave Zwieback’s superb book, Beyond Blame, which led me to write the article, Learning From Failure and Success Through Postmortems. He not only taught me about the meaning of the word but also, and more importantly, introduced me to this powerful idea:

Impermanence is the single root cause for all failures and successes.

In Beyond Blame, Dave explains it as follows: “The root cause for both the functioning and malfunctions in all complex systems is impermanence (i.e., the fact that all systems are changeable by nature). Knowing the root cause, we no longer seek it, and instead look for the many conditions that allowed a particular situation to manifest. We accept that not all conditions are knowable or fixable.”

Let that sink in for a minute. (We’ll get to the details in a bit.)

Eager to learn more, I also read Dave’s free report, Antifragile Systems and Teams, whose first chapter summarizes impermanence, this time in more detail.

The report starts by repeating the core idea — “systems start, stop, or continue working” due to their “changeable, impermanent nature”, which is the single root cause — and goes on to explain why this theoretical understanding is indeed useful: “it reminds us that all functioning systems will eventually break down”. That knowledge, in turn, “frees us from looking for the ‘single root cause’ of outages, and from the mistaken belief that there is none”. (As Sidney Dekker famously put it: “What you call root cause is simply the place where you stop looking any further.”)

Having accepted impermanence, we might be tempted to blame it for each and every incident. Doing so, however, would be a mistake, depriving us of the opportunity to learn from failure. Besides, we are engineers! As Dave rightly observes, we “cannot accept that things break or function entirely randomly”. I certainly can’t. And rather than giving up our profession and going shopping, we should try hard to “identify [at least] some of the conditions” contributing to the success and failure of our systems, i.e., “conditions that we can actually impact” such as infrastructure design or collaboration in the workplace.

By finding and fixing those conditions — potentially through postmortems — we’re able to improve our organizations and systems in a meaningful way.

In conclusion, we need to stop wasting our time looking for the single root cause. Impermanence is the one root cause of every functioning system and every outage. Period. Instead, we should focus on the conditions leading to both good and bad situations.

P.S. This article first appeared on my Production Ready mailing list.