Antifragile System Design 5: Chaos Engineering

Hannes Rollin
12 min read · Nov 1, 2023


Chaos /ˈkeɪ.ɒs/ 1 a: a state of utter confusion; b: a confused mass or mixture; 2 a: a state of things in which chance is supreme; b: the inherent unpredictability in the behavior of a complex natural system (such as the atmosphere, boiling water, or the beating heart)

Webster’s Dictionary

Watch your step—small disturbances may have great effects

This is the fifth installment of a short series on designing systems that are at least potentially antifragile and, hence, gain from disorder and shock; after posts on optionality, redundancy, evolution, and modularity, it’s now time to look deeply into chaos, although not in Webster’s sense of 1a, 1b, or 2a, which are the layperson’s conceptions of chaos and a far cry from what we need in antifragile system design.

Maybe an apology is in order. In this post, I’ll quote extensively from John Gall’s important book “Systemantics — The Systems Bible,” not only because it’s pertinent to my inquiry but also to whet your appetite. This is good stuff for architects and system designers. Moreover, if public administrators, policy-makers, elite managers, and hyperactive entrepreneurs everywhere studied this altogether enjoyable book closely and worked with it diligently, millions of pointless deaths could be prevented, billions of lives could be improved, and we could collectively stop running against walls that simply won’t budge, never did and never will.

A Note on Chaos

Rather, we’re interested in definition 2b—”the inherent unpredictability in the behavior of a complex […] system.” Don’t ask me why Webster put the “natural” in there, which I had to cut; complex artificial systems like pinball machines, operating systems, and government bureaucracies exhibit chaotic behavior all right.

To recap what I introduced in an earlier post, I’m working with the conception of chaos as input sensitivity, which has been popularized by meteorologist Edward Lorenz, although the term “butterfly effect” has taken root in the collective unconscious to symbolically represent chaos theory as a whole.

Input sensitivity, in essence, means that tiny variations in the inputs can produce comparatively large output variations. Strangely, this property is found not just in the boiling waters of political campaigns and the beating heart of a frantic startup but also in abstract theoretical systems like number sequences. It’s the hallmark of a mathematically trained mind to examine a property using the simplest possible example—without studying two-person zero-sum games, the likes of which are never found in the mess we call real life, it’s highly unlikely that John von Neumann would have come up with the thoroughly useful minimax theorem. So let’s go simplistic.

Here’s an extremely simple example lifted from Lorenz’s book on chaos theory. Consider a sequence of real numbers, beginning with some initial value c(0). Now, repeatedly square the latest number and subtract c(0) from it, so that c(n+1) = c(n)^2 - c(0). Here’s some Python code so that you can experiment yourself.

def chaos_sequence(initial_value, iterations):
    """Iterate c(n+1) = c(n)**2 - c(0), starting from c(0) = initial_value."""
    c = [initial_value]
    for _ in range(iterations):
        c.append(c[-1]**2 - initial_value)
    return c

print(chaos_sequence(2.0001, 15)[-1])
print(chaos_sequence(2.0002, 15)[-1])

What’s happening here? This pretty trivial sequence does weird things for c(0) > 2. The output of this little snippet, compared to the extremely close inputs, is rather divergent after 15 boring steps:

1.5945984072383787e+116
2.209894619509128e+164

Although the input difference is only 0.0001, after 15 steps the two outputs differ by a factor of more than 10^48, a 1 followed by 48 zeros; a staggering number far exceeding the estimated number of stars in the known universe of circa 10^24. Why is this? Here are the main reasons:

  • Analytically, it can be shown by induction that the sequence c(n+1) = c(n)^2 - c(0) grows at least exponentially for c(0) > 2 (if c(n) ≥ c(0) > 2, then c(n+1) = c(n)^2 - c(0) ≥ c(n)(c(n) - 1) > c(n), with a growth ratio above c(0) - 1 > 1). Exponential growth is always a good indicator of chaotic processes. We have known from cybernetics, at least since Norbert Wiener’s foundational 1948 book “Cybernetics: Or Control and Communication in the Animal and the Machine,” that every complex system that exhibits potential positive feedback loops may show (temporary) exponential behavior. Beware of positive (self-reinforcing) feedback that can’t be contained quickly. It will get ugly.
  • The squaring makes each step nonlinear, and the ongoing resetting with the initial value ties every step back to where you started, so that—in conjunction with exponentiality—even tiny discrepancies in the initial value compound over time. Think of this philosophically: Instead of providing a baseline and hedging divergence, the continued resetting is itself a source of instability. From system reboots to management dogmas, resetting isn’t as innocent as it seems.
  • The dependence on previous terms is, of course, a property of every natural and artificial cybernetic system. Each state depends on the system’s last state and, more annoyingly, on the state of the environment. And the environment is large.

The System and the other System, the Environment, are engaged in a dance with each other. The output of one is the input of the other.

  • The notorious sequence c has no attracting fixed points or, as chaos theory has it, no attractors (fixed points do exist for this map, but they repel rather than attract). Even seemingly chaotic systems may settle into a relaxed equilibrium if there’s an attractor that somehow pulls the system into a stable configuration regardless of the initial conditions, although the path might be vastly different, as in the freezing of a lake. Think of it as the limit of a number sequence—no matter where you start in the sequence, you’ll approach the same limit, sooner or later. Or think of it like a black hole: once the state of a system enters the so-called “basin of attraction” (say, your spaceship crosses the event horizon), in you go. Many natural systems have evolved this way, but go find an attractor for large human-made systems. Incidentally, it’s nearly impossible to define an attractor and build it into a system. Instead, attractors emerge; they are epiphenomena of complex systems. In terms of complex systems design, we speak of a stable configuration. A small contrast sketch with an attracting starting value follows right after this list.
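To make the attractor idea concrete, here’s a small contrast sketch reusing the chaos_sequence function from above; the starting value 0.5 is my own arbitrary choice for illustration. Well below 2, the map has an attracting fixed point near -0.366, and two slightly different starting values end up in essentially the same place instead of flying apart:

# Contrast: for starting values well below 2, the same map settles down.
# Both runs converge toward an attracting fixed point near -0.366,
# despite the small difference in starting values.
print(chaos_sequence(0.5, 50)[-1])
print(chaos_sequence(0.5001, 50)[-1])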

It should have become clear that this innocent number sequence has all of the most important elements of a chaotic system. The part reflects the whole. So, now that you know all about chaos, how do we get to chaos engineering?

What Is Chaos Engineering?

Chaos engineering is a brutal discipline within the field of software engineering; it thrived because tampering with software is much cheaper than damaging airplanes. But the principles of chaos engineering, as you’ll see, work perfectly well in other engineering disciplines. Introduce chaos engineering at your own risk.

It focuses on proactively and systematically testing a system’s resilience and robustness in the face of rare failures and adverse conditions. To be clear: You, as a system designer, create these failures and adverse conditions to see what happens. This should first be done during development. The primary goal of chaos engineering is to identify and address vulnerabilities and weaknesses in a system’s design and infrastructure before they can lead to serious problems in production.

Mentally, you treat your system like a chaotic one, which it very likely is, but you don’t usually know where the chaotic parts are. Work with minimal input variations, then minimal error variations, then combinations and permutations, to judge from input sensitivity where chaos reigns. Maybe you even stumble upon a point attractor like the infamous Windows 95 bluescreen—no matter where you start, no matter what you do, you’ll eventually get there.
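To make that probing concrete, here’s a minimal sketch; the helper probe_sensitivity and its parameters are hypothetical placeholders of my own, not an established tool. It nudges an input by a tiny epsilon at random points near a base input and reports the worst amplification it observes, using the toy chaos_sequence from above as a stand-in for a real system:

import random

def probe_sensitivity(system, base_input, wiggle=0.0005, epsilon=1e-4, trials=50):
    # Return the worst observed ratio of output change to input change.
    # A huge ratio hints that the system behaves chaotically near base_input.
    worst = 0.0
    for _ in range(trials):
        x = base_input + random.uniform(-wiggle, wiggle)  # stay close to the base input
        amplification = abs(system(x + epsilon) - system(x)) / epsilon
        worst = max(worst, amplification)
    return worst

# Probe the toy sequence from above as if it were a black box
# (14 iterations keeps the floating-point numbers from overflowing).
print(probe_sensitivity(lambda c0: chaos_sequence(c0, 14)[-1], base_input=2.0))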

Hardcore professionals sometimes extend chaos engineering to live production for two reasons: First, test environments are never even close to live environments, no matter how much you put into them. Your system might fly in the restricted test setting but fall apart (or down) in real life. That’s because artificial systems consist of parts that, individually, rarely support the larger cause. Not-working is the default; working is an act of defiance that has to be ceaselessly supplied with energy and error correction.

The System called “airplane” may have been designed to fly but the parts don’t share that tendency. In fact, they share the opposite tendency. And the System will fly—if at all—only as a System.

Second, developers and engineers everywhere tend to be complacent and risk-blind and fixated on the happy path once a comprehensive test suite has been passed; let’s call it the green light fallacy. But, as we all know from experience, the real tests are out there.

Chaos engineering, in its true shape, speeds up system evolution by speeding up exposure to volatility, disorder, stress, and shocks.

Hence, the secret secondary goal of chaos engineering is to arrive at good system design faster than otherwise. But seen through the lens of project management, chaos engineering is yet another costly activity that has only a small measurable influence on product quality as management sees it. That bug that never got shipped? That extra diligence by developers to ascertain redundancy and robust fail-safe mechanisms? Management won’t see them; you have to advertise your worth and meanly highlight the cost of not doing chaos engineering. And not blow up the system in the process.

Chaos Engineering Principles

Despite the confusing name “chaos,” where three out of four possible meanings of the word allude to, well, confusion, chaos engineering is indeed a serious and respectable (if nascent) branch of engineering. Here are a few principles of chaos engineering to get you started:

  • Controlled Experiments: Chaos engineers deliberately inject failures, faults, or disorders into a system in a controlled and measured manner. See? Deliberate, controlled, measured; nothing chaotic here. These experiments aim to simulate real-world situations and observe how the system responds (a minimal code sketch tying several of these principles together follows this list). And to keep system engineers on their toes. And as a chaos engineer, watch out for surprises:

Under precisely controlled experimental conditions, a test animal [read: complex system] will behave as it damn well pleases.

  • Hypothesis-Driven Testing: Chaos experiments are, consciously or not, based on hypotheses about how a system should behave during and after a problematic event. For example, a hypothesis might be that “the system should continue to function even if one of its database servers fails, restore functionality in less than an hour, and never again fail for the same reason.” Basic antifragility. A subtler hypothesis might be that “the system should be able to provide a p99 latency of 100 ms after less than five minutes of adaptation even if 5M users access the system simultaneously” (yes, stress testing is a sub-discipline of chaos engineering). The goal is to disprove these hypotheses since, contrary to popular belief, you can’t ever prove a statistical hypothesis. Here’s why: A statistical hypothesis H doesn’t state a fact [whatever that is] but assumes a probability distribution P governing your observations. If you reject H on the grounds of an observation even though P really is the correct distribution, you can at least compute the probability of that mistake using P—this is the famous type I error. If, on the other hand, P isn’t the true distribution, you have absolutely no clue which distribution is, so you can’t compute anything. Nothing. The best you can ever say is, “I could not reject the hypothesis.” This has deep and disturbing consequences, not least that you can never say a system is indeed antifragile. Without quite expecting it, you just stumbled headlong into the limits of human knowledge. To be on the safe side, expect your system not to be antifragile.

The truly pertinent question is: How does it [the new System] work when its components aren’t working well? How does it fail? How well does it run in Failure Mode?

  • Automated Testing: Don’t do it by hand. Chaos engineering experiments, after some initial fumbling, should be automated and randomized to ensure consistency, repeatability, traceability, and real-life-like variation at the same time. Notice also the word “experiments”: Chaos engineering can rightly be regarded as a part of natural science; we use the scientific method to make systems more robust and, ideally, more antifragile.
  • Gradual Introduction of Chaos: Chaos experiments are introduced gradually and cautiously because you don’t want to shoot down your fledgling system. Engineers start with minor disturbances and slowly increase the complexity and severity of the failures to avoid causing major disruptions. You don’t know what will break.

A System can fail in an infinite number of ways.

  • Monitoring and Observability: Effective monitoring and observability strategies are crucial for chaos engineering. Engineers collect and analyze data on how the system behaves during experiments to understand its strengths and weaknesses, that is, to test their hypotheses. Don’t rely on hearsay or apparent behavior. Systems are very good at pretending, and system engineers tend to become part of the system. Big, ambitious systems are the worst:

As the System becomes ever more highly specialized, the simplest tasks mysteriously become too difficult. […] As Systems grow in size and complexity, they tend to lose their basic functions.

  • Learning and Iteration: Chaos engineering is an iterative process. Engineers learn from each experiment and use the insights gained to improve the system’s resilience and antifragility. In effect, you’re actively hunting for problems. A bad day at work is a good day for the chaos engineer.

Cherish your bugs. Study them.

  • Safety Measures: Chaos engineering experiments must be conducted with safeguards in place to prevent catastrophic failures. Engineers need to have a clear rollback plan and the ability to abort experiments if they pose a significant risk to production systems. Although I put them last, ponder these things first.

When Big Systems fail, the failure is often big. […] This cautionary axiom is often overlooked or forgotten in the excited pursuit of grandiose goals by means of overblown Systems.
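To tie these principles together, here’s a minimal sketch of what a hypothesis-driven, gradually escalating chaos experiment might look like in code. Every name in it (steady_state_ok, inject_failure, rollback) is a hypothetical placeholder standing in for real infrastructure hooks, not an actual chaos-engineering library:

import random
import time

def steady_state_ok():
    # Stand-in for the steady-state hypothesis, e.g. "p99 latency stays under 100 ms".
    return random.random() > 0.05  # pretend the system is healthy 95% of the time

def inject_failure(severity):
    # Stand-in for a real fault injection (kill an instance, add latency, drop packets).
    print(f"injecting failure at severity {severity}")

def rollback():
    # Stand-in for undoing the fault and restoring the system.
    print("rolling back experiment")

def run_experiment(max_severity=5, observation_window=3):
    for severity in range(1, max_severity + 1):      # gradual introduction of chaos
        if not steady_state_ok():
            print("system already unhealthy; aborting before injecting anything")
            return
        inject_failure(severity)
        for _ in range(observation_window):          # monitoring and observability
            if not steady_state_ok():
                print(f"hypothesis falsified at severity {severity}")
                rollback()                           # safety measure: stop immediately
                return
            time.sleep(0.1)
        rollback()
        print(f"hypothesis survived severity {severity}")

run_experiment()

In a real setting, the three hooks would talk to your orchestration and monitoring stack, and the whole experiment would be scheduled, randomized, and logged automatically rather than run by hand.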

Chaos Primates

No treatise on chaos engineering is complete without at least mentioning the Netflix Chaos Monkey. Simply enough, this service randomly terminates instances in the cloud to induce engineers to care for proper resilience and fail-over. In effect, it programs people, not just machines. Some clever guys at Netflix even went one step further and came up with Chaos Kong, a hypertrophied version of Chaos Monkey and a funny allusion to the city-destroying giant ape “King Kong” of movie fame, which randomly switches off entire regions. You can imagine that recovery was a bit difficult the first time around.

While some give Netflix the credit for inventing chaos engineering, the practice has older roots, even in IT, where Apple started with a dumb-user imitation tool literally called “Monkey” as early as 1983. Good old reliability testing has been with us for decades as well, where new products are deliberately exposed to problematic temperatures, vibrations, shocks, humidity, power fluctuations, dirt, and people; see, for instance, “Practical Reliability Engineering” by Patrick D. T. O’Connor (Wiley 2002). The genius of Netflix was the introduction of chaos engineering in production. While chaos engineering during development improves products, chaos engineering in production also improves people. Again, it’s overinsurance for problems that are extremely likely to happen and, therefore, must be reckoned with by design. Chaos engineering in production works well in IT but not so well when it gets nuclear, chemical, or medical, to name a few.

Arguably, penetration testing and red team assessments can be seen as precursors of chaos engineering and are nowadays rightly treated as sub-categories of the same. A bit of social engineering goes a long way for the chaos engineer. Don’t do all the work yourself.

The Intelligence of Systems, the Stupidity of Us

To return to the beginning, be careful with complex systems, especially if the stakes are high. Use chaos engineering, but use it wisely so that it won’t get associated with those not-so-nice meanings of “chaos.”

A System, after all, is a partial intelligence; it participates in the great Mind of the Universe; and unless we ourselves have a direct pipeline into that Mind, we had jolly well better watch our step. Systems don’t appreciate being fiddled and diddled with. They will react to protect themselves; and the unwary intervenor may well experience an unexpected shock.

If you have ever tampered with complex systems of any size, you’ll know that stable configurations—attractor states where the system hovers around and can’t be easily disturbed—are very hard to achieve. On many occasions, we had better curb our enthusiasm and humbly accept things as they are without causing trouble we may not be able to contain, like the sorcerer’s apprentice of yore.

Is Chaos Engineering Necessary for Antifragility?

Short answer: No. Long answer: If you have managed to design an antifragile system (not that anyone knows for sure how to do that), by definition, you have a system that improves from exposure to the world all by itself, no chaos engineering required. The non-system (the system’s environment, which is the difference set between the cosmos and your system) already contains all thinkable and unthinkable ways to chaos-engineer, namely the hard knocks of life, as I have already hinted at in my elaboration of the need for spare capacity. In fact, what’s usually called chaos engineering is really just artificial chaos engineering, while natural chaos engineering happens unavoidably all along, albeit maybe slower and less evenly distributed. Artificial chaos engineering, then, may harmonize and speed up your system’s path to maturity—or it may tear it apart. Happy chaos-engineering!

Next Up: The Zen And Tao of Antifragility

In the last few paragraphs, my usual optimism gave way to darker shades of humility and realism in the face of our human limits. We just don’t handle complexity very well, we can’t predict, especially not the future, and we’re incredibly good at building systems that make things worse. If you want to minimize your contribution to your individual and our collective misery, stay tuned for the 6th and last installment of this series, where I’ll turn some of the most cherished but misguided beliefs about complex systems on their heads.


Hannes Rollin

Trained mathematician, renegade coder, eclectic philosopher, recreational social critic, and rugged enterprise architect.