Antifragile System Design 2: Redundancy And Spare Capacity

Hannes Rollin
8 min read · Oct 20, 2023

--

Redundancy is ambiguous because it seems like a waste if nothing unusual happens. Except that something unusual happens — usually.

Taleb

Yes, there still are limits (photo by Vitaly Taranov on Unsplash)

This is the second post in a series about antifragile system design. In the first post on optionality, I went through the drudge work of defining antifragility for the complete beginner. As a quick recap, antifragile systems are systems that, somehow, gain from volatility, disorder, and disruptions. It's a noble goal indeed. Since no one knows how to ensure antifragility (not even how to measure it), I set myself the slightly diminished task of listing the most important necessary conditions for antifragility. I conceded that any system can only be antifragile against events above a certain probability p, a property I've uncreatively dubbed p-antifragile.

I’d started with optionality because it’s one of the lesser-known but still central principles of antifragility, and it opens up a grand vista. Since optionality can easily be expanded to cover just about any system property, I gave a somewhat constrained definition: Optionality is the property of a system to react to potentially disruptive events in more than one way within the limits of the system’s design. If you haven’t done so already, it’s worth rereading this definition slowly to avoid missing the more profound implications.

Let’s look at redundancy and spare capacity, which are more straightforward concepts and nearer to our everyday conception of resilience and antifragility.

First, note that the two concepts are categorically different. While spare capacity means that you have leeway to handle uglier events than could be expected on average, redundancy means there’s more than one way to handle events should they become too many, too fast, or otherwise too problematic.

Spare capacity for up to 1,900 vehicles per hour per lane (photo by Connor McSheffrey on Unsplash)

The easiest way to envision the difference between the two concepts is by imagining a road system. Spare capacity means that roads are broader and have more lanes than is usually necessary, and redundancy means there’s more than one pathway between two important points.

There are many ways and modes to get from A to B—redundancy! (Photo by NASA on Unsplash)

Notice the filler words “usually” and “important.” In a limited world, spare capacity can’t be limitless, so capacity is provisioned to handle events only down to a certain probability p. The lower p, the more capacity C you need, with C → ∞ as p approaches zero. So, “usually necessary” is colloquial for “able to handle events up to some probability p.” Since, in practice, you don’t know the precise probability of possible events, least of all if the event has never happened before (as Taleb forcefully expounded), you have to help yourself by sorting events into more or less fine-grained bins, say “high probability,” “medium probability,” “low probability,” and “negligible probability.” Then, you provision capacity according to your budget. In this way, spare capacity must be handled like the overall concept of antifragility: you can’t have perfection, so you need a cutoff probability.
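
To make the cutoff idea concrete, here’s a minimal sketch in Python of provisioning capacity to a chosen probability p. The lognormal load model and every number in it are assumptions made up for illustration, not a recipe:

```python
# A minimal sketch of provisioning capacity to a cutoff probability p.
# The lognormal load model and its parameters are illustrative assumptions;
# a real system needs an empirically fitted load model.
from scipy.stats import lognorm

typical_load = 1000   # median requests per hour (hypothetical)
spread = 0.4          # shape parameter of the lognormal (hypothetical)
load_model = lognorm(s=spread, scale=typical_load)

for p in (0.1, 0.01, 0.001, 0.0001):
    # Capacity C(p): the (1 - p) quantile of the load distribution,
    # i.e. the load that is exceeded only with probability p.
    capacity = load_model.ppf(1 - p)
    print(f"p = {p:>6}: provision for ~{capacity:,.0f} requests/hour")
# The lower the cutoff p, the larger the required capacity C;
# C grows without bound as p approaches zero.
```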

The other filler was “important.” Important means that you have to choose which “trips” (events) are so valuable or fragile that they need extra safety. You can’t provide limitless redundancy for the same reason that you are limited in every endeavor where matter and energy are involved: There’s just not enough around. As an aside, redundancy means you have more than one way to handle events—alternate roads, alternate modes like bike paths and subways, alternate coping mechanisms like online ordering and remote working, etc. More than one way to do things; that’s optionality. Redundancy is a proper subset of optionality.

No Antifragility Without Spare Capacity

This is a deceptively obvious no-brainer that nevertheless reveals a few fundamental properties of dynamic systems. Imagine a complex system, say, an IT platform, that can handle exactly the estimated average load without any spare capacity. Experience tells us that this platform will overheat, crash, or crumble to dust sooner than you can spell “wobble.” But why?

Complex systems, such as transportation networks, electrical grids, or even digital networks like the Internet, exhibit fluctuations in load due to several inherent properties and external factors. There will never be a genuinely constant load.

Nonlinearity

This is one of the fundamental properties of complex systems. Basically, in complex systems, tiny changes in the inputs can produce vast effects on the outputs, a property Edward Lorenz called “sensitive dependence on initial conditions” in his useful 1993 book “The Essence of Chaos.” Still, you’ll surely agree that the term “butterfly effect” (inspired by Lorenz but not invented by him) sticks better. You could even go so far as to turn the tables and define complex systems as those that exhibit nonlinearity. There has never been a nerd who didn’t relish a piece of circular logic.
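
For the code-minded, here’s a toy demonstration of that sensitivity. It uses the logistic map in its chaotic regime rather than Lorenz’s weather model, and the starting values are arbitrary:

```python
# A toy illustration of sensitive dependence on initial conditions
# (the "butterfly effect") using the logistic map x -> r*x*(1-x) in its
# chaotic regime (r = 4). This is not Lorenz's weather model, just the idea.
def logistic_trajectory(x0, r=4.0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)   # baseline input
b = logistic_trajectory(0.200001)   # input perturbed by one part in 200,000

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: |difference| = {abs(a[step] - b[step]):.6f}")
# Within a few dozen iterations the two trajectories bear no resemblance
# to each other: a vanishingly small change in input, a vast change in output.
```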

Cascading Feedback

Another funny property of any complex system is the intricate interconnectedness and interdependence of parts, pathways, and agents within the system. Changes in one element cause changes in others, which affect the first element in turn, leading to sudden load spikes and (of course) nonlinear fluctuations. As an everyday extreme example, imagine a road network filled to capacity. Something as simple as a single fender-bender or even just a lane blocked by a police car leads to total gridlock in mere seconds.
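
If you want to watch that dynamic in numbers, here’s a toy queue, all figures made up, where a six-minute blockage on a near-capacity lane leaves a backlog that takes the better part of an hour to clear:

```python
# A toy queue showing why a system running near capacity gridlocks so fast:
# demand of 1,900 vehicles per hour on a lane that serves 2,000 per hour,
# with capacity halved for just six minutes (a fender-bender, say).
# All numbers are purely illustrative.
demand_per_min = 1900 / 60        # arrivals per minute
normal_capacity = 2000 / 60       # departures per minute when nothing is blocked
queue = 0.0                       # vehicles waiting

for minute in range(120):
    capacity = normal_capacity / 2 if 30 <= minute < 36 else normal_capacity
    queue = max(0.0, queue + demand_per_min - capacity)
    if minute % 15 == 0 or minute in (36, 45):
        print(f"minute {minute:3d}: ~{queue:3.0f} vehicles queued")
# The six-minute blockage builds a backlog of ~90 vehicles, and the thin
# spare capacity of 100 vehicles per hour needs almost an hour to drain it;
# the closer demand sits to capacity, the longer recovery takes.
```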

Emergent Behavior

Complex systems exhibit behavior that can’t be predicted from the properties of individual components. That is, they can’t be predicted easily by us humans. We’re not good with complexity and emergent behavior; otherwise, I wouldn’t bother writing this. You can study water vapor, air, and sunlight all your life and never even once envision a tornado if you’ve never learned about one. Therefore, look out for emergence. It need not be catastrophic—countless instances of emergent order and de-growth exist.

Adaptive Agents

In many complex systems, the components or agents in the system learn and adapt over time, changing their behavior based on past experiences. This is clearly so if humans, butterflies, and other intelligent beings have their hands in the game, but it’s also increasingly the case with IT systems that are equipped with appropriately so-called machine learning capabilities, where responses to events are learned rather than hardwired. Such adaptations, of course, change system dynamics and produce fluctuations.

Asymmetries & Entropy

No complex system is perfectly symmetrical, and asymmetry means there are imbalances, bottlenecks, and breaking points prone to producing problematic events. Moreover, it’s not just physical things that have the uncanny tendency to crumble to dust. Roads decay in weird spots. Electric circuitry goes up in smoke. Water finds a way where it’s absolutely not wanted. Diligent engineers get bored. Alarms aren’t triggered, or are ignored, or haven’t even been set. A situation develops.

External Factors

I don’t know what randomness is. But I do know that no system, complex or not, lives in blissful isolation. There will be technical, social, economic, ecological, or just plain strange events unpredicted by the system’s designers that still affect the system—and produce fluctuations. Reality comes knocking.

Like it or not, life means variation, fluctuation, and volatility. Fortunately, the sum of many compounded random influences tends toward a normal distribution (thank the central limit theorem), which lends itself to a nice and clean mathematical analysis. Have a look at the Three-Sigma-Rule, a simple piece of practical lore published back in 1994 by my great statistics professor, Friedrich Pukelsheim. Simply, it states that for an approximately normally distributed random variable, about 68% of all values lie within one standard deviation of the mean (the usual sign for standard deviation being σ, the small Greek letter sigma), about 95% lie within two standard deviations, and more than 99.7% lie within three.

Here’s a simplistic, illustrative example: Say you have an average of 1000 cars passing your house every rush hour, the number of cars being approximately normally distributed with σ = 100. Then, by virtue of the Three-Sigma-Rule, you can conclude that in 99.7% of all rush hours, between 700 and 1300 cars will pass your house.
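
If you’d rather trust simulation than lore, a few lines of Python with the same assumed Normal(1000, 100) model reproduce those percentages:

```python
# Quick empirical check of the three-sigma example: rush-hour traffic
# modeled as Normal(mean=1000, sigma=100), as assumed in the text.
import numpy as np

rng = np.random.default_rng(42)
cars = rng.normal(loc=1000, scale=100, size=100_000)   # simulated rush hours

for k in (1, 2, 3):
    within = np.mean(np.abs(cars - 1000) <= k * 100)
    print(f"within {k} sigma (1000 ± {k * 100} cars): {within:.1%}")
# Expect roughly 68.3%, 95.4%, and 99.7%, matching the Three-Sigma-Rule;
# provisioning for 1300 cars then covers all but about 0.15% of rush hours
# on the high side.
```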

So, things will go up and down. And now you know just about how much: Designing spare capacity for three sigmas puts you well on the safe side. Please note that I’m ignoring upward or downward trends; these must be handled evolutionarily. Stay tuned…

Types of Redundancy

No, I won’t bore you with a complete annotated list of redundancy types, though you can rest assured that there are many: hardware redundancy, software redundancy, information redundancy, time redundancy, spatial redundancy, modular redundancy, standby redundancy, and so on. As an expert systems designer, you already know your way around those; otherwise, ask the chatbot of your preference.

Just keep in mind that “you are redundant,” in antifragile system design parlance, doesn’t mean you’re about to be fired. Rather, it means that while it may seem like a waste to keep you in easy times, you’re necessary for survival and evolution when push comes to shove. And easy times, as we’ve seen in the discussion of nonlinearity and emergent behavior, have a limited shelf life.

You can view redundancy like insurance. The higher the risk—as usual, the product of potential damage and event probability—the more insurance you’d want. If you’re able to estimate risk in terms of dollars and euros, you can then easily justify investment in redundancy if your “insurance” is significantly cheaper than the risk it prevents or mitigates. This is but dry language for adding layers of fat to cope with lean times. But not much more.
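
Here’s that insurance logic as a back-of-the-envelope sketch; every figure in it is hypothetical and only there to show the comparison:

```python
# Back-of-the-envelope "insurance" math for redundancy.
# Every figure here is hypothetical; plug in your own damage estimates
# and event probabilities.
outage_damage = 500_000             # cost of one full outage of subsystem S (EUR)
p_outage_per_year = 0.05            # estimated yearly probability of that outage
redundancy_cost_per_year = 12_000   # hypothetical standby replica: hosting + upkeep
residual_p_per_year = 0.002         # probability that primary and standby fail together

risk_without = p_outage_per_year * outage_damage   # expected annual loss, no redundancy
risk_with = residual_p_per_year * outage_damage    # expected annual loss, with redundancy

print(f"expected annual loss without redundancy: {risk_without:>9,.0f} EUR")
print(f"expected annual loss with redundancy:    {risk_with:>9,.0f} EUR")
print(f"annual cost of the redundancy itself:    {redundancy_cost_per_year:>9,.0f} EUR")
# Here the "insurance" costs 12,000 EUR per year and removes 24,000 EUR
# of expected annual loss, so the investment is easy to justify.
```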

No Antifragility Without Redundancy

Well, this really is a one-liner and only a bit longer when spelled out in plain English: If your system has a non-redundant subsystem S* that can be crashed or degraded by an event with probability p (an event that, by definition, wouldn’t be a problem had you provided redundancy measures for temporarily replacing S*), then your system isn’t p-antifragile. Therefore, redundancy is necessary for antifragility.

I could have spared myself the hassle since, as I said, redundancy is a proper subset of optionality, and I have already shown that optionality is necessary for antifragility. But I believe redundancy is so important that it deserves special treatment.

Next Up: Evolutionary Design

Optionality, redundancy, and spare capacity naturally lead to evolutionary design. Here’s another quote by the inimitable Taleb:

Nature likes to overinsure itself. Layers of redundancy are the central risk management property of natural systems.

Overinsurance is the key to evolution. We can even go back to good old Darwin and observe that selection, natural or otherwise, necessitates broad variation. There must be different things to choose from if we want evolving systems—there must be optionality.

--

Hannes Rollin

Trained mathematician, renegade coder, eclectic philosopher, recreational social critic, and rugged enterprise architect.