Antifragility 101


Antifragility.

If you’ve been wondering what this term means, you’re not alone. To be honest, I also had a hard time understanding the concept of antifragility, in particular how it compares to resilience. Fortunately, I know that writing, and the research that goes along with it, is perfect for both gaining and sharing knowledge, so I put this article together.

Resilience by Example

To understand antifragility, I think it’s helpful to understand resilience first. Here’s one definition of resilience (there are different definitions, but let’s stick with this one for antifragility to make any sense):

A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.
 — Erik Hollnagel

Resilience is something a system does, not something a system has.

There are many well-known patterns to build web systems that are resilient to certain kinds of failures. We use auto scaling groups, for example, for clusters to maintain a minimum number of EC2 instances (server capacity) to absorb disturbances and continue serving user requests. If an auto scaling group considers an instance to be unhealthy, it will automatically terminate that instance and launch a replacement. The infrastructure can “heal itself” by recovering from (some but not all) failures. That’s one example of resilience as defined above.
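To make that self-healing loop concrete, here is a toy Python sketch of the reconciliation an auto scaling group performs. It calls no real AWS APIs; the class name and instance IDs are invented for illustration:

```python
import itertools

class ToyAutoScalingGroup:
    """Toy model of an auto scaling group's health-check loop (illustrative only)."""

    def __init__(self, min_size):
        self.min_size = min_size
        self._ids = itertools.count(1)
        self.instances = {}
        for _ in range(min_size):
            self.instances[self._launch()] = "healthy"

    def _launch(self):
        return f"i-{next(self._ids):04d}"

    def mark_unhealthy(self, instance_id):
        self.instances[instance_id] = "unhealthy"

    def reconcile(self):
        # Terminate anything unhealthy, then launch replacements
        # until capacity is back at min_size.
        for iid, state in list(self.instances.items()):
            if state == "unhealthy":
                del self.instances[iid]
        while len(self.instances) < self.min_size:
            self.instances[self._launch()] = "healthy"

asg = ToyAutoScalingGroup(min_size=3)
victim = next(iter(asg.instances))
asg.mark_unhealthy(victim)
asg.reconcile()
print(sorted(asg.instances))  # → ['i-0002', 'i-0003', 'i-0004']
```

Note that this is resilience, not antifragility: after reconcile(), capacity is back to where it was, but the system is no better than it was before the failure.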

With that out of the way, let’s take a first look at antifragility and how it’s different from resilience.

The Antifragile Gets Better

The Wikipedia page on Antifragility is a good start if you want to learn the very basics. Here are some of the more interesting bits (emphasis mine):

  • “Antifragility is a property of systems that increase in capability, resilience, or robustness as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. It is a concept developed by Professor Nassim Nicholas Taleb in his book, Antifragile and in technical papers.”
  • “Antifragility is fundamentally different from the concepts of resiliency (i.e. the ability to recover from failure) and robustness (that is, the ability to resist failure).”
  • Taleb explains the differences this way: “Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

In the auto scaling example above, the system is able to restore the desired capacity shortly after losing a server instance — it resists the shock and stays the same.

So far so good, but how, as Taleb claims, do antifragile systems get better? In other words, how do they benefit from disturbances?

One of the few practical examples of antifragility I understood intuitively is related to our body. The human body is an antifragile system because it gets better — faster and stronger — through physical training. It adapts to the stress of exercise with increased fitness, provided the stress is above a certain threshold but not excessive (a process called adaptation).

While that example is relatable, I still had trouble applying the idea of antifragility to areas like software development and web operations.

Potential Downside < Potential Upside

Eager to learn more about antifragility in the context of DevOps, I read Antifragile Systems and Teams by Dave Zwieback. This short report turned out to be the best summary of the topic I’ve seen so far. I highly recommend reading it (especially if you can’t stand Taleb’s writing style).

“The main property of antifragile systems”, Dave writes, “is that the potential downside due to stress (and its retinue) is lower than the potential upside, up to a point.”

Some examples from the report:

  • “Vaccination makes a population antifragile because the downside (a small number of individuals having negative side effects) is small in comparison to the upside (an entire population gaining immunity to a disease).”
  • “[With BitTorrent] the more our file is requested, the more robust to failure and available it becomes because parts of it are stored on a progressively larger number of computers. […] our cost of distributing this file would remain constant — not so for the cost of making systems more robust to anticipate higher demand or improve resiliency.”
  • “[The potential downside of frequent deployments] is smaller than the potential upside. […] customers receive higher-quality products and services (i.e., value) faster and at a lower cost than is possible with traditional, risk- and volatility-averse approaches.” (Put another way: If it hurts, do it more often.)
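One crude way to make Dave’s inequality tangible is to compare probability-weighted downside and upside. The numbers below are invented purely for illustration (they come from me, not from the report):

```python
# Toy expected-value check of "potential downside < potential upside".
# All probabilities and costs are made up for illustration only.

def expected_net_benefit(p_bad, downside, upside):
    """Expected net benefit of accepting a stressor."""
    return (1 - p_bad) * upside - p_bad * downside

# Frequent small deployments: failures are more likely but cheap.
frequent = expected_net_benefit(p_bad=0.10, downside=1.0, upside=0.5)

# Rare big-bang releases: failures are rarer but far more costly.
big_bang = expected_net_benefit(p_bad=0.02, downside=50.0, upside=0.5)

print(frequent > big_bang)  # → True
```

With these (invented) numbers, the frequent-deployment strategy wins even though it fails more often, because each failure is small. That is the "if it hurts, do it more often" intuition in arithmetic form.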

Dave goes on to show how the layers of DevOps (culture, automation, measurement, and sharing) can contribute to the antifragility of organizations, concluding that there’s “significant overlap in practices of DevOps organizations and those that seek the benefits of antifragility”. After all, “DevOps embraces and makes use of disorder, randomness, and, ultimately, impermanence”.

Chaos Engineering

Speaking of embracing impermanence: if it’s possible for systems to benefit from shocks — to become more robust as a result — the idea of injecting faults on purpose suddenly doesn’t sound so crazy anymore, right?

In fact, there’s a discipline called Chaos Engineering centered around this idea. Michael Nygard, the author of Release It!, put it well in this comment:

Chaos engineering is a technique to create antifragility. That is, if you evolve toward systems that survive that kind of chaos, then your systems will exhibit antifragility.
However, one caveat: antifragility is not a universal or omnidimensional characteristic. Chaos engineering causes your system to evolve toward antifragility toward those kind of stresses.

Antifragile systems might benefit from variability, but not from just any variability. A system can’t be universally antifragile, just as it can’t resist every kind of failure.

Example: Chaos monkey kills EC2 instances. In response, you build autoscaling, masterless clusters. That helps when machines die, but not when whole regions die. Or when DNS fails. Or when data gets corrupted. Or when the marketplace changes. Etc.

The potential downside of Chaos Engineering (occasional service interruptions) is smaller than the potential upside (better overall customer experience), up to a point (experiments causing severe damage that affect customers). While I don’t believe that web infrastructure itself can be antifragile (I might be wrong), it seems plausible to say that Chaos Engineering creates antifragility by enabling teams to improve their infrastructure through experimentation.
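The caveats above can be sketched as a toy chaos experiment: inject one failure at a time, verify the system heals, and abort if the blast radius ever exceeds a limit (the "up to a point"). Nothing here touches real infrastructure; all names and numbers are made up:

```python
import random

def heal(cluster, target_size, counter):
    """Toy self-healing: launch replacement nodes until capacity is restored."""
    while len(cluster) < target_size:
        counter[0] += 1
        cluster.add(f"i-{counter[0]:04d}")

def chaos_experiment(target_size=5, rounds=3, max_down=1, seed=42):
    """Repeatedly kill one random node, then check that the system recovers.

    Aborts as soon as more nodes are down at once than max_down allows,
    limiting the blast radius of the experiment.
    """
    rng = random.Random(seed)
    counter = [0]
    cluster = set()
    heal(cluster, target_size, counter)
    findings = []
    for round_no in range(1, rounds + 1):
        victim = rng.choice(sorted(cluster))
        cluster.discard(victim)              # inject the fault
        down = target_size - len(cluster)
        if down > max_down:                  # blast-radius guard
            findings.append((round_no, "aborted"))
            break
        heal(cluster, target_size, counter)  # the system self-heals
        findings.append((round_no, "recovered"))
    return findings

print(chaos_experiment())
```

The learning happens outside the loop: every "recovered" or "aborted" finding tells the team something about their system, and acting on those findings is what makes the organization, if not the infrastructure, better over time.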


So, is antifragility a useful concept? I honestly still don’t know what to think of it. Writing this article led to some insightful conversations that made me question most of what I thought I knew about resilience. Among other things, I learned that antifragility might be superfluous depending on which definition of resilience you use. I therefore almost decided against publishing. However, I also realized that I’m still learning, and that this piece is part of my journey. I promise it won’t be my last take on the topic.

P.S. This article first appeared on my Production Ready mailing list.