The Awesome Math of Data Mesh Observability: Antifragility

Hannes Rollin
9 min read · Jul 24, 2023


Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty.

—Nassim Nicholas Taleb, Antifragile

Antifragility: Thriving under variability and stress, the property of life

This is the third post in a loosely connected series of articles on the mathematical armory that can be brought to bear in data mesh observability. As a quick recap, let me bring you up to speed:

Data mesh is a relatively new decentralized data management paradigm for handling analytical and research data. It promises to avoid some of the worst conundrums of heavyweight data warehouse and data lake initiatives, which have annoyed us long enough with incredibly high disappointment scores both during implementation and in production. The key to understanding data mesh lies in the application of distributed software architectures to the data realm — domain ownership, data productization, decentralized governance of domain representatives, and a central data platform with DataOps and policy automation under the hood are the four pillars of data mesh.

Observability is a service management paradigm borrowed from good old control theory and bent out of shape to fit the needs of complex software systems. The demand for observability is still growing, as modern distributed systems are largely intractable using merely old-fashioned logging and monitoring, and this is especially true for data mesh. While conventional monitoring gathers log data, metrics, and more in an ad-hoc fashion, trying to understand the big picture bottom-up, observability works top-down: it (1) creates models of assumed system behavior, (2) defines general situations (think disease diagnoses), and (3) only then decides which signals (logs, metrics, events, and traces) to collect in order to correctly identify those situations. Conventional monitoring leads to a sprawling proliferation of log data and a decline in clear understanding as the system grows, often to the point where no one understands certain erratic, epiphenomenal system behavior. A well-executed observability approach, by contrast, leads to continued refinement of model formulation, situational awareness, and signal collection.
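To make the top-down idea concrete, here is a minimal sketch of such a situation catalog in Python. All names, signals, and thresholds are invented for illustration; the point is that signal collection is derived from the situations you want to recognize, not the other way around.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical situation catalog: each situation (think disease diagnosis)
# declares up front which signals are needed to recognize it.
@dataclass
class Situation:
    name: str
    required_signals: list[str]
    detect: Callable[[dict], bool]  # predicate over collected signal values

def diagnose(signals: dict, catalog: list[Situation]) -> list[str]:
    """Return all situations whose required signals are present and whose
    detection predicate fires. Signals that no situation requires need never
    be collected at all; that is the top-down economy of observability."""
    return [s.name for s in catalog
            if all(k in signals for k in s.required_signals) and s.detect(signals)]

# Example: recognize a "portal usage decline" situation from two signals.
catalog = [Situation(
    name="portal usage decline",
    required_signals=["weekly_active_users", "p95_latency_ms"],
    detect=lambda sig: sig["weekly_active_users"] < 500 and sig["p95_latency_ms"] > 2000,
)]

print(diagnose({"weekly_active_users": 420, "p95_latency_ms": 3100}, catalog))
# -> ['portal usage decline']
```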

Let’s assume you have rigged your data mesh observability so that you correctly recognized a significant, unwholesome situation—say, data portal usage decline—and even pinpointed the likely underlying cause, say, performance issues under load that result in intolerable latency spikes and frustrate users. Are you done yet? Of course not. You need to fix the mesh, and if you want to do an outstanding job, the mesh should not just recover from the situation but emerge stronger than it was before. This is the essence of what the great Nassim Nicholas Taleb, the enfant terrible of the risk research community, dubbed antifragility.

What is Antifragility?

Antifragility is a property of systems that increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures. Note the operative word “thrive”: Antifragility is not just robustness—the ability to withstand—but rather the ability to get better under stress. It’s not, alas, how we create our built environment, but it’s a central property of living organisms, ecosystems, and healthy cultures.

Antifragility is more than mere robustness—it’s a property of life

If you look at the properties of antifragility, you see characteristics that are desirable in the context of artificial complex systems like data mesh. Volatility, stress, and shock are unavoidable and generally on the rise in these times. It’s undeniably favorable to have an antifragile data mesh. But: How can we achieve antifragility? And how can we measure it?

The Antifragile Data Mesh

Redundancy is ambiguous because it seems like a waste if nothing unusual happens. Except that something unusual happens — usually.

—N. N. Taleb, Antifragile

Here are a few ways to increase the antifragility of your data mesh:

  1. Variable Decentralization: The data mesh approach inherently decentralizes data ownership and responsibility across teams, making it more robust and less prone to single points of failure. This is in line with antifragility, as it allows for small failures without causing system-wide problems. Nevertheless, robustness is not enough. The mesh must be set up so that problems due to too much decentralization (or too much centralization) self-correct or, at least, can be recognized and corrected by the governance guild.
  2. Fail early, fail often: In a system designed with antifragility in mind, failures of a certain kind are not only expected but are also welcomed as opportunities for learning and growth. By allowing for failures in a controlled environment and by learning from them and reacting to them, the data mesh can become stronger and more resilient.
  3. Overcompensation: Similar to how the human body develops resistance to a virus after exposure, antifragile systems overcompensate after stress and failures. Applying this to data mesh means developing extra resiliency, redundancy, or capacity when a weak point is discovered.
  4. Optionality: Data mesh gives you options for where to store and process data. By not committing all resources to a single plan or architecture, you are better prepared to adapt to unforeseen circumstances. Antifragile systems are always characterized by optionality—there are many ways to reach a goal or solve a problem. Build your mesh this way.
  5. Hormesis: Applying the idea of hormesis, or systems getting stronger with exposure to shocks, to data mesh means intentionally introducing errors or stresses to the system to identify weak points and improve them. And to keep data developers and data reliability engineers on their toes. This practice has been masterfully pioneered by Netflix’s Chaos Monkey (see the sketch after this list).
  6. Redundancy and Layering: By ensuring redundancy in data sources and portal services, the mesh architecture becomes more resilient to shocks and stresses. This is particularly important in observability, where missing data can lead to blind spots in system understanding. Redundancy is crucial to antifragility: If a sudden spike in data demand brings some service to its knees, there must be other instances, cold or hot, that can scale up to meet and—this is the point—exceed the new demand.
  7. Modularity: Antifragility implies modularity, where individual parts can fail without collapsing the entire system. While not sufficient, loosely coupled modularity is a necessary condition. In the context of data mesh, this means designing the system such that each data domain can stand alone and operate independently of others.
  8. Leveraging Disorder: In an antifragile system, disorder should lead to stronger overall system health, up to a point. In a data mesh, this means using anomalies or unusual states in the data or system behavior to drive improvements and innovation. The precondition, naturally, is to actively look out for anomalies as part of your observability approach.
  9. Distributed Decision Making: Antifragility promotes the idea that decisions made at a local level, closer to where the deep knowledge is, tend to be more robust. In data mesh, this translates to empowering domain teams to make decisions based on their intimate knowledge of the data rather than having all decisions centralized. Dehghani’s federated governance is inherently antifragile.
  10. Evolutionary Approach: In line with the principles of antifragility, data mesh architecture should evolve over time in response to changing environments and new needs rather than adhering strictly to an initial design. This allows the system to respond dynamically to stresses and challenges, thus improving its resilience. As always in large systems, especially large complex systems: Start small and keep tinkering. While there are probably fixed rules that govern the system, they aren’t usually clear in the beginning. And they change.
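Here is what a minimal hormesis experiment might look like in Python, in the spirit of Chaos Monkey but not using any actual Netflix tooling. The failure rate, the delay bound, and the query_data_product stand-in are all invented for illustration.

```python
import functools
import random
import time

def chaos(failure_rate: float = 0.05, max_delay_s: float = 0.5):
    """Wrap a callable with hormetic stress: occasionally inject a failure
    or some latency so that weak points surface under controlled conditions.
    Only enable this where the blast radius is acceptable."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
            time.sleep(random.uniform(0.0, max_delay_s))  # injected jitter
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.1, max_delay_s=0.2)
def query_data_product(product_id: str) -> dict:
    # Stand-in for a real call against a data product's output port.
    return {"product": product_id, "rows": 1000}
```

The decorator form matters: stress injection stays a cross-cutting concern you can switch on per environment, rather than something scattered through domain code.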

Antifragility Metrics and Heuristics

Antifragility is a much less researched property than significance and even causality. Nevertheless, I’d like to propose three heuristic metrics for measuring antifragility that can be applied to data mesh.

Measuring antifragility involves assessing how a system responds to stress, volatility, and shocks. In more mathematical terms, this amounts to understanding the system’s non-linear response to perturbations. Have a look at Taleb’s Philosophical Notebook to see why non-linearity is essential.

Convexity

In mathematical terms, Taleb uses the concept of convexity to describe antifragility (and concavity, or negative convexity, to describe fragility). A system is antifragile if it benefits more from random fluctuations (in terms of magnitude) than it loses from them; more formally, if it has a convex response to stressors. For instance, suppose a system’s performance is quantified by some metric P. If the improvement in P grows faster than linearly with the stressor level S, we might say that the system shows antifragility. Here’s an idealized chart, ignoring the stress reaction period where performance temporarily drops.

A convex stress reaction is antifragile: The system gets not just better, but better-better

Imagine a data mesh system where we measure performance by the time to query certain data (lower is better). At rest (S = 0), it might take 100 units of time. Under stress (S = 1), it might initially take 150 units of time. However, after the system adapts to stress (e.g., by optimizing data structures, rerouting data pipelines, and scaling up the Kubernetes cluster), the time might drop to 80 units. This disproportionate improvement signifies a convex response and, therefore, antifragility.
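A minimal numeric sketch of this convexity heuristic, with invented stress levels and timings in the spirit of the example above: we turn post-adaptation performance into a gain relative to rest and check whether that gain accelerates (positive second differences) as stress grows.

```python
import numpy as np

def convex_gain(stress_levels, perf_after_adaptation, lower_is_better=True):
    """Heuristic convexity check: does the performance gain grow faster than
    linearly with the stressor level? Expects evenly spaced stress levels and
    post-adaptation measurements. Positive second differences of the gain
    suggest a convex, antifragile stress response."""
    s = np.asarray(stress_levels, dtype=float)
    assert np.allclose(np.diff(s, n=2), 0), "expects evenly spaced stress levels"
    p = np.asarray(perf_after_adaptation, dtype=float)
    gain = (p[0] - p) if lower_is_better else (p - p[0])  # benefit vs. rest
    second_diff = np.diff(gain, n=2)
    return bool(np.all(second_diff > 0)), second_diff

# Post-adaptation query times (lower is better) at stress levels S = 0..3.
ok, d2 = convex_gain([0, 1, 2, 3], [100, 95, 80, 55])
print(ok, d2)  # True [10. 10.] -> gains accelerate: a convex response
```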

Note that antifragile stress recovery and adaptation are allowed to take some time; analogously, your athletic abilities don’t grow during or immediately after training. Here’s a generic stress-recovery-adaptation curve that shows the antifragility of our bodies:

Each of us is antifragile; we evolved that way (source)

Volatility and Variance

Antifragile systems benefit from variability rather than suffer from it. So, an increase in system performance (like throughput or efficiency) with increasing input variability is an indicator of antifragility. Suppose we have an input variable V that is stochastic. An antifragile system shows improved performance as the variance σ² of V increases.

Improved performance under increasing variability is a core property of antifragility

For example, suppose V represents the rate of data input into a data mesh system, and we measure the system’s performance P by the consistency of its data output. If P increases (that is, if consistency improves) as we increase the variability of V, the data input rate, this suggests that the system is antifragile.
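Here is one way such a check might look in Python, assuming scipy is available; the rolling window, the synthetic input rate, and the consistency metric are all invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def variability_benefit(input_rate, performance, window=24):
    """Rank-correlate the rolling variance of an input with a performance
    metric (higher is better). A clearly positive correlation is a heuristic
    indicator that the system benefits from variability."""
    v = np.asarray(input_rate, dtype=float)
    p = np.asarray(performance, dtype=float)
    rolling_var = np.array([v[i:i + window].var()
                            for i in range(len(v) - window + 1)])
    aligned_perf = p[window - 1:]  # align performance with each window's end
    return spearmanr(rolling_var, aligned_perf)

# Synthetic illustration: input volatility ramps up over time, and the
# consistency metric (higher is better) mildly improves along with it.
rng = np.random.default_rng(42)
sigma = np.linspace(5, 25, 500)                   # rising input volatility
rate = rng.normal(100, sigma)                     # stochastic input rate V
consistency = 0.8 + 0.005 * sigma + rng.normal(0, 0.02, 500)
rho, pval = variability_benefit(rate, consistency)
print(f"rho = {rho:.2f}, p = {pval:.3g}")  # positive rho -> antifragile hint
```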

Response to Radical Change

An antifragile system improves its function or efficiency in response to shocks and sudden stressors. If applying such a stressor to a system results in a subsequent improvement in its operation or function, it exhibits antifragility. We represent this mathematically by introducing a stressor S to the system and monitoring the system’s performance metric P before (P1) and after (P2) the stressor is introduced. If P2 > P1, then the system has demonstrated an antifragile characteristic. Since all metrics fluctuate in reality, a simple comparison can be misleading; it’s often better to use a one-sided t-test to check whether the performance did indeed increase.

Not just back to business but stronger than ever—antifragile shock reaction

For instance, a data mesh system could face a stressor like a sudden surge in data volume. If the system’s response time (our performance metric, P) improves after facing this stressor and adjusting to it, we say that the system demonstrates antifragility.
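A minimal sketch of that one-sided test, assuming scipy; the before/after samples are invented response times.

```python
from scipy.stats import ttest_ind

def improved_after_shock(before, after, alpha=0.05, higher_is_better=True):
    """One-sided Welch t-test: did mean performance genuinely improve after
    the stressor, beyond ordinary fluctuation? Returns (improved, p_value)."""
    # If higher is better, improvement means mean(before) < mean(after).
    alternative = "less" if higher_is_better else "greater"
    _, p = ttest_ind(before, after, equal_var=False, alternative=alternative)
    return p < alpha, p

# Invented response-time samples (lower is better) before and after a surge.
before = [105, 98, 110, 102, 99, 107, 103]
after = [88, 92, 85, 90, 87, 91, 89]
print(improved_after_shock(before, after, higher_is_better=False))
# -> (True, <small p-value>)
```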

There’s one big drawback to antifragile system design: Antifragility comes at a price. Even more than mere robustness, you need not just fail-safe mechanisms and layers of redundancy but numerous self-correcting and self-improving feedback loops, both technical and organizational, and these don’t come cheap. Still, contrast the cost of antifragility with that of a system that degrades over time and collapses under pressure. It’s no wonder that all living systems have evolved remarkable degrees of antifragility.

Jensen’s Inequality

Those who want to dive even deeper should consider Jensen’s Inequality, a fundamental theorem of convex analysis that Taleb uses to describe the core property of antifragile systems.

The inequality states that for a convex function f and a random variable X, the expectation of the function E[f(X)] is always greater than or equal to the function of the expectation f(E[X]); in mathematical terms: E[f(X)] ≥ f(E[X])

To translate this in terms of antifragility, let’s consider a system’s response to a stressor as a function f, and the stressor itself as a random variable X. For an antifragile system, the expected outcome E[f(X)] from a variety of stressors is better than the outcome from the average or expected stressor, f(E[X]). In a nutshell: Antifragile systems have convex stress reactions.
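A quick Monte Carlo sketch makes the inequality tangible; f(x) = x² is just a stand-in convex response, and the uniform stressor distribution is an arbitrary choice.

```python
import numpy as np

# Monte Carlo illustration of Jensen's inequality for a convex response:
# f models an antifragile stress reaction, X a random stressor.
def f(x):
    return x ** 2  # stand-in convex function

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100_000)   # a variety of random stressors

print(f"E[f(X)] = {f(x).mean():.1f}")  # ~33.3: outcome over varied stressors
print(f"f(E[X]) = {f(x.mean()):.1f}")  # ~25.0: outcome of the average stressor
# E[f(X)] >= f(E[X]): the varied regime beats the averaged one.
```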

In simpler terms, an antifragile system benefits from variability and uncertainty. It would rather face a variety of challenges and learn from each one (which might be small failures or stressors that don’t threaten the system as a whole) than face a constant, unchanging environment, even if that environment is benign.

This is a property we must accept in ourselves and would love to incorporate in our data meshes: since we simply can’t guarantee unchanging environments, we might as well use outside trouble as a driving force. Remember: Trouble means energy.

The world is full of power and energy and a person can go far
by just skimming off a tiny bit of it.

—Neal Stephenson, Snow Crash

