The Awesome Math of Data Mesh Observability: Causality
Causality brings with it a philosophy, a language, a syntax, and
it imposes a discipline.
—Judea Pearl
This is the second post in a loosely connected series of articles on the mathematical armory that can be brought to bear on data mesh observability. As a quick recap, let me bring you up to speed:
Data mesh is a relatively new decentralized data management paradigm for handling analytical and research data that promises to avoid some of the worst conundrums of heavyweight data warehouse and data lake initiatives, which have annoyed us long enough with incredibly high disappointment scores both during implementation and in production. The key to understanding data mesh lies in the application of distributed software architectures to the data realm: domain ownership, data productization, federated governance by domain representatives, and a central data platform with DataOps and policy automation under the hood are the four pillars of data mesh.
Observability is a service management paradigm borrowed from good old control theory and bent out of shape to fit the needs of complex software systems. The demand for observability is still growing, as modern distributed systems are largely intractable using merely old-fashioned logging and monitoring, and this is especially true for data mesh. The elevator pitch for observability goes like this:
The usual way to service complex systems follows a vicious cycle—an error, outage, or performance issue occurs, followed by a quick symptom fix, followed by an even more complex and intractable system that exhibits even more unforeseen issues that are followed by more symptom fixes, etc. Observability, in contrast, begins by shaping an abstract model of how the system works as a whole, then describes situations (big-picture states of the system that are of interest to the SRE and system owners), and then collects signals (events, logs, metrics, and traces) that are necessary and sufficient to diagnose the defined situations. When there are issues, it’s not just the system that gets repaired, but also the model, situations, and signals are updated.
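To make that a little more concrete, here is a minimal, purely illustrative sketch in Python. The Situation and Signals names are my own invention, not part of any standard observability API; the point is simply that a situation is a named predicate over the signals you collect.
from dataclasses import dataclass
from typing import Callable, Dict

Signals = Dict[str, float]  # signals distilled from events, logs, metrics, traces

@dataclass
class Situation:
    name: str
    predicate: Callable[[Signals], bool]  # diagnoses the situation from signals

# hypothetical situation: continuous usage decline across the mesh
usage_decline = Situation(
    name="usage_decline",
    predicate=lambda s: s["weekly_usage_trend"] < -0.05,  # more than 5% drop per week
)

signals = {"weekly_usage_trend": -0.08}  # collected by the platform
if usage_decline.predicate(signals):
    print("Situation detected:", usage_decline.name)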
As I have argued and exemplified at length in the previous post of this series, there are numerous nifty statistical tools to check whether a given signal or combination of signals is significant to diagnose a certain situation. But what comes next? Say your signal evaluation algorithm has recognized a specific situation; you have what’s called “situational awareness.” Is this enough?
Probably not. In addition to finding out what’s the matter with the patient, you need to find out, in the next step, what caused the situation. You’re interested in the why; you’re looking for causality.
Causality
What is causality? I’m aware that there are several deeply technical and contested discussions in philosophy, psychology, and physics out there, but strangely and thankfully, for practical purposes, a pragmatic definition works best.
Let’s start with a few general observations:
A cause comes before the effect
Although there are somewhat confusing ideas to the contrary, we assume, for the sake of our sanity, a continuously progressing arrow of time. Every investigation into causality, consequently, must look at data from at least two different points in time. In other words, a single observation is insufficient to determine causality, and past events can’t be caused by future events (causation by future ends is, roughly speaking, what teleology posits).
A cause is necessary for the effect
If some event Y would have happened even if an alleged cause X hadn’t happened, X can’t be said to be a cause of Y, even if X happened before Y and there’s a high correlation between X and Y. For instance, this morning, a crow cawed outside my window, then I had my coffee. Although the crow cawed first, the coffee followed, and such a combination often happens in the morning, I would have had my coffee even if said crow had been unavailable for the scheduled cawing.
Technically, you could even define X as a cause of Y if Y wouldn’t have happened had X not happened. This funny construct is known as a “counterfactual,” and it’s fraught with difficulties. Imagine you’re forced to be part of a ten-strong firing squad. The poor victim soon dies of ten gunshot wounds, each of them fatal all by itself. According to the counterfactual definition of causality, you have not caused the death, since the victim would have died anyway.
Causes can compound
There are three major classes of compound causes: Overdetermination, multicausality, and causal networks.
- Overdetermination: Consider again the firing squad. If an event E is caused by more than one preceding event, each of them already sufficient to cause E, event E is overdetermined. In this case, each of the events leading to E is “a” cause.
- Multicausality: Many events have more than one necessary precondition. Take, for instance, the everyday occurrence of a sunburn on your skin. The sun being out and strong isn’t enough; you yourself need to be outside long enough, and you need to be poorly protected. Only the concurrence of all three events is sufficient to cause a sunburn. In this case, we can compound all three prior conditions into “the” cause.
- Causal networks: In most practical situations, you have numerous potential causes influencing each other in various ways with various probabilities. You can model such things using a directed acyclic graph (DAG), where the nodes correspond to events and the arrows indicate causal relationships between the captured events. This is a fascinating way to study causality, including detection of overdetermination, multicausality, and spurious correlation; more on that below.
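To give you a taste of what such a network looks like in code, here is a minimal sketch using networkx; the events and edges are invented for illustration.
import networkx as nx

# toy causal network: nodes are events, arrows point from cause to effect
g = nx.DiGraph()
g.add_edges_from([
    ("schema_change", "pipeline_failure"),
    ("pipeline_failure", "data_quality_drop"),
    ("data_quality_drop", "usage_decline"),
    ("slow_queries", "usage_decline"),     # two distinct causes feed into the decline
])

assert nx.is_directed_acyclic_graph(g)      # a causal graph must remain acyclic
print(nx.ancestors(g, "usage_decline"))     # all potential causes of the decline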
A pragmatic non-definition of causality
Paraphrasing the late great Gregory Bateson, who cheekily didn’t define information as “a difference that makes a difference,” I don’t define causality as follows:
A cause is a necessary difference, all things being equal,
that later probably makes a difference.
It’s not a definition. Rather, it’s a tool to let us talk about causation.
This non-definition consists of six parts:
- A cause is a difference: It must be different from ambient noise, hence measurable. Unmeasurable events aren’t taken to be causes of anything.
- A cause is necessary: See above. If a cause looks sufficient but not necessary, we have the overdetermined firing squad event, where the suspected cause is, in fact, part of a larger clusterfuck.
- All things being equal: If you’re able to control all variables except the expected cause X (independent variable) and the expected effect Y (dependent variable), you should see a significant, well, effect of X on Y. This is the whole raison d’être of controlled experiments.
- Later: See above. We assume the effect comes after the cause.
- Probably: Even if a cause produces an effect, say, just 93% of the time, we can still call it a cause. Causality is probabilistic.
- Makes a difference: The effect, like the cause, must be different from ambient noise, hence measurable. Unmeasurable events aren’t taken to be effects of anything.
How does this help in data mesh observability? One of the gravest situations (in the observability sense) in a data mesh is continuous usage decline. This is for several reasons, the most important being the need to justify the ever-growing total operational cost of a data mesh: The more data products you have to maintain, and the more infrastructure and maintenance activity you have to finance, the higher your total operational cost. A continuous usage decline, then, implies reduced value generation, and when the total operational cost exceeds the total value generated, your data mesh is an expensive hobby indeed.
Let’s say you have identified this dire situation. It’s now imperative to fix it, but obviously, this is not a software bug. It’s a people bug. Nevertheless, people’s behavior is greatly influenced by the functionality, usability, and governance of software platforms, so maybe the usage decline has been caused by performance issues or data quality degradation. These, now, are causal assumptions that can be checked and, when verified, be used to repair the situation.
Meet Granger Causality
As a side note, let me remind you of the old story of the drunk who searched for his lost keychain under a streetlight instead of further away in the dark, where he’d actually lost his keys. When questioned about this curious behavior, he matter-of-factly replied, “I’m searching here because I can’t see a thing over there.” And so it goes in IT: we often do what’s possible, not what’s sensible, because doing the right thing is harder and more obscure. You could bring a flashlight, or you could put up more streetlights. One of those lights is Granger causality.
Granger causality is a statistical concept of causality that’s based on prediction. In an act of heroic simplification, Clive Granger reduced the hypertrophied causality discussion to the predictability of mere time series data. According to Granger causality, if a signal X “Granger-causes” a signal Y, then past values of X should contain information that helps predict Y above and beyond the information contained in past values of Y alone. The math of Granger causality is not easy, but thank goodness there’s a lib for that.
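For the record, here is a sketch of the bivariate case only: fit two nested autoregressions of Y, one using only the past of Y, and one also using the past of X,

$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, y_{t-i} + \varepsilon_t$$

$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, y_{t-i} + \sum_{i=1}^{p} \beta_i\, x_{t-i} + \varepsilon_t$$

and then test the null hypothesis $\beta_1 = \dots = \beta_p = 0$, typically with an F-test. If the null is rejected, the past of X adds predictive power over and above the past of Y, and X is said to Granger-cause Y. The full machinery (multivariate systems, lag selection, stationarity requirements) goes well beyond this sketch.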
In our scenario, we’d like to determine whether changes in data quality or performance predict a decline in data usage. For simplicity’s sake, let’s say you have time series data for data usage, data quality, and performance. We can apply Granger causality tests to these datasets.
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# random data; rigged so that data quality impacts data usage ;-)
data_usage = np.linspace(100, 50, 100) + np.random.normal(0, 10, 100)
data_quality = np.linspace(40, 20, 100) + np.random.normal(0, 5, 100)
performance = np.full(100, 5.0) + np.random.normal(0, 2, 100)

df = pd.DataFrame({
    'data_usage': data_usage,
    'data_quality': data_quality,
    'performance': performance
})

# kill missing values
df = df.dropna()

lag = 1  # the lag at which to extract the p-value of the F-test

# 1. check if data quality Granger-causes data usage
# (convention: the second column is tested as a cause of the first)
quality_res = grangercausalitytests(df[['data_usage', 'data_quality']], maxlag=2)
f_pvalue = quality_res[lag][0]['ssr_ftest'][1]
if f_pvalue < 0.05:
    print("==> Data quality G-causes data usage with 95% confidence")
else:
    print("==> Can't say that data quality G-causes data usage")

# 2. check if performance Granger-causes data usage
performance_res = grangercausalitytests(df[['data_usage', 'performance']], maxlag=2)
f_pvalue = performance_res[lag][0]['ssr_ftest'][1]
if f_pvalue < 0.05:
    print("==> Performance G-causes data usage with 95% confidence")
else:
    print("==> Can't say that performance G-causes data usage")
Bear in mind that Granger causality doesn’t necessarily imply true causality; it just suggests that one time series can be used to forecast another. A true causal relationship can be much more complex and may need a more sophisticated analysis or even a controlled experiment to be definitively proven. What’s more, Granger causality can be fooled by strong correlation. But it’s a nice tool to have and easy to use too.
Meet Mr. Pearl
One of the most prolific researchers of probabilistic causation is Judea Pearl. It’s no exaggeration to say that you can’t speak properly about causality without speaking about Pearl. Turing Award laureate and first mover in the field of probabilistic reasoning, he’s been described as “one of the giants in the field of artificial intelligence.”
He more or less single-handedly invented Bayesian networks and contributed more to machine reasoning than any other researcher. He started his career with training in electrical engineering and physics, and his formative years were, strangely, concentrated on the tangible and deterministic nature of transistor technology; it’s not entirely clear when and why his focus switched to AI, but by the time he started his professorship at UCLA, he was already a fully fledged cognoscente of probabilistic reasoning.
Even if you won’t peruse his magnum opus “Causality: Models, Reasoning, and Inference” to understand the backdoor criterion, the front-door criterion, and his powerful do-calculus, and even if you somehow can’t find the time to read his dumbed-down “The Book of Why”, you can infer from his reaction to the murder of his son, the journalist Daniel Pearl (he founded a non-profit dedicated to reconciliation between Jews and Muslims), that he’s used to looking for deeper causes:
Hate killed my son. Therefore I’m determined to fight hate.
—Judea Pearl
Now what has this to do with data mesh observability? Much. A data mesh is a complex entity with numerous causal and probabilistic dependencies: technical, organizational, and psychological. You can use causal graphs to model those dependencies, and then run some do-calculus (yes, there’s a Python lib for that) to assess the impact of some intervention or influence. Here’s a simplified causal graph that aims at modeling the multiple possible reasons for increased or reduced data usage. If the modeling is done about right, and if your inference is performed correctly, you’ll be able to gain insights that no other tool, except a host of controlled studies, can surpass.
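One such library is DoWhy, which implements Pearl-style identification and estimation. Here is a heavily pruned, hedged sketch of how a fragment of such a graph could be expressed and queried; the variable names, the synthetic data, and the three-node graph are mine, purely for illustration (the graph is passed in DOT notation, which requires pydot or pygraphviz to be installed).
import numpy as np
import pandas as pd
from dowhy import CausalModel

# synthetic, rigged observational data, purely for illustration
rng = np.random.default_rng(1)
n = 500
performance = rng.normal(0.7, 0.1, n)                        # e.g., a query speed score
data_quality = 0.5 * performance + rng.normal(0.4, 0.05, n)  # quality partly driven by performance
data_usage = 40 * data_quality + 10 * performance + rng.normal(0, 3, n)
df = pd.DataFrame({"performance": performance,
                   "data_quality": data_quality,
                   "data_usage": data_usage})

# pruned causal graph: performance confounds the quality -> usage relationship
graph = "digraph { performance -> data_quality; performance -> data_usage; data_quality -> data_usage; }"

model = CausalModel(data=df, treatment="data_quality", outcome="data_usage", graph=graph)
estimand = model.identify_effect()      # finds the backdoor adjustment set {performance}
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(f"Estimated causal effect of data quality on data usage: {estimate.value:.2f}")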
This is, of course, an oversimplification. Important things like policies, tools, incentives, and feedback loops have been neglected. But even in this lightweight shape, it should become clear that data mesh causality is a deep topic. Read your Pearl.
King’s Road: The Controlled Study
I’ve hinted at the superiority of the controlled study, so I feel somewhat obliged to show you how it could be done.
A controlled experiment is one of the most direct ways to establish causality. This involves identifying the factor you want to test (the independent variable) and controlling all other variables to determine if changing the independent variable causes changes in the outcome (the dependent variable). The word “controlled,” here, means to fix all parameters except your independent and dependent variables, preferably keeping them constant or at least within tight bounds.
In our case, we’re interested in whether data quality or performance impacts data usage. Unfortunately, we can’t test both in a single controlled study; we must examine each factor individually. That’s the main drawback, especially compared to the magnificent modeling capacity of Pearl’s Causal Calculus.
Here’s an outline of how you might structure controlled experiments for both of these factors. If you want to ramp up the study’s quality to account for placebo and observer effects, use a placebo-controlled randomized double-blind study, where group membership is assigned randomly and neither group members nor researchers know who’s in which group.
First Experiment: Data Quality
- Hypothesis: Changes in data quality cause changes in data usage.
- Treatment Group: This group receives data with deliberately induced quality issues (missing values, incorrect entries, etc.)
- Control Group: This group receives data of normal quality.
- Outcome to Measure: Data usage (could be measured in terms of frequency of use, volume of data accessed, etc.)
Second Experiment: Performance
- Hypothesis: Changes in performance (latency, download speed) cause changes in data usage.
- Treatment Group: This group receives data with deliberately induced performance issues (slow response times, reduced availability, etc.)
- Control Group: This group receives data with normal performance.
- Outcome to Measure: Data usage (frequency of use, volume of data accessed, etc.)
The key with these experiments is to ensure that you’re only changing one variable at a time (either data quality or performance), while keeping all other variables constant. This way, any observed changes in the outcome can be attributed to the changes you made in the independent variable.
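For the group assignment itself, here is a tiny sketch of the mechanics (the consumer IDs and group sizes are invented): randomize who lands in which group once, then keep that assignment fixed for the duration of the experiment.
import numpy as np

# hypothetical population of data product consumers (users, teams, service accounts)
consumers = [f"consumer_{i:03d}" for i in range(200)]

rng = np.random.default_rng(7)
shuffled = rng.permutation(consumers)
treatment_ids = set(shuffled[:100])    # will receive the degraded service
control_ids = set(shuffled[100:])      # will receive the normal service

def serve(consumer_id: str) -> str:
    """Route each request according to the fixed, random assignment."""
    return "degraded" if consumer_id in treatment_ids else "normal"

print(serve("consumer_042"))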
Crucially, these experiments would have to be done carefully and ethically, especially if they’re done in a live environment. They might negatively impact people out there, so it’s important to consider this when designing and implementing the experiments. Always ensure that you’re not causing undue inconvenience to the users and that you’re complying with all relevant regulations and ethical guidelines. End of disclaimer.
Give Me the Code
Once you have the data on your treatment group and your control group, you can run a simple t-test to check whether your intervention brought an effect by comparing the means of both groups. This is a crude but effective tool.
import numpy as np
from scipy import stats
# control_group and treatment_group hold the measured outcome per subject
# (numpy arrays or pandas series); synthetic placeholder data for illustration
rng = np.random.default_rng(0)
control_group = rng.normal(100, 15, 50)    # e.g., weekly queries, normal quality
treatment_group = rng.normal(85, 15, 50)   # e.g., weekly queries, degraded quality
t_statistic, p_value = stats.ttest_ind(control_group, treatment_group)
print(f"T-statistic: {t_statistic}, p-value: {p_value}")
# if p < 0.05, the group means differ significantly at the 5% level, which,
# in a properly controlled experiment, points to a causal relationship
# between the independent and the dependent variable
Of course, the t-test tells you nothing about the magnitude of the difference. Here, Cohen’s d as a measure of effect size comes in handy. Cohen’s d is defined as the difference between the group means, divided by the pooled standard deviation:
import numpy as np
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    var1 = group1.var(ddof=1)   # sample variances
    var2 = group2.var(ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)
d = cohens_d(treatment_group, control_group)
print(f"Cohen's d: {d}")
Cohen’s d is a measure of how much difference there is between the two groups, in standard deviation units. As a rule of thumb, d < 0.2 is considered a small effect size, d ≅ 0.5 represents a medium effect size and d > 0.8 a large effect size.
If you want to pimp your scripts, you could also use ANOVA and Tukey’s HSD post-hoc test for group comparison, but the workings are the same: you end up with a p-value that lets you check for a significant deviation.
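To illustrate, here is a hedged sketch with three invented experiment arms, using scipy’s f_oneway for the ANOVA and statsmodels’ pairwise_tukeyhsd for the post-hoc comparison.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# three invented experiment arms, e.g., weekly queries per consumer
rng = np.random.default_rng(3)
control = rng.normal(100, 15, 50)
bad_quality = rng.normal(85, 15, 50)
bad_performance = rng.normal(90, 15, 50)

# one-way ANOVA: is there any significant difference between the group means?
f_stat, p_value = f_oneway(control, bad_quality, bad_performance)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey's HSD: which pairs of groups actually differ?
values = np.concatenate([control, bad_quality, bad_performance])
labels = ["control"] * 50 + ["bad_quality"] * 50 + ["bad_performance"] * 50
print(pairwise_tukeyhsd(values, labels, alpha=0.05))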
Controlled studies, if you can get them financed, can be great fun. Still, never underestimate the weirdness of people, the complexity of the universe, and the vastly simplifying assumptions built into statistical tests such as normally distributed data and homogeneity of variance, which indirectly imply that you need large groups to be approximately right. And watch out for your confirmation bias.
The journey to a mature data mesh is a complex endeavor; without observability you’ll fly blind, and without insights into causality the mesh will degrade. Entropy never sleeps.