The Awesome Math of Data Mesh Observability: Significance

Hannes Rollin
6 min read · Jul 19, 2023


The world rewards those who find things others haven’t found, but doesn’t reward well those who were merely trying to be original.

—N. N. Taleb, The Bed of Procrustes

Some observations are indeed significant (DALL·E)

Observability, a paradigm borrowed from control theory and adapted to IT system supervision, is a top-down approach to understanding complex systems, especially their diseases, or “situations” in observability parlance. As I’ve explicated elsewhere, opaque, interconnected systems like data mesh architectures benefit hugely from the observability approach, where mainstream logging and monitoring are just not enough.

Typical data mesh situations are these (note that, except for #4, these situations are rather general and fit nearly every distributed IT system; you should always define additional situations specific to your mesh):

  1. Performance Degradation: Deteriorating system performance over time, often detected by increased response times, decreased throughput, reduced service reliability, etc. This is an early warning for the more dire resource saturation situation.
  2. Usage Decline: A decrease in the usage of the mesh, detected by metrics such as lower API calls to the data platform, fewer active users, reduced data usage, etc.
  3. Resource Saturation: This occurs when the system is running at its capacity limit, resulting in performance degradation or complete failure. It’s detected via system metrics like CPU usage, memory utilization, disk I/O, network bandwidth, etc. It’s nice to have been warned earlier by a recognized performance degradation.
  4. Data Quality Issues: If there are indications of a decline in data quality — for instance, missing data, inconsistent data, or an increase in data errors — observability can help pinpoint the source of the issues. This could be during ingestion, transformation, or even at the point of data consumption.
  5. Increased Error Rates: Spike in errors or exceptions occurring within your services, which can be detected by error logs, failure metrics, and exception traces.
  6. Service Disruptions: Complete failure of one or more services, causing disruption in the overall system functionality. Service health checks, error logs, and transaction traces can help identify this.
  7. Security Incidents: Potential security breaches or attacks detected by unusual activity patterns, multiple failed login attempts, unrecognized IP addresses, etc.

Of course, you have access to numerous logs, metrics, events, and traces—these are called “signals” in observability parlance—but how do you verify whether certain signals are strong or strange enough to indicate a particular situation or, in other words, whether a particular set of signals is significant?

Significance in Performance Degradation

Let me try to keep this concise by homing in on one situation that is rarely handled well: performance degradation. Often, site reliability engineers or admins rely on pre-defined thresholds and fancy dashboards, i.e., the power of human pattern recognition. Still, despite being visually adept, people are biased in countless ways, sometimes seeing things that are not there (fancy words for everyday things: apophenia, pareidolia, and the clustering illusion) and often missing things that are indeed relevant (read up on the base rate fallacy, change blindness, and confirmation bias).

That’s why we have statistics. I mean real statistics, not the ubiquitous bar graphs that colloquially pass for statistics.

Performance degradation is especially important in data mesh observability, since the decentralized architecture and the unforeseeable productization of nested data products will affect performance in strange ways. I would bet on quadratic or worse performance degradation, given linear growth in data product creation and data usage. So much for your AWS bills.

Here are four mathematical power tools to check for performance degradation in a data mesh.

Tested Linear Regression Model

This can be used to identify a trend over time in metrics such as response times, CPU usage, or memory consumption. An increasing trend over time could indicate performance degradation. Use a linear regression combined with a Wald test on an assumed t-distribution to check whether the regression line’s slope deviates significantly from zero. In plain English: if the slope is positive and p/2 < 0.05, you can say with 95% confidence that things are getting worse.

from scipy import stats
import numpy as np

# normal-distributed random data, should not decline
response_time = np.random.normal(0.5, 0.1, 100)
seconds = np.array(range(100))

slope, intercept, r_value, p_value, std_err = stats.linregress(seconds, response_time)

# we have to halve the p-value, since linregress uses a two-sided test
# with H0: slope = 0
if slope > 0 and p_value / 2 < 0.05:
print(f"Likely performance degradation. p-value: {p_value / 2}")

Exponential Smoothing Model

This model takes into account the effect of time and the possibility of sudden spikes or drops. It’s useful for analyzing metrics like error rates, number of timeouts, or failed requests. An increasing trend in these metrics implies performance degradation. Augmented by the Mann-Whitney U Test, it can even be taken as a basic anomaly detection algorithm. What we do is the following: Analyze past error rates, predict future error rates, and compare the actual occurring new error rates with the predicted ones.

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from scipy.stats import mannwhitneyu
import numpy as np

# random error rates for 100 time points (mean = 0.05)
error_rates = np.random.normal(0.05, 0.01, 100)

model = ExponentialSmoothing(error_rates)
model_fit = model.fit()

# predict error rates for the next 10 time points
predicted_error_rate = model_fit.predict(len(error_rates), len(error_rates) + 9)

# generate "actual" higher error rates for the next 10 time points
actual_error_rate = np.random.normal(0.06, 0.01, 10)

# one-sided test: are the actual error rates significantly higher than
# the predicted ones? (roughly, the U statistic counts the pairs in which
# a value from the first sample exceeds one from the second)
statistic, p_value = mannwhitneyu(predicted_error_rate, actual_error_rate, alternative='less')
print(f"U-statistic: {statistic}, p-value: {p_value}")

if p_value < 0.05:
    print("Looks like an anomalous performance degradation...")

Multivariate Anomaly Detection Model

This can be used to find outliers in multidimensional datasets. You might use this model with metrics such as request processing time, database query time, and network latency. Anomalies in these metrics indicate performance degradation. The charming point lies in the word “multivariate”: You can actually compound several signals and apply the model to all of them. Downside: You won’t know why exactly a data point is an outlier; that’s one reason why I’ll tackle causality in a future post. Here’s how it works.

from pyod.models.knn import KNN
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# random data for 100 time points
request_processing_time = np.random.normal(1, 0.2, 100).reshape(-1, 1)
database_query_time = np.random.normal(2, 0.3, 100).reshape(-1, 1)
network_latency = np.random.normal(0.01, 0.002, 100).reshape(-1, 1)

# scale features to the same range
scaler = MinMaxScaler()
data = scaler.fit_transform(np.hstack((request_processing_time, database_query_time, network_latency)))

model = KNN() # k-nearest neighbor
model.fit(data)

# some new data points...
new_data = np.array([[1.1, 2.2, 0.012], [1.3, 2.1, 0.013], [0.9, 2.3, 0.014]])
new_data_scaled = scaler.transform(new_data)

# 1 = outlier, 0 = inlier
anomaly_predictions = model.predict(new_data_scaled)
print(anomaly_predictions)

Output:

[0 0 1]

This means that the third “new” data point is an outlier. Please note that k-Nearest Neighbor doesn’t isolate a single anomalous variable; rather, it indicates a considerable distance to the central cluster in (here) three-dimensional space.
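
If you want more than a 0/1 verdict, pyod detectors also expose the underlying outlier score via decision_function — for KNN, essentially the distance to the k-th nearest training neighbor, so larger means farther from the central cluster. A quick sketch, reusing the fitted model and the scaled new data from above:

# raw outlier scores instead of binary labels; larger = more anomalous
anomaly_scores = model.decision_function(new_data_scaled)
print(anomaly_scores)

Plotted or thresholded over time, these scores make for a much more informative signal than the binary labels alone.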

Trend Detection Using Control Charts

Control charts, also known as Shewhart charts or process-behavior charts, are a statistical process control tool for studying how a process changes over time. A control chart shows the process variable (e.g., response time or error rate) over time plus three control lines: the center line (usually the mean) and the upper and lower control limits.

These limits are typically set as ±3 standard deviations from the center line. Roughly 99.73% of data points should fall within these limits if the data follows a normal distribution.

import numpy as np
import matplotlib.pyplot as plt

# random data for 100 time points
latency = np.random.normal(0.5, 0.1, 100)

center_line = np.mean(latency)
standard_deviation = np.std(latency)

upper_control_limit = center_line + 3 * standard_deviation
lower_control_limit = center_line - 3 * standard_deviation

plt.figure(figsize=(10, 5))
plt.plot(latency, linestyle='-', marker='o')
plt.axhline(y=center_line, color='r', linestyle='--')
plt.axhline(y=upper_control_limit, color='g', linestyle='--')
plt.axhline(y=lower_control_limit, color='g', linestyle='--')
plt.title('Control Chart')
plt.xlabel('Time')
plt.ylabel('Latency')
plt.show()

Output:

A nice control chart: Everything is under control

Naturally, you don’t want to leave it to human observers to watch charts all the time. What about Sunday, 2 a.m.? What about the extemporal coffee break? What about being bored out of your mind? You need some automatic rules. The most commonly used ones are known as the Western Electric Rules and the Nelson Rules; a few examples:

  1. One point is outside the 3σ limits
  2. Two out of three consecutive points are beyond the 2σ limit on the same side of the center line
  3. Four out of five consecutive points are more than 1σ from the center line, on the same side

This shouldn’t be too hard to implement.
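
Indeed it isn’t. Here’s a minimal sketch of what an automated check could look like, covering the three rules above; the helper name western_electric_violations is mine, and in practice you’d fix the center line and σ from a stable baseline period rather than estimate them from the very data you’re testing.

import numpy as np

def western_electric_violations(values, mean=None, sigma=None):
    """Return indices that violate the three rules listed above."""
    values = np.asarray(values, dtype=float)
    mean = values.mean() if mean is None else mean
    sigma = values.std() if sigma is None else sigma
    z = (values - mean) / sigma  # distance from the center line in sigmas
    violations = set()

    for i in range(len(z)):
        # Rule 1: one point beyond the 3-sigma limits
        if abs(z[i]) > 3:
            violations.add(i)
        # Rule 2: two out of three consecutive points beyond 2 sigma, same side
        if i >= 2:
            window = z[i - 2:i + 1]
            if (window > 2).sum() >= 2 or (window < -2).sum() >= 2:
                violations.add(i)
        # Rule 3: four out of five consecutive points beyond 1 sigma, same side
        if i >= 4:
            window = z[i - 4:i + 1]
            if (window > 1).sum() >= 4 or (window < -1).sum() >= 4:
                violations.add(i)

    return sorted(violations)

# same kind of random latency data as in the control chart above
latency = np.random.normal(0.5, 0.1, 100)
print(western_electric_violations(latency))

On well-behaved Gaussian noise like the simulated latency, the list should come back (nearly) empty most of the time; on a slowly drifting series, Rule 3 typically fires long before any single point crosses the 3σ limit.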

This is the first of several posts on data mesh observability, each outlining one major topic with tested code samples and a pinch of irony. Even if you’re an experienced data guy, you need to try new things to keep your edge, or, as the grand Taleb said,

Modern life is akin to a chronic stress injury

Coming soon: Causality. Stay tuned!


Hannes Rollin

Trained mathematician, renegade coder, eclectic philosopher, recreational social critic, and rugged enterprise architect.