Modeling Distributed Systems: Conceptual Background and Metrics

Matt Stump
Published in Analytics Vidhya · 7 min read · Jan 28, 2020

This is the first in what will be an ongoing series on using ML and statistics to model failure scenarios for complex distributed systems. Over the course of the series we're going to cover the step-by-step process of building machine learning models that can outperform humans at root cause analysis and performance tuning. To understand how to tackle the process through automation, it's important to first understand the process a human expert uses when troubleshooting or doing performance analysis.

A little bit of background

I've spent the past 10 years working with large-scale distributed systems, both as a user and as a consultant. In my consulting role I've worked on 500+ deployments, typically some combination of Cassandra, Kafka, Spark, ElasticSearch, Solr, Kubernetes and Hadoop. Consultants typically play two roles: helping a company with strategy and advice as it builds an application, and fielding calls when everything goes wrong or there is a major performance problem. In both cases, you perform more or less the same workflow:

  1. Gather the goals and intent of the user;
  2. Collect the relevant metrics or logs; and
  3. Systematically work through the various workloads and sub-systems to determine where the bottleneck or problem might originate.

There are two pieces of published work that lay out how to best accomplish this process: the USE Method by Brendan Gregg and the SRE book published by Google.

USE Method

A computer isn't a monolithic thing; it's a collection of sub-systems working in cooperation to achieve a task. These sub-systems are things like the CPU, memory, disks, and network interfaces. Each of them has its own limitations and failure conditions. When a program has performance or reliability problems, it's typically because one sub-system is over-stressed or failing. Understanding why a system is slow, or disentangling a flood of alerts and errors, is difficult and often requires a lot of experience-based pattern matching and correlation. To make this process easier for newcomers, Brendan Gregg devised a basic procedure for systematically breaking a system down into its component parts and isolating the source of the problem.

For every resource, check utilization, saturation, and errors

Example system block diagram taken from Brendan's website, originally from the Sun Fire V480 Guide (page 82).

USE is an acronym for Utilization, Saturation and Errors, and the method is half guidance on what to look for when spotting problems and half procedure. The trick is to break your system down into all of its sub-components and then, "for every resource, check utilization, saturation, and errors". If you do this systematically, you should be able to see which systems are under stress and which are experiencing errors, and form a hypothesis about why the system is failing. Once you have a hypothesis, you can perform further checks to validate it, or propose experiments, like increasing IO or caching certain values, to fix the problem.

USE Method workflow: identify systems; for each system, check utilization, saturation and errors; if you find anything interesting, investigate. Repeat until you find the bottleneck or the source of the errors.
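To make the loop concrete, here's a quick Python sketch of the checklist. The resource list, the get_metrics() helper and the thresholds are hypothetical placeholders of my own, not part of the USE method itself; in practice you'd wire the collection step up to whatever monitoring you already run.

RESOURCES = ["cpu", "memory", "disk", "network"]  # sub-systems to walk through

def get_metrics(resource):
    # Placeholder readings so the sketch runs; replace with real collection
    # (sar, node_exporter, vendor APIs, and so on).
    sample = {
        "cpu":     {"utilization": 0.92, "saturation": 3,  "errors": 0},
        "memory":  {"utilization": 0.60, "saturation": 0,  "errors": 0},
        "disk":    {"utilization": 0.45, "saturation": 12, "errors": 2},
        "network": {"utilization": 0.20, "saturation": 0,  "errors": 0},
    }
    return sample[resource]

def use_check(utilization_limit=0.8):
    findings = []
    for resource in RESOURCES:
        m = get_metrics(resource)
        # For every resource: check utilization, saturation, and errors.
        if m["utilization"] > utilization_limit:
            findings.append((resource, "high utilization", m["utilization"]))
        if m["saturation"] > 0:
            findings.append((resource, "saturated: work is queueing", m["saturation"]))
        if m["errors"] > 0:
            findings.append((resource, "errors reported", m["errors"]))
    return findings  # anything flagged here is a candidate for deeper investigation

for finding in use_check():
    print(finding)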

The USE methodology works well when you’re trying to troubleshoot a single application on a dedicated server, but what happens if you’re attempting to do this for a much bigger system like a distributed database or an application consisting of many micro-services?

Google, the four golden signals and long tails

In large distributed systems, things get a little more complicated. You're dealing with hundreds or thousands of machines. How do you know which resource is the bottleneck? Do you check every machine? How do you know whether an error on one machine is causing the latency spike a user is seeing?

Your application is no longer a single program running on a single machine. It's a collection of hundreds or thousands of machines, and instead of a sub-system like disk or CPU, you monitor the characteristics of underlying services like queues or databases, each of which may be backed by many machines. To further complicate things, distributed systems aren't constant. Machines fail on a regular basis, application exceptions never drop to zero, systems scale up or down in response to load, and communication between sub-systems happens over the network, which introduces its own source of noise. If you were to alert on every error or variance in latency, you'd quickly become overwhelmed. To deal with this complexity, Google's SRE book offers guidance on what to measure and how.

What to measure

Chapter 6 of the SRE book deals with the complexity of monitoring distributed systems by giving guidance on what and how to measure. It goes through many of the same metrics highlighted in the USE method, but replaces Utilization with Latency and Throughput. Latency and throughput are concrete quantities that are most likely already exported by your system, whereas utilization is somewhat abstract and difficult to measure in a system that's always changing. Putting it together, we're generally concerned with four sets of metrics (Latency, Throughput, Saturation and Errors), and we'll use these metrics going forward to characterize the work and stress of each system or sub-system we're monitoring.
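As a rough illustration of how these four signal families can be carried around per service, here's a small Python sketch. The field names and SLO thresholds are mine, not something prescribed by the SRE book; the point is that latency shows up as percentiles, not as a single average.

from dataclasses import dataclass

@dataclass
class GoldenSignals:
    """The four signal families tracked per service or sub-system.
    Field names are illustrative; latency is a distribution, summarized here by percentiles."""
    throughput_rps: float   # requests per second
    latency_p50_ms: float
    latency_p95_ms: float
    latency_p99_ms: float
    saturation: float       # e.g. queue depth or fraction of capacity in use
    error_rate: float       # fraction of requests that fail

def violates_slo(signals: GoldenSignals, p95_slo_ms: float = 100.0,
                 max_error_rate: float = 0.001) -> bool:
    # Hypothetical SLO check: alert on tail latency and error rate, not the mean.
    return signals.latency_p95_ms > p95_slo_ms or signals.error_rate > max_error_rate

signals = GoldenSignals(throughput_rps=1200, latency_p50_ms=32, latency_p95_ms=55,
                        latency_p99_ms=80, saturation=0.4, error_rate=0.0002)
print(violates_slo(signals))  # False: tail latency and error rate are within budget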

Worrying about your tail

The other major takeaway from the chapter on monitoring is that you should worry about your tail. What is a tail? The tail is the higher percentiles of a probability distribution. Distributions are the key to getting a fuller understanding of what your system is doing. Unfortunately, most people, and frequently the systems themselves (JVM garbage collection, ElasticSearch), make the mistake of reporting metrics only by their mean (average), and by doing so make them nearly impossible to troubleshoot.

It’s about the distributions

Distributions are important because they describe the overall behavior of your system and provide you with a much richer understanding of what’s going on.

Latency percentile plot: a theoretical latency distribution with two peaks, around 30 and 45 milliseconds.

Just by glancing at the latency distribution plot above, I can obtain a very detailed understanding of the behavior of my system. I can see that the median latency is around 32 milliseconds, and that 95% of my requests complete in about 55 ms. In real-world terms this is a very tight distribution with very little variance. I also see two peaks, which means there are probably two different sets of behaviors, or two request latency distributions, encompassed by this graph. All of this information would be lost if I reduced latency to a single datapoint. Let's go through some real-world examples of how to put this information to use when troubleshooting.
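If you want to play with a distribution like this yourself, here's a quick sketch that generates a synthetic two-peaked latency sample and reads off the percentiles. The mixture is made up, so the exact numbers won't match the plot, but the shape of the exercise is the same.

import numpy as np

rng = np.random.default_rng(42)

# Synthetic bimodal latency sample: two request populations, roughly 30 ms and 45 ms.
fast = rng.normal(loc=30, scale=3, size=8000)
slow = rng.normal(loc=45, scale=4, size=2000)
latencies_ms = np.clip(np.concatenate([fast, slow]), 0, None)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"median={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms  mean={latencies_ms.mean():.1f} ms")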

Hey! Who are you saying has a ‘fat tail’?

As a first example of why distributions matter, let's go back to the discussion of long tails. Suppose you measure response times over a minute: 90% of your requests could take 3 milliseconds and 10% could take 300 milliseconds, giving an average response time of 32.7 milliseconds. That looks great, but if your SLO is a p95 of 100 ms, you'd be in violation and never know it. If you only record the mean (average), you're losing important information, including your ability to spot outliers.

Using only the mean of a metric can obscure important behavior. In this instance, a small percentage of high-latency requests caused the system to violate its SLO.

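The arithmetic is easy to check. Here's that hypothetical minute of traffic in a few lines of Python; the 90/10 split is deliberately simple.

import numpy as np

# 90% of requests at 3 ms, 10% at 300 ms.
latencies_ms = np.array([3.0] * 900 + [300.0] * 100)

mean = latencies_ms.mean()             # 0.9 * 3 + 0.1 * 300 = 32.7 ms: looks healthy
p95 = np.percentile(latencies_ms, 95)  # 300 ms: well above a 100 ms SLO

print(f"mean={mean:.1f} ms, p95={p95:.0f} ms, SLO (100 ms p95) violated: {p95 > 100}")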

Holy s*%#, why are things on fire?

We've established that distributions give us a richer understanding of how a system is behaving, but most importantly, they give us a tool to investigate what changed when things go wrong. If you track your distributions over time, you can easily compare the baseline distribution from before an outage to the distribution from the period when the alarm was triggered.

The distribution of this metric shifted right by about 30 milliseconds, indicating a general change in behavior.


If the distribution shifts to the right, it's probably a general change point (a new release, a dataset growing beyond memory size) or a general increase in load; if so, you can revert the build or scale the cluster. If a metric suddenly develops a spike, like the latency graph above, it's probably an intermittent fault, or a subset of the behavior is problematic; the subset could be a problem user, an expensive series of requests, or a machine with failing hardware. Either way, the shift in the distribution provides the clues you need to perform the next level of analysis.
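One simple way to put this comparison into practice, and this is my own sketch rather than anything prescribed by the SRE book, is to compare percentiles between a baseline window and the incident window, using a two-sample Kolmogorov-Smirnov test from scipy to confirm the distributions differ at all. A broad move in the median suggests a general change, while a move confined to the high percentiles suggests a problematic subset. The thresholds below are illustrative.

import numpy as np
from scipy.stats import ks_2samp

def compare_windows(baseline_ms, incident_ms, shift_threshold_ms=10.0):
    """Rough heuristic: classify a latency change as a general shift vs. a tail spike."""
    stat, p_value = ks_2samp(baseline_ms, incident_ms)  # did the distribution change at all?
    p50_shift = np.percentile(incident_ms, 50) - np.percentile(baseline_ms, 50)
    p99_shift = np.percentile(incident_ms, 99) - np.percentile(baseline_ms, 99)

    if p_value >= 0.01:
        return "no significant change detected"
    if p50_shift > shift_threshold_ms:
        return f"general right shift (~{p50_shift:.0f} ms): suspect a release, data growth, or a load increase"
    if p99_shift > shift_threshold_ms:
        return f"tail spike (~{p99_shift:.0f} ms at p99): suspect a subset, e.g. a user, a query shape, or a bad machine"
    return "distribution changed shape; investigate further"

# Quick demonstration with synthetic data: the incident window is shifted right by 30 ms.
baseline = np.random.default_rng(0).normal(30, 3, 5000)
incident = baseline + 30
print(compare_windows(baseline, incident))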

Wrapping up and next steps

To recap, we've established what to measure (throughput, latency, saturation, errors), how to measure it (distributions), and a rough troubleshooting workflow from the USE methodology. However, there is one more item required to put this into practice: DAGs (Directed Acyclic Graphs). In the next installment, we'll talk about how to construct DAGs for known systems and the general heuristics for isolating and attributing causality in service disruptions.
