Using TLA+ to Model Cascading Failures

Marianne Bellotti
Apr 10 · 15 min read
Business vector created by fullvector

TLA+is traditionally used to model the algorithms that power distributed systems. However, most engineering teams do not build their own message queues, they install and configure multiple solutions developed by other people. The biggest threat to these teams is how all those solutions interact together at scale.

Typically, engineering teams search for those threats by doing live failure tests. Either running extensive performance tests on a special testing environment, or actually sabotaging their own infrastructure to trigger a response ala Chaos Monkey. These are great practices, but somewhat resource intensive. And because they are resource intensive, organizations tend to put them off.

Modeling and its Discontents

There’s a popular saying a statistics: All models are wrong, but some are useful.

I’ve been played around with TLA+ for a while and was curious if it could be used to model the interactions between systems well enough to catch the kinds of complex failures that happen when distributed systems are chained together. I wasn’t sure it was possible to reach a“math proof” level of accuracy between the model and the real life system it was built to reflect but I was eager to try.

That being said, the purpose of this type of modeling is not to build perfect systems, but to help your team define and clarify assumptions. This kind of modeling forces the team to ask themselves exactly how the various products and services they are using interact and it can help find places where we actually don’t know what the expected behavior is, which is useful in minimizing complex failure but will not totally eliminate it.

The hard thing about modeling is figuring out when to simplify and when to specify. The whole point is to uncover details that in retrospect seem obvious but were otherwise not top of mind. But if one tries to capture every detail in the model it will likely be too complex to run.

Example One: The Naive Cluster

Let’s start off with a basic situation. We have three VMs. Each of them has a finite amount of resources, we can think of this as memory, CPU, or I/O but to be frank we don’t need that level of specificity. The important thing is that once the resources are exhausted on any single VM it is no longer healthy.

We start off with something like this:

In this model we’re going to use 0–10 to represent resource use percentage. We could use 0–100, but that would increase the amount of states TLA+ has to check rather dramatically without really providing any additional value. The way TLA+ works is that it runs all possible sets of input looking for places where requirements that you have defined (assertions, invariants, temporal properties) are violated. It is unlikely that for this model an input of 57% will produce an error that an input of 55% didn’t already catch, so we take the opportunity to simplify and use 0–10.

We then define an invariant called ServersHealthy in which we tell TLA+ that every server in our cluster should be below 90% resource use. We initialize our servers at 40% resource use to start. Why 40%? Here’s a great example of how model checking can lead to interesting conversations about what we really know with any given architecture. We know that when we spin up a new server it is using some memory and some CPU… that number is not 0%, but it’s also not 100%. What number in between is the most accurate? How would we figure that out? Should we be tracking this over time?

Anyway, let’s say we did all that and we decide on an initial state of 40% utilization. This is where we need to learn how to think about a system of systems as an algorithm instead of an environment. Although in the middle of an outage or complex failure it might seem like things are happening at random, in reality each system has a defined set of rules that govern its behavior and choices, and each system it hands off to has a defined set of rules that govern its behavior and choices. Just because the exact set of inputs is unexpected does not make the behavior random.

We want this first model to be naive so we’re going to define an incident as a series of events that change the amount of resources being used by one of our VMs. You can think of these as processes eating up CPU or requests using memory, whatever.

This defines an event as having a target (one of our possible three VMs already defined) and a size (a range from not increasing resource use to increasing it by 4). We’ll also define an incident as being a set of four events. Don’t worry you’ll get to see this altogether in one file at the end.

Why four? It’s pretty much an arbitrary number at this point. We want to see how multiple incoming processes/requests affect our VMs so we need more than one, but the more events we have in an incident the more combinations TLA+ has to consider, the longer the full model takes to run. So we settle on four as the sweet spot where we have enough events to see problematic behavior but not so much that we need to grab a coffee and come back to check the model’s results.

Now the meat and potatoes of our algorithm. As an event occurs it increases the resource use of one of our VMs.

I’m doing this as a macro trigger_event because we’ll want to add more complexity to it soon and breaking it out will keep things clean. Here’s what the full code looks like:

Running this produces a sensible and predictably uninspiring result:

TLA+ let’s us know that if we have three VMs at 40% of resource use, a process that increases resource use by 20% followed by another increase in resource of 40% on the same server will result in that VM being unhealthy… which is not something you would need to build a model to figure out but is helpful at demonstrating the concept of using TLA+ to model systems integration.

The core problem with this model is that it doesn’t capture a few essential components of behavior. We have a magical load balancer that apparently has no logic defining which VMs get which traffic, and processes consuming resources indefinitely.

Let’s change some of that.

Example 2: Load Balancing

To add a load balancer to the mix we’re first going to change our VM names to integers. There was no particular reason to have their identifiers be strings, except that it made the code a bit easier to read, but by changing them to integers modeling a round robin load balancer becomes just a matter of counting to three over again.

We don’t need our events directed to a specific VM anymore so we’ll just get rid of that. Then we add some basic conditional logic to trigger_event so that the model directs the next event to the next VM in our round robin rotation. Altogether we have:

You probably don’t need to run the model to guess what’s going to happen. The fourth event will go back to the first VM and knock it over our limit. To improve our model we need to program in recovering resources when processes end.

There are a couple of different ways we could approach that. We could give each event a timeout after which its load on the VM is cleared away, but that creates an unhelpful amount of complexity in the model. We’re only doing four events per incident, it’s not likely that timeouts would make enough of an impact to give us any insight we didn’t have before.

Another way of approaching it is to think of the reclaiming of resources as garbage collection. One event is not one process or request raising the resource use by a single VM, it is just a state change that could have been triggered by any number of things. Maybe Event X increases VM 2 resource use by 20% and Event X is actually a number of processes that happened in an undefined period of time. Then we assume at various points that some of those processes have ended naturally, some have timed out, but that those resources are still “used” because the machine hasn’t released them yet.

To represent that we have to rework our algorithm into processes because processes are how TLA+ represents concurrency and garbage collection should happen independently of any events happening to the server. Because we’re breaking this up into processes, we’re also going to move the variable initializations around a little bit. Some variables are only relevant to certain processes, it makes sense to keep them local. Any macros we call within the process will be able to access them.

What’s important to keep in mind when designing TLA+ models is that the model checker will test all possible combinations of inputs. Not the most likely. So this code will include all the likely scenarios where some VMs free up a lot of resources and some free up only a few, but also scenarios where no one frees up anything.

Which obviously will put us back where we started, despite that scenario being unlikely.

Going back to the all models are wrong, but some are useful maxim. Is this a useful level of insight? Not any more than having no garbage collection at all was. It is tempting to force the model to always do garbage collection just to get the model to pass, which is also not useful.

What we could do instead is reconsider ifServersHealthy should be an invariant the way it is. When we define an invariant, TLA+ will error if that statement is false for any reason, at the end of every step. ServersHealthy defines our model as being in a failure state whenever any of our three VMs are unhealthy, but in reality VMs do occasionally become unhealthy and go offline and it does not generally end Western civilization as we know it.

If we change the \A (All VMs are good) in our invariant to \E (at least one good VM Exists) the model runs without any failure states:

But that’s not enough… We can do better. We can add even more complexity.

Example 3: Autoscaling

There are lots of things about our models that are still lacking. Obviously if we were to move beyond four events our outcomes would look very different. And we’re missing a big piece of modern cloud architecture: autoscaling. So let’s rethink the model a bit.

We’re currently representing the VMs in a struct which in TLA is kind of like a hash map. This made assigning resource increases to various VMs super easy but it isn’t the best option when we want to add and remove VMs from the pool. Instead we’re going to store this information in a sequence, basically a list of numbers wherein each number represents the resource use of the VM.

[1..2->0..10] creates a set of sequences that represents all possible combinations of two VMs with a potential resource use from 0% to 100%. Creating a series of traffic patterns as in previous models was a bit easier to conceptualize, but not very efficient. There’s also a question of whether the right combination of state changes will bring both VMs to a place that is unsustainable. In our previous model we survived 4 cycles, but probably would not survive 8. By creating a set of every possible combination we no longer need cycles at all.

But this change requires us to remove our invariant and replace with something else:

This is a temporal property it’s defined and configured in much the same way as an invariant but it lets us use <> which in TLA+ means “eventually true”. If we were to keep the invariant then the model would fail as soon as TLA+ gave it the sequence<<10,10>>

This model doesn’t really load balance… honestly I went back and forth on this. Ultimately when a new VM is added to the pool, what we should expect to see is resource use even out as traffic is distributed across a larger group of machines. But if we aren’t doing cycles anymore I’m not sure that reflecting this behavior in the model is useful. So instead we have three processes: one that kills an unhealthy VM, one that adds a VM to the pool when resource use is high, and one that shuts down VMs when resource use is low.

There are a few unfamiliar things here. First await tells TLA+ not to run the process until the statement that follows it is true. VM_dies and scale_down will not run if the current pool has less than two VMs in it. This keeps all our servers from being turned off at once.

Oh except…. our model has found a case where that’s not true:

If both our VMs are at 80% usage and the kill process runs before the scale up process, they will both be killed. VMs := SelectSeq(VMs, LAMBDA x: x < 8); selects a new sequence from the original VMs, picking only the elements where the lambda returns true, in this case only VMs with less than 80% usage … in effect removing unhealthy machines from the pool.

And the pool doesn’t replace them because \E v \in 1..Len(VMs): VMs[v] > 6; returns false when applied to an empty sequence. So let’s change that by adding and Len(VMs) < 2 \/ to the beginning which will make it true if we have less than 2 VMs in our pool.

But even though we can control for this behavior by tweaking the model. This is not a fault in our model exactly. Actually it raises a really important question: what does our autoscaling service do when all the servers are unhealthy and it cannot spin up new ones? This actually happened to Buildkite a few years ago when some misconfigured settings triggered failure notices on otherwise healthy servers. In fact it’s exactly the sort of issue that tends to trigger cascading failures.

Right now we’re just going to update the model to prevent this, but if I were modeling a real system what I might want to do instead is build another model that looked at the process of spinning up new servers in detail. This is an important design practice for TLA+, because we are considering every possible state we don’t want to be adding more and more complexity to the same model. Instead we want to build more models. Using broad but simple business process models to find the edge cases, then building models that look at how we do those specific components in detail and testing for those edge cases.

Altogether here’s our finished model:

The fair keyword tells TLA+ that it is not necessary for those processes to run in order for the model to complete. We might never need to kill a server, for example, so TLA+ shouldn’t wait for us to have to.

Using Modeling as Part of the Development Process

All this seems like a lot of work, where does it connect to a normal development process? I see the benefit of modeling being primarily about communication of assumptions and gathering insight into the seams that join complex infrastructure. Good models can help you decide things like:

  1. Technology selection. Too often we choose tools based on what we’re already familiar with or what is trendy and cool. Modeling can help you determine what behaviors you’re looking for from the components of your architecture
  2. Monitoring and alerting strategy. Knowing how systems are likely to fail can help you figure out what to monitor without having to wait for failure to happen.
  3. SLOs. Part of the process of modeling is figuring out what the desired outcomes actually are before you’ve written any code. You can take those conclusions and draft your SLOs from them.

Make no mistake, complex systems will always have unexpected failures. Nothing can completely prevent them, but modeling can improve your awareness around them and give you an edge in cutting them off before they lead to outages.

Other Models To Look At

If you’d like to see the code and configuration from my models, they’re here on Github. Also thanks as always to the amazing Hillel Wayne for helping me track down the occasional TLA+ bug. Go buy his book!