Modelling Exactly-Once Using TLA+ Part 2: Checking a Specification

19 min readAug 7, 2024

TLA+ is a language for writing system specifications and discovering fundamental design errors quickly. In this series of blogposts, we’ll introduce TLA+ by iteratively modelling and refining our design for a simple application that can read and write messages from/to Kafka in an exactly-once fashion. This will be a four-part series consisting of:

Writing a Specification
Checking a Specification
Modelling Failures and Transactions
Modelling Zombies and Zombie Fencing (coming soon)

You can find all of the code for this series in this GitHub repo.

Also, if you have interesting systems engineering problems and are looking for a strong Software Engineer, feel free to reach out via LinkedIn.

Introduction

We left off in Part 1 with a question; how do we know our spec is correct? TLA+ provides only a grammar for specifying systems. Nothing about the way TLA+ is written guarantees correctness. In fact, there are several things that are wrong about our specification the way it’s written currently. We need another tool to verify our designs. This is where TLC comes in.

As noted on the official GitHub project page, TLC is a model checker for specifications written in TLA+. In short, it works by generating all the possible system behaviours permitted by the spec and then checking they don’t violate any of the desired properties. In this blogpost, we’ll see how we can use TLC to check the specification we wrote in part 1.

Heads-up; at almost 5,000 words, this is the longest post of the series. That said, the payoff will be worth it as in this post, we will lay the foundations we need to be able to refactor our spec confidently in parts 3 and 4 as we discover problems with the exactly-once-ness of our design.

Running TLC

The first thing we need to do is define a new file with the same name as our TLA+ module; ExactlyOnce.cfg. Inside this file, we need to at a minimum declare our system’s specification formula like so:

SPECIFICATION Spec  \* 1

TLC will use to use the declared SPECIFICATION formula Spec to generate all the possible system behaviours.

Next, you’ll need to get a copy of the latest version of tla2tools.jar which is available for download from the official TLA+ GitHub project’s releases page. Once you have that in your working directory, we can run TLC just like any other regular Java program using:

java -cp tla2tools.jar tlc2.TLC ExactlyOnce.tla

The only required argument is the name of the TLA+ file we want to check. We don’t need to supply the ExactlyOnce.cfg file explicitly as TLC will automatically look for a .cfg file with the same name as our module in the current directory (this is why it’s important to name your modules and config files consistently).

Unfortunately though, when you run this command, you’ll see TLC throws the following error:

Error: The constant parameter NULL is not assigned a value by the configuration file.

Let’s try to understand what’s going on here.

Specifying Constants

Recall that in Part 1 of this series, we declared a CONSTANT NULL parameter in our TLA+ spec without actually specifying it’s value. While that’s totally valid in TLA+, TLC needs us to actually assign values to constants in order to be able to check a spec.

The question is, what value should we assign to our NULL parameter? This might seem a little paradoxical given that NULL is supposed to represent a missing value 😖 The trick here is to assign NULL a (sentinel) value that our system otherwise never expects to encounter for “real” data.

If we look closely at our spec, we can see that Buffer is the only variable that can equal NULL in our spec. Furthermore we can observe that, aside from NULL, Buffer is only ever equal an integer greater than 1 because of our decision to represent messages by their position inside the InputTopic sequence and the fact that sequences are 1-indexed in TLA+. As a result, we can assign the value 0 to our NULL parameter, since that is a value our system otherwise never expects. We can declare this in our ExactlyOnce.cfg file like so:

CONSTANT NULL = 0  \* 1

Declaring a value for a constant is pretty easy. We can just use the CONSTANT keyword followed by the name of the constant (as defined in our TLA+ spec) followed by an equals sign and then the actual value.

If you try running TLC again, you should no longer see the error about an unassigned constant parameter.

Checking for Deadlock

Unfortunately, you’ll see a different, possibly more alarming error:

Error: Deadlock reached.

It’s not quite as bad as it sounds. We’ve actually made some progress as this time TLC has actually run and checked some system behaviours. Unfortunately, TLC discovered that, in at least one sequence of steps, our system becomes deadlocked, which is another way of saying that after a certain number of steps, our system has no more steps it can take and effectively becomes stuck. TLC will even tell us in which state our system got stuck:

Error: The behavior up to this point is:

State 1: <Initial predicate>
/\ CommittedConsumerOffset = 1
/\ OutputTopic = <<>>
/\ InputTopic = <<1>>
/\ Buffer = NULL

State 2: <ConsumeProduceCommit line 20, col 5 to line 24, col 27 of module ExactlyOnce>
/\ CommittedConsumerOffset = 2
/\ OutputTopic = <<1>>
/\ InputTopic = <<1>>
/\ Buffer = 1

The question is why do we get stuck? If we look carefully at our spec, we’ll see that we’ve really only defined one possible step that our system can take; ConsumeProduceCommit. And recall that our definition of ConsumeProduceCommit states that our system can only take this step if MessagesAvailable is true, as shown here:

ConsumeProduceCommit ==
    /\ MessagesAvailable
    /\ Buffer' = InputTopic[CommittedConsumerOffset]
    /\ OutputTopic' = Append(OutputTopic, Buffer')
    /\ CommittedConsumerOffset' = Buffer' + 1
    /\ UNCHANGED InputTopic

If we look at State 2 in the error message above, the state in which our system becomes deadlocked, we can see that MessagesAvailable must be false as our CommittedConsumerOffset (2) is greater than the length of the InputTopic sequence (1). Thus we reach the same conclusion as TLC, that there are indeed no further steps our system can take from this state given how we’ve specified our system.

Now, you might be tempted to argue that nothing is really wrong. Our application has successfully processed all the messages that were available and there are no more messages available to consume. Effectively, you could say our system has finished it’s work and so there shouldn’t be any further steps/states after that. Whilst this is a valid line of reasoning, in general in TLA+ we prefer to model systems as an infinite sequence of states.

This leads us to what may seem like another paradox; how do you model systems that can terminate as an infinite sequence of states? 🤔 The trick here is to model termination as a step, in which nothing changes, and that we can take infinitely and exclusively after we’ve taken it once. Yet again, this is just a matter of defining another operator in our TLA+ spec:

Done ==                    /* 1
    /\ ~MessagesAvailable  /* 2
    /\ UNCHANGED vars      /* 3

Here we’re defining a new step as an operator Done.
We specify that we can only take this step once we’ve consumed all the available messages.
Lastly, we specify that none of the variables in our system can change their values during this step as nothing be should happening if our system is terminated.

We’re not done yet though (pun intended). Our new Done operator isn’t being referred to anywhere in our spec so let’s fix that:

Next == 
  \/ ConsumeProduceCommit  /* 1
  \/ Done                  /* 2

Previously our Next operator was defined as equal to ConsumeProduceCommit. Our system could always take this step if there were messages available to be consumed.
Now, we’ve expanded our definition of Next so that it is a disjunction (note the use of \/ rather than /\) of two potential next steps; either our system can take a ConsumeProduceCommit step OR it can take a Done step, depending on the state of our system.

Looking at our definition of Next now, it should be obvious that we’ve covered all possible scenarios and deadlock should be impossible as there is always a step our system can take, even if it’s the same one as before. We can verify this by re-running TLC:

java -cp tla2tools.jar tlc2.TLC ExactlyOnce.tla

You should see the command runs successfully this time without raising any further errors.

The absence of deadlock is a such a commonly desired property that TLC will check for this automatically by default. However to discover other problems in our spec, we need to define our own checks.

Checking for Termination

The next thing you typically want to check for in your system is termination i.e. does our system always eventually terminate? From our previous discussion whilst checking for deadlock, you might be confident that’s already the case but it’s always worth double-checking your assumptions explicitly.

In order to check for termination, we first need to decide what constitutes a terminal state in our system. In our case, we already figured this out when we were writing our Done operator; it’s when there are no longer any messages available to consume. We can therefore define termination in our TLA+ spec as follows:

Termination == <>[](~MessagesAvailable)  \* 1

We have a new symbol here: <>. Although it might look similar to <<>> which we used previously to declare sequences, the two operators are entirely unrelated. <> is another temporal-logic operator used to express the idea of eventually i.e. we can read an expression of the form <>F as “eventually F is true.”

We’ve already previously discussed the meaning of [] which can be summarized as always. We can compose these two operators into a temporal formula of the form <>[]F which is generally read as “eventually F is always true.” To avoid ambiguity, we can also read it as “eventually, F becomes true and remains true always thereafter.”

Thus we can read our definition of Termination as “eventually no messages are available always.”

In order to check that Termination holds, we need to declare this property in our ExactlyOnce.cfg file like so:

PROPERTY Termination  \* 1

We can use the PROPERTY keyword to declare any properties we want TLC to check are eventually true in every behaviour exhibited by our system.

If we try running TLC now, you might be surprised to discover that our system doesn’t actually always terminate:

Error: Temporal properties were violated.

Error: The following behavior constitutes a counter-example:

State 1: <Initial predicate>
/\ CommittedConsumerOffset = 1
/\ OutputTopic = <<>>
/\ InputTopic = <<1>>
/\ Buffer = NULL

State 2: Stuttering

And that’s because we forgot to account for stuttering in our spec. Previously, we mentioned that all (valid) TLA+ specs are stutter-invariant meaning they can always stutter between any two “real” actions. In fact, in TLA+ it’s perfectly valid behaviour for a system to stutter forever! That’s what’s happening in this case. After the initial state, the next step and every step after that is that a stutter step. Obviously, if our system never takes a ConsumeProduceCommit step, we’ll never consume any messages and thus our system will never satisfy our definition of Termination.

Another, perhaps more intuitive, way to think about this kind of behaviour is to imagine if our system crashed immediately and thereafter never recovered. In this worst case scenario, other events will continue to happen in the wider universe which are represented by stutter steps. This is the type of situation that TLC is effectively simulating for us with this stuttering-forever behaviour.

While it’s nice that we can simulate this, it’s not usually super useful to assume that a system can crash permanently. To get around this issue, we need to somehow express in TLA+ the idea that that if our Next step is enabled (i.e. it is possible to take a Next step), our system must eventually take that step (rather than stuttering forever). We can do exactly this by adding a fairness constraint to our spec. Before we introduce the syntax for doing this, let’s discuss quickly what exactly we mean by fairness.

By default, TLA+ considers all actions to be unfair, which is to say that there is no guarantee that a given action A will eventually be taken regardless of whether it can happen or not. In contrast, we say an action A is weakly fair if, given that an A step can always happen, then it eventually does happen. These are exactly the semantics we want for our Next step. Let’s see how we can specify our Next step as weakly fair in our spec:

Fairness == WF_vars(Next)  \* 1

Spec ==
    /\ Init
    /\ [][Next]_vars
    /\ Fairness            \* 2

In TLA+, we can specify an action A that changes variable v as weakly fair by writing an expression of the form WF_v(A). Thus, WF_vars(Next) is declaring that the Next action (which changes vars) is weakly fair.
Here, we’re just adding our fairness constraint to our spec as another conjunction. In formal jargon, we say “our Spec is now defined to be our system’s specification with the added condition of weak fairness of the next-state relation which asserts that execution does not stop as long as one of the Next actions is enabled.”

If we check this updated specification formula, TLC should no longer report any temporal property violations 👏👏👏

For a more in-depth treatment of the concept of fairness, I highly recommend reading Hillel Wayne’s excellent Weak and Strong Fairness blogpost. In fact, while you’re at it, just read all of his blogposts about TLA+.

Hopefully, you’re starting to see a pattern by now. Every time we discover a problem in our spec, it’s usually because we’ve made some assumption in our heads but failed to write it down explicitly. TLA+ is not a language that takes anything for granted and it forces us to explicitly write down all of our assumptions.

Checking for Duplicates

So far we’ve checked for some of the usual properties you would expect of most systems; absence of deadlock and eventual termination. One of the key goals of this series however was to model a system that was exactly-once. An important property of exactly-once behaviour is to avoid generating duplicates (as compared to at-least-once behaviour). In our context, this means our OutputTopic should never contain any duplicated elements. We can express this property in TLA+ as follows:

NoDuplicatesIn(seq) ==                                 \* 1
    \A i \in 1..Len(seq):                              \* 2
        \A j \in 1..Len(seq):                          \* 3
            IF i /= j THEN seq[i] /= seq[j] ELSE TRUE  \* 4   

NoDuplicates == NoDuplicatesIn(OutputTopic)            \* 5

Here we’re declaring a new “helper” operator NoDuplicatesIn. As the name implies, the idea here is that we can use this operator to check that any given sequence doesn’t contain any duplicated elements.

As you’ve probably noticed, the syntax defining for operators that can “accept arguments” is very similar to what you might see when you define a function in regular programming languages like Python (except there’s even less syntax/ceremony involved!).
In order to check that each element of a sequence of arbitrary length is distinct, we need to compare each element with each other. We’ll need to use a new operator to do this: \A which is the “for all” operator. In general, an expression of the form \A x \in S : F can be read as “formula F is true for all elements x in the set S.” For example, \A i \in {1, 2}: i ≤ 2 is the same as writing (1 ≤ 2) /\ (2 ≤ 2).
Because we want to compare each element with every other element, we have to “iterate” over the sequence once for each element in the sequence (i.e. this is an O(N²) operation for those more comfortable with Big O notation).
We have a couple of new operators here. First, we have /= which is the not-equals operator i.e. opposite of =. Secondly we have IF … THEN … ELSE … which should be self-explanatory as it operates just like it’s counterparts in any other programming language.

Thus, we can read the whole expression as “for all elements i and j in the set of numbers from 1 (inclusive) to Len(seq) (inclusive), where i is not equal to j, the element at index i of sequence seq is not equal to the element at index j of sequence seq.”
We can then use our NoDuplicatesIn helper operator to write the definition of our NoDuplicates property which asserts that there should be no duplicates in our OutputTopic sequence.

We can check that this property holds by adding it to our config file:

INVARIANT NoDuplicates  \* 1

We use the INVARIANT keyword to declare any properties that we want to check are true in every state of our system. These types of properties are known as safety properties. In contrast, the PROPERTIES keyword is used to declare any properties that we want to check are true eventually. These are known as liveness properties.

The question of whether to write our duplicates check as a safety property versus a liveness property is a design decision. Are we writing a system that never contains duplicates ever or a system that eventually doesn’t contain any duplicates? For our purposes, let’s assume we’re designing a system that never contains any duplicates ever. Hence our decision to use the INVARIANT keyword rather than PROPERTIES.

If we run TLC with this updated configuration, you should see TLC passes the spec without issues i.e. our system design is robust against duplicates.

Checking for Missing Data

An example of missing data: messages 4 and 5 are missing

We’ve covered the first half of exactly-once by checking for a lack of duplicates. The other half of exactly-once is making sure no data is missing (otherwise our system would be considered at-most-once). In our context, this means checking that each element in InputTopic is present in our OutputTopic. We can express this property in TLA+ as follows:

SeqContains(seq, value) ==                                      \* 1
    \E x \in 1..Len(seq): 
        seq[x] = value                                          \* 2

SeqContainsAll(seq, values) ==
    \A i \in 1..Len(values):
        SeqContains(seq, values[i])                             \* 3

NoMissingData == <>[](SeqContainsAll(OutputTopic, InputTopic))  \* 4

Here we’re starting the process of defining a new “helper” operator SeqContains. As the name implies, the idea here is that we can use this operator to check if a given sequence contains a given value.
In order to “check” that at least one element of a sequence satisfies a condition, we need to use a new operator; \E formally referred to as existential quantification and generally read as “there exists.” In general, an expression of the form \E x \in S : F can be read as “there exists an element x in set S for which formula F is true.”

Thus, we can read our expression \E x \in 1..Len(seq): seq[x] = value as “there exists an element x in the set of numbers from 1 (inclusive) to Len(seq) (inclusive) where the element at index x of sequence seq is equal to value.” More informally, this expression is true as along as one of the elements in the given sequence is equal to the given value.
Using our SeqContains operator, we can define another helper operator SeqContainsAll that asserts that all elements of values are present in seq using the “for all” operator \A.
Last but not least, we can define our property NoMissingData as <>[](SeqContainsAll(OutputTopic, InputTopic)) or in other words “eventually our output topic will contain all elements of input topic always.”

We can declare this property in our config file like so:

PROPERTY NoMissingData  \* 1

Notice that we have declared this a Property rather than an Invariant because NoMissingData is only guaranteed to be true eventually. It is not guaranteed to be true in every state e.g. it definitely doesn’t hold in our initial state when the output topic is empty! However, if our system behaves correctly, the output topic should eventually contain all the elements of our input topic.

If we run TLC with this updated configuration, you should see TLC passes the spec without issues i.e. our system design is robust to missing data issues.

Refactoring

Before we conclude this post, let’s take a brief moment to “clean” up some parts of our spec.

Model Values

Our solution for assigning a value to NULL was to make it equal to 0, a value that our messages/offsets can never be equal to in our TLA+ spec. This is a rather brittle solution. If we were to reuse our definition of NULL in a different numerical context, there is no guarantee that 0 will be an appropriate value to represent missing data. Indeed, 0 might even be a legitimate value in that context!

To avoid this issue, it’s common for NULL to be defined using another data type that better fits the use case; model values. Model values are a special type of value that you cannot do anything with other than to check for equality (=) with another value. Furthermore, a model value is only ever equal to itself. Despite these restrictions, model values can still be quite useful and it’s common to use them for modelling the concept of null.

In TLC, we can assign a model value like so:

CONSTANT NULL = NULL  \* 1

Here we’re declaring that the CONSTANT NULL in our spec is equal to a unique model value NULL. Note that we could name the model value anything really e.g. CONSTANT NULL = abc.

If you try running our spec through TLC after this change, it should continue to pass without issues.

It’s worth noting that model values are a construct available exclusively in TLC. TLA+ itself has no concept of model values. The only way to “define” a model value is to declare a CONSTANT parameter in your TLA+ spec and then assign a model value to that parameter in your TLC config, as we did above.

For-All and Implication Operators

Our definition of NoDuplicatesIn currently looks like this:

NoDuplicatesIn(seq) ==                               
    \A i \in 1..Len(seq):                            
        \A j \in 1..Len(seq):                        
            IF i /= j THEN seq[i] /= seq[j] ELSE TRUE

We can simplify our definition of NoDuplicatesIn in two ways:

NoDuplicatesIn(seq) ==
    \A i, j \in 1..Len(seq):        /* 1
        i /= j => seq[i] /= seq[j]  /* 2

First, we can switch to using a slightly specialized form of the “for all” operator which allows us to assert that formula F is true for all possible pairs of elements in the set S using the \A i, j \in S: F syntax.

For example, \A i, j \in {1, 2}: i + j ≤ 4 is the same as saying (1 + 1 ≤ 4) /\ (1 + 2 ≤ 4) /\ (2 + 1) ≤ 4 /\ (2 + 2 ≤ 4).
The second simplification we can make to our definition of NoDuplicatesIn is to use the implication operator => to replace IF i /= j THEN seq[i] /= seq[j] ELSE TRUE with the shorter but otherwise completely equivalent i /= j => seq[i] /= seq[j] expression. In general, we read an expression of the form F => G as “F implies G.” F => G is true if either:
- F is true and G is true or
- F is false

If you try running our spec through TLC after these changes, it should continue to pass without issues.

Final Spec

After our refactor (plus a little rearrangement to organize everything a little more sanely) we end up with the following TLA+ spec:

---- MODULE ExactlyOnce ----

EXTENDS Sequences, Integers

CONSTANT NULL

VARIABLES InputTopic, Buffer, OutputTopic, CommittedConsumerOffset

vars == <<InputTopic, Buffer, OutputTopic, CommittedConsumerOffset>>

Init ==
    /\ InputTopic = <<1>>
    /\ Buffer = NULL
    /\ OutputTopic = <<>>
    /\ CommittedConsumerOffset = 1

MessagesAvailable == CommittedConsumerOffset < Len(InputTopic) + 1

ConsumeProduceCommit ==
    /\ MessagesAvailable
    /\ Buffer' = InputTopic[CommittedConsumerOffset]
    /\ OutputTopic' = Append(OutputTopic, Buffer')
    /\ CommittedConsumerOffset' = Buffer' + 1
    /\ UNCHANGED InputTopic

Done ==
    /\ ~MessagesAvailable
    /\ UNCHANGED vars

Next ==
    \/ ConsumeProduceCommit
    \/ Done

Fairness == WF_vars(Next)

Spec ==
    /\ Init
    /\ [][Next]_vars
    /\ Fairness

Termination == <>[](~MessagesAvailable)

NoDuplicatesIn(seq) ==
    \A i, j \in 1..Len(seq):
        i /= j => seq[i] /= seq[j]

NoDuplicates == NoDuplicatesIn(OutputTopic)

SeqContains(seq, value) ==
    \E x \in 1..Len(seq):
        seq[x] = value

SeqContainsAll(seq, values) ==
    \A i \in 1..Len(values):
        SeqContains(seq, values[i])

NoMissingData == <>[](SeqContainsAll(OutputTopic, InputTopic))

====

And the following TLC config file:

SPECIFICATION Spec
CONSTANT NULL = NULL
INVARIANT NoDuplicates
PROPERTY Termination NoMissingData

Running this through TLC, our spec should come back all clear.

Reflections

We wrote an initial spec. We added some checks. We found bugs in our spec. We corrected our spec. More checks. More bugs. More corrections. And so on in a feedback loop not unlike your regular programming experience with unit testing, except here we’re getting feedback on our designs.

Moreover, in less than 50 lines we’ve expressed a model of our system and checked that it is exactly-once. But is it really this easy to write an exactly-once replicator? Are we sure our model is an accurate representation of reality? Cue:

All models are wrong, but some models are useful.

In Part 3 Modelling Failures and Transactions, we’ll take a critical look at our specification and question some of the (unrealistic) assumptions we’ve accidentally embedded in our design.

In the meantime, if you want to learn more on your own, try the following exercises:

Currently our TLA+ spec just declares a constant NULL without any further constraints or information. It says nothing about what are valid values for NULL. How can you prevent users from assigning incorrect values for NULL? Hint
How would you change the spec if we decided that our system needs to be a streaming/always-on system rather than a system that terminates once all messages have been processed? What needs to change, if anything?
Take a look at the spec. How many states does our system have? Do they reflect all the possible, relevant states and state transitions for our system?
What happens as you (slowly!) increase the number of messages in the input topic? How long does TLC take to verify the spec as the number of messages increase?
Imagine if our spec permitted multiple messages in the input topic. In that case, what other properties might you want to check? How would you write a property that asserts that messages are produced to the output topic in the same order as the input topic?

An Important Reminder

Since part 1 of this series:

We have passed the 300 day mark of the ongoing genocide in Gaza. During that time, official statistics record at least 39,653 killed (including more than 15,000 children), 91,535 injured, and 10,000 missing.
B’Tselem (The Israeli Information Center for Human Rights in the Occupied Territories) published a new 118 page report titled “Welcome to Hell” which describes a network of “torture camps” where 9,623 Palestinians were held as of July 2024, 4,781 of whom were detained “without trial, without being presented with the allegations against them, and without access to the right to defend themselves.” Read the full report here. The following is a quote straight from the report:

“Their testimonies uncover a systemic, institutional policy focused on the continual abuse and torture of all Palestinian prisoners. This includes frequent acts of severe, arbitrary violence; sexual assault; humiliation and degradation; deliberate starvation; forced unhygienic conditions; sleep deprivation; prohibition on, and punitive measures for, religious worship; confiscation of all communal and personal belongings; and denial of adequate medical treatment.”
In similar news, the UN Human Rights Office also published a report on “arbitrary, prolonged and incommunicado detention by Israeli authorities, affecting thousands of Palestinians since last October.” Read the full report here.

Don’t be a bystander to genocide.

🇵🇸🇵🇸🇵🇸