Modelling Exactly-Once Using TLA+ Part 1: Writing a Specification

Farooq Qaiser
15 min readJul 31, 2024

--

TLA+ is a language for writing system specifications and discovering fundamental design errors quickly. In this series of blogposts, we’ll introduce TLA+ by iteratively modelling and refining our design for a simple application that can read and write messages from/to Kafka in an exactly-once fashion. This will be a four-part series consisting of:

  1. Writing a Specification
  2. Checking a Specification
  3. Modelling Failures and Transactions
  4. Modelling Zombies and Zombie Fencing (coming soon)

You can find all of the code for this series in this GitHub repo.

Also, if you have interesting systems engineering problems and are looking for a strong Software Engineer, feel free to reach out via LinkedIn.

Introduction

Before we can start writing TLA+, we need to first understand the system we’re trying to model. For the purposes of this series, we’ll be modelling a simple process that continuously executes the following steps in a loop:

  1. Consume the next available message from an input topic based on the last committed consumer offset
  2. Produce the consumed message to an output topic
  3. Commit the consumer offset for the next message to be consumed in case of failure
Figure 1: Consume-produce-commit loop

Essentially, our system is just replicating messages from one topic to another. Where things get interesting is we want to make sure we replicate each message from our input topic to our output topic exactly once. If you’ve ever written a similar application using the Java KafkaConsumer and KafkaProducer APIs, you’ll know it’s easy to produce duplicates. Even worse, if you’re really not being careful, you could accidentally lose data!

We’ll talk more about exactly-once in parts 3 and 4. To avoid disappointment though, advanced readers should note that we will not be modelling some consensus algorithm like Paxos or Raft in this series. We’ll still encounter the same problems but our solution/algorithm for exactly-once will be fairly high level. This should still be an interesting exercise whilst remaining a beginner friendly introduction to TLA+ for most readers.

For part 1 however, we’re going to simply focus on getting familiar with TLA+ syntax by writing a simple specification of our system as we’ve described it.

TLA+ Specifications

TLA+ is primarily a language for writing system specifications i.e. a description of all the allowed behaviours a system may take during the course of an execution. Behaviours are represented as an infinite sequence of discrete steps, where a step is a transition from one state to the next.

Figure 2: Two different behaviours of the same system

And a state is simply the assignment of values to variables at a specific point in execution.

Figure 3: A behaviour describing how variable `a` changes between states

As a result, the first step in writing a TLA+ spec typically is to figure out what are the relevant variables in the system.

Variables

If we look carefully at Figure 1, we can see the only things that are changing in our system are the:

  1. Input topic
  2. Message buffer
  3. Output topic
  4. Committed consumer offset

These are the variables in our system. Based on this, we can start writing our first few lines of TLA+.

The first thing we need to do is create a new file ExactlyOnce.tla. Inside this file, we can declare a new TLA+ module like so:

---- MODULE ExactlyOnce ----  \* 1

==== \* 2
  1. TLA+ modules are always begun with a MODULE ${filename} declaration surrounded by 4 (or more) dashes on either side. You can think of it as kind of like a header signalling the start of a document.

    We also have the \* XXX syntax at the end of the line which is used to write in-line comments in TLA+. We’ll be using it throughout the series to indicate which line of “code” each bullet point is addressing.
  2. Similarly, TLA+ modules are always ended with 4 (or more) equals signs. You can think of this as kind of like a footer signalling the end of a document.

In between these two lines, we can write the actual specification of our system. We’ll start by declaring the variables we just discovered:

VARIABLES InputTopic, Buffer, OutputTopic, CommittedConsumerOffset  \* 1
  1. We use the VARIABLES keyword to declare any variables in our system as a comma separated list.

Using these variables, we should be able to describe any behaviour of our system in terms of how it changes (or doesn’t change!) these variables. Our next step is to specify the initial state of our system using these variables.

Initial State Predicate

We can model the initial state of our system as though our application has not yet been launched. In this state, our output topic will be empty and the message buffer will be missing. Technically speaking, the committed consumer offset would also be missing if the application hadn’t launched yet but for simplicity, we’ll have it point to the offset of the first message in the input topic. For similar reasons, we’ll assume our input topic will only ever contain one message. Lastly, we’ll assume both our topics are singly partitioned and thus avoid the need to model the concept of partitions entirely.

Figure 4: Initial state

These kinds of simplifications and assumptions are a normal part of the process of building models. A model is an abstraction of reality; we can include as much or as little detail as we want depending on how we intend to use the model. In our case, our ultimate goal is to model exactly-once behaviour. If we can’t achieve exactly-once in the simplest of cases (replicating a single partition containing only one message), then it’s hardly likely we’ll be able to figure it out for more complex cases. Like most things in life, it’s best to start simple and gradually build up from there.

TLA+ is based on the idea that the best way to describe things precisely is with math. Let’s see what the initial state of our system looks like in math:

CONSTANT NULL                       \* 1

Init ==
\* 2
/\ InputTopic = <<1>>
\* 3
/\ Buffer = NULL
\* 4
/\ OutputTopic = <<>>
\* 5
/\ CommittedConsumerOffset = 1
\* 6
  1. TLA+ doesn’t have any built-in concept of null so our first order of business is to declare a constant NULL. We’ll be using this shortly to represent missing data in our spec.
  2. In TLA+, we use == to define things. Here we’re using it to define a new operator named Init which will describe the initial state of our system.
  3. There are a bunch of symbols here so let’s try to understand what they each mean individually before putting them together.

    First we have the conjunction operator /\. This is used to combine boolean expressions AND wise. We’ll see later we can also combine boolean expressions OR wise using the disjunction operator: \/. (An easy way to remember which operator is which is by pretending /\ looks like the capital letter A, as in A for AND). We’ll be writing our definition of Init as a conjunction of 4 distinct expressions.

    Next, we have the angular brackets used for representing sequences:
    - <<>> represents an empty sequence.
    - <<1>> represents a sequence of one element; 1.
    - <<1, "a">> represents a sequence of two elements; 1 and "a".
    - And so on.

    For now, it’s sufficient to think of sequences as similar to ordered collections in regular programming languages e.g. the Tuple data type in Python. (In reality, TLA+ doesn’t really have a concept of sequences, they are under the hood just functions but don’t worry about that for this series). This makes sequences a natural fit for modelling Kafka topics, where each element of the sequence is a Kafka message.

    Last but not least, we have = which is the standard math equality operator you remember from primary school e.g. 1 = 1.

    Putting all of this together, you should now be able to read the first expression in our conjunction. Just like how in regular math we would read x + y = 1 as “x plus y equals 1,” we can similarly read InputTopic = <<1>> as “InputTopic is equal to a sequence containing 1” in the initial state.

    What’s the significance of the number 1 in our expression? We’ve chosen to represent our messages simply by their offset position in the input topic. So 1 represents the first (and only) message in the input topic. We could have modelled our messages as a struct type however for our purposes this will be more than sufficient and in fact it will make it easy for us to detect duplicates later (as Kafka offsets are unique within a topic-partition). This is another case of us simply choosing the level of detail we want to include in our model.

    Eagle-eyed readers might be surprised to see that the offset of the first message is 1 when, in reality, Kafka offsets start from 0. The problem here is that TLA+ sequences happen to be 1-indexed (R developers rejoice!). Whilst there are ways we could model this in TLA+ more accurately, for the purposes of our specification it does not matter whether our offsets start at 0 or 1, so we won’t waste our time trying to fix this unimportant detail.
  4. The next expression in our conjunction is Buffer = NULL which can be read as the Buffer is equal to NULL (in the initial state).
  5. The next expression OutputTopic = <<>> can be read as the OutputTopic is equal to an empty sequence (in the initial state).
  6. The final expression CommittedConsumerOffset = 1 in our conjunction of expressions states that the committed consumer offset is equal to 1 (in the initial state) to indicate that our process should start reading messages from the beginning of the topic.

What we’ve just written here is an example of an initial state predicate i.e. a boolean valued formula that specifies all the possible initial values of the variables in our system. You should be able to see clearly how each of the various TLA+ expressions we just defined map to our earlier description of the initial state. Even though we say TLA+ is based on math, it’s mostly simple boolean logic and set theory.

However, a system wouldn’t be of much use if all it did was remain in one state. Most systems will have multiple states that they will transition between. Our next step will be to specify these next potential states in our TLA+ spec, again using “math.”

Next-State Action

In TLA+, we will typically write a next-state action to specify the potential next steps a system can take. Just like we can define state predicates as boolean valued operators, we can also define actions as boolean valued operators in TLA+. The only difference is that while state predicates describe states, actions describe steps. (You can also just think of actions as a predicate over a pair of states given that a step is just a transition from one state to the next). In our case, a first attempt at writing such a next-state action operator might look something like this:

EXTENDS Sequences, Integers                                         \* 1

MessagesAvailable == CommittedConsumerOffset < Len(InputTopic) + 1 \* 2

ConsumeProduceCommit == \* 3
/\ MessagesAvailable \* 4
/\ Buffer' = InputTopic[CommittedConsumerOffset] \* 5
/\ OutputTopic' = Append(OutputTopic, Buffer') \* 6
/\ CommittedConsumerOffset' = Buffer' + 1 \* 7
/\ UNCHANGED InputTopic \* 8

Next == ConsumeProduceCommit \* 9
  1. The statement EXTENDS Sequences, Integers extends the
    scope of operators like Append defined in the Sequences module to our ExactlyOnce module. You can think of this as similar to import statements in Python.
  2. We’ve defined MessagesAvailable as an expression which is true if our committed consumer offset is less than one plus the number of messages in our input topic i.e. if there are still messages in our input topic that our process hasn’t yet replicated successfully. We’ll be using this definition in a couple of places in our spec.
  3. Similarly to how we defined Init, our ConsumeProduceCommit operator is also defined as a conjunction of several expressions.
  4. The first conjoined expression in our conjunction MessagesAvailable specifies that our system can only take this step if there are messages available to be consumed.
  5. There are a few new symbols here so let’s try to understand those first, starting with Buffer'.

    Recall that a step in TLA+ refers to a pair of consecutive states. Therefore, in order to define a step, we need to be able to refer the values of a variable in both states. We can do this by making use of unprimed and primed variables. Buffer is an example of an unprimed variable and it refers to the value of the Buffer variable in the first state of a step. In contrast, Buffer' (read as “Buffer prime”) refers to the value of the Buffer variable in the second state of a step. Thus when we see an expression like Buffer' = 1 , we can read this as “the value of Buffer in the next state is equal to 1.” In this way, we are able to define steps.

    The last new piece of syntax here is: InputTopic[CommittedConsumerOffset]. Square brackets allow us to “index” a TLA+ sequence in a similar way to how we can use square brackets in Python to index a Tuple. So for example, if we have a sequence Seq == <<"a", "b">>, then Seq[1] = "a" and Seq[2] = "b".

    Thus we can read Buffer' = InputTopic[CommittedConsumerOffset] as “the value of Buffer in the next state is equal to the value at the CommittedConsumerOffset index in the InputTopic sequence.” Essentially, this expression is simulating the process of consuming a message from our input topic and storing the message in a local buffer.
  6. OutputTopic' = Append(OutputTopic, Buffer') can be read as “the value of OutputTopic in the next state is equal to the OutputTopic in the first state with the message stored in the buffer appended to it.” Essentially, this expression is simulating the process of producing the message stored in our local buffer to the output topic.
  7. CommittedConsumerOffset' = Buffer' + 1 can be read as “the value of CommittedConsumerOffset in the next state is equal to the offset of the message in the buffer plus one.” Essentially, this expression is simulating the process of committing our new consumer offset.
  8. TLA+ is all about defining your system explicitly. Almost nothing about a system’s behaviours is left unspecified or implicit in a TLA+ spec. As a result, we also have to declare any variables that are left unchanged in a step. This is usually done by using the UNCHANGED keyword followed by the relevant variable/s.

    We could also have chosen to explicitly write out the unchanged variables e.g. InputTopic' = InputTopic which should read intuitively as “the value of InputTopic in the second state of this step is equal to the value of InputTopic in the first state of this step.” In general though, you’ll see most people prefer to use the UNCHANGED keyword.
  9. Finally, we define a Next operator as simply equivalent to the ConsumeProduceCommit operator we just defined. This is by no means necessary but the general convention in TLA+ is to name the next-state action formula Next. Of course, we could have also just renamed our ConsumeProduceCommit operator to Next but as we’ll see later on in the series, our definition of Next is far from complete.

Hopefully you can see how we’ve translated the consume-produce-commit loop we illustrated earlier in Figure 1 into precise TLA+ expressions.

At this point, we have formulas defining both the initial state predicate and next-state action. Obviously a system has to transition from an initial state to the next state but how do we express this formally in TLA+?

Spec

In TLA+, we will typically define a Spec operator to bring together our initial predicate and next-state action formulas as a temporal formula that describes all the behaviours a system can exhibit. Let’s see what that looks like for our system:

vars == <<InputTopic, Buffer, OutputTopic, CommittedConsumerOffset>>  \* 1

Spec ==
\* 2
/\ Init
\* 3
/\ [][Next]_vars
\* 4
  1. First, we’ve defined an operator vars that, in each state, is equal to a sequence of the values of all our variables. We’ll be using this definition of vars shortly.
  2. Next we’re defining a Spec operator as the conjunction of two separate expressions.
  3. The first expression simply states that a valid behaviour is one that starts in conditions matching our initial state predicate exactly.
  4. The second expression in our conjunction involves our next-state action formula and once again we have a bunch of new symbols that deserve some explanation.

    Let’s start with the square brackets at the beginning of the expression: []. This temporal-logic operator is formally referred to as the box operator and is generally used to express that a formula is always true . For example, []F asserts that the formula F is always true. In our case, [][Next]_vars asserts that [Next]_vars is true for every step of every behaviour.

    Naturally then the next thing we need to understand is what [Next]_vars means. [Next]_vars is actually short-hand/syntax-sugar for expressing Next \/ (vars' = vars). If we look at this expanded notation, what this is saying is either the system has to take one of the steps defined by Next OR it can take what is known as a stuttering step where no variables change. You’re probably wondering at this point, why would we want to write a system that can take steps in which nothing changes i.e. a system that can occasionally do nothing?

    The reason we have to specify our system like this is because TLA+ only allows temporal formulas that are invariant under stuttering i.e. temporal formulas must allow for stuttering steps to occur. This might seem strange at first glance but the underlying philosophy here is that TLA+ specifications describe more than just our system but rather an entire universe, of which the system we’re describing is just one part. In that universe, many events can take place besides those corresponding to your system and during those steps, your system’s variables will not change. Hence any sensible specification needs to allow stuttering steps. For a more in-depth explanation, you can read this article by Leslie Lamport the original author of TLA+.

    For these reasons, TLA+ syntax rules make it illegal to write a temporal formula as simply []A where A is an arbitrary action that does not allow stuttering. When using the [] temporal operator, A must allow stuttering steps and thus we usually write [][A]_v where the sub-expression [A]_v represents an A step or a stuttering step (and v represents all the variables that can change in your system).

Syntax-intricacies aside, we say Spec is defined as the conjunction of the initial state and the next state, for all values of vars. This formula asserts that a valid behaviour of a system is one that satisfies its initial state Init, and each of its steps satisfies Next or are stuttering steps. Thus, we’ve now formally defined all the behaviours our system can exhibit as a single formula, Spec.

Reflections

With a little rearrangement, we end up with the following spec:

---- MODULE ExactlyOnce ----

EXTENDS Sequences, Integers

CONSTANT NULL

VARIABLES InputTopic, Buffer, OutputTopic, CommittedConsumerOffset

vars == <<InputTopic, Buffer, OutputTopic, CommittedConsumerOffset>>

Init ==
/\ InputTopic = <<1>>
/\ Buffer = NULL
/\ OutputTopic = <<>>
/\ CommittedConsumerOffset = 1

MessagesAvailable == CommittedConsumerOffset < Len(InputTopic) + 1

ConsumeProduceCommit ==
/\ MessagesAvailable
/\ Buffer' = InputTopic[CommittedConsumerOffset]
/\ OutputTopic' = Append(OutputTopic, Buffer')
/\ CommittedConsumerOffset' = Buffer' + 1
/\ UNCHANGED InputTopic

Next == ConsumeProduceCommit

Spec ==
/\ Init
/\ [][Next]_vars

====

Notice how we call this a spec, as opposed to a program or code. Although we drew some parallels with programming, it’s important to remember that TLA+ is not a programming language. There’s no code or program here that can be run. All we’ve done is write down a formal description of our system.

While it’s nice that we can describe systems in a more precise and concise language than plain English, how do we know what we’ve written is correct? Just because we’re writing something in “math” doesn’t magically make it correct e.g. writing 1 + 1 = 3 doesn’t make it a true statement. So how do we know our spec is correct? And what does it mean for a spec to be correct? To find out the answers to these questions and more, see Part 2 - Checking a Specification.

In the meantime, if you want to learn more on your own, try:

  1. Extending the spec so that our system can read multiple messages at a time.
  2. Thinking about what our system should do once it’s consumed all the messages in the input topic. What does our spec say about this, if anything?

An Important Reminder

Over the last month, we have seen:

  1. The International Court of Justice established that the occupation of Palestinian territory is unlawful. For more information, you can read the ICJ’s official 80 page advisory opinion here (the last 3 pages are particularly noteworthy) or watch the recording of the proceedings on YouTube. The following is a quote straight from the ICJ document:

    “155. In light of the above, the Court reaffirms that the Israeli settlements in the West Bank and East Jerusalem, and the régime associated with them, have been established and are being maintained in violation of international law …”
  2. Multiple U.S. government officials who have resigned over U.S. policy towards Gaza, Palestine, and Israel released a joint statement which I highly encourage you to read here (it’s only 5 pages) or alternatively you can watch the recording of the related webinar on YouTube. The following is a quote straight from the joint statement:

    “Both our individual and common experiences demonstrate an Administration that has prioritized politics over just and fair policymaking; profit over national security; falsehoods over facts; directives over debate; ideology over experience, and special interest over the equal enforcement of the law.”
  3. And of course, during that time Israel has continued it’s bombing of UN-run schools and designated “safe zones.”

Don’t be a bystander to genocide.

🇵🇸🇵🇸🇵🇸

--

--