Modelling the archetype of a message-passing bug with TLA+

Gregory Terzian
Jun 15, 2022


Two years ago, I tried to describe “the archetype of a message-passing bug”, using both a real-world example in Rust and plain English.

Now, having written some TLA+ recently, I propose we go through this bug, and its fix, once more, and this time use math to model it…

Quick recap

The original bug is best described by looking at a diagram provided in the Hacker News discussion by a contributor known as ivanbakel.

So basically, there were three different message queues, and the buggy code had been written with the implicit expectation that there would be some guaranteed ordering between messages sent on two different queues from two different threads of execution.

The solution was to send those two messages from the same thread of execution, on the same channel.

So how can we go about modelling the bug, and the fix, with TLA+?

Modelling the bug

Let’s start with the spec as a whole, and then dissect it.
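The original post shows the spec as an image; here is a minimal reconstruction based on the description that follows (the module name, the exact guards, and helper names like sent and vars are my assumptions):

```tla
------------------- MODULE message_passing_bug -------------------
EXTENDS Naturals, Sequences

CONSTANT QLen
ASSUME QLen \in Nat \ {0}

Messages == {"First", "Second", "Third"}

VARIABLES queue_a, queue_b, queue_c,
          sent, received_second, received_third

vars == <<queue_a, queue_b, queue_c,
          sent, received_second, received_third>>

Init ==
    /\ queue_a = <<>> /\ queue_b = <<>> /\ queue_c = <<>>
    /\ sent = FALSE
    /\ received_second = FALSE
    /\ received_third = FALSE

\* The bug: "First" and "Third" are sent from one action, with the
\* implicit expectation that "Third" will be received last.
SendMessages ==
    /\ sent = FALSE
    /\ Len(queue_a) < QLen /\ Len(queue_c) < QLen
    /\ queue_a' = Append(queue_a, "First")
    /\ queue_c' = Append(queue_c, "Third")
    /\ sent' = TRUE
    /\ UNCHANGED <<queue_b, received_second, received_third>>

\* Receiving "First" triggers sending "Second" on queue_b.
ReceiveQueueA ==
    /\ Len(queue_a) > 0
    /\ Len(queue_b) < QLen
    /\ queue_a' = Tail(queue_a)
    /\ queue_b' = Append(queue_b, "Second")
    /\ UNCHANGED <<queue_c, sent, received_second, received_third>>

\* The faulty assumption: "Second" must arrive before "Third".
ReceiveQueueB ==
    /\ Len(queue_b) > 0
    /\ received_third = FALSE
    /\ queue_b' = Tail(queue_b)
    /\ received_second' = TRUE
    /\ UNCHANGED <<queue_a, queue_c, sent, received_third>>

ReceiveQueueC ==
    /\ Len(queue_c) > 0
    /\ queue_c' = Tail(queue_c)
    /\ received_third' = TRUE
    /\ UNCHANGED <<queue_a, queue_b, sent, received_second>>

\* All messages received; self-loop so this is not itself a deadlock.
Done ==
    /\ received_second /\ received_third
    /\ UNCHANGED vars

Next == SendMessages \/ ReceiveQueueA \/ ReceiveQueueB
            \/ ReceiveQueueC \/ Done

Spec == Init /\ [][Next]_vars
==================================================================
```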

So what are we looking at?

In terms of variables, we have:

  1. A Messages set, consisting of the three messages “First”, “Second”, and “Third”.
  2. Three queues, initially empty, of the type Seq(Messages).
  3. A few other variables used for keeping track of whether we’ve sent some initial messages, and whether the “Second” and “Third” messages have been received.

We then have five operators, let’s go through each of them:

  1. SendMessages sends message “First” on queue_a, and message “Third” on queue_c. This comes with the implicit expectation (the bug) that “Third” will indeed be received as, well, third.
  2. ReceiveQueueA receives the “First” message on queue_a, and also sends the “Second” message on queue_b.
  3. ReceiveQueueB receives the “Second” message on queue_b, with the requirement that the “Third” message hasn’t been received yet.
  4. ReceiveQueueC receives the “Third” message on queue_c.
  5. Done represents the requirement that all messages are eventually received.

Note that for Done to be enabled, both “Second” and “Third” would have to have been received, and also note that for this to happen, “Second” would have to be received first (see the requirement that received_third be FALSE within ReceiveQueueB).

How would this match up with a real system?

Well, in the initial bug, SendMessages would essentially be component A sending two messages, one to component B, and another to component C. Component B, when receiving that message, would handle it and also send another message to component C. Component C would then be, say, “selecting” on two channels, one connected to component A, and another one to component B.

And the code would then contain the faulty assumption that the message sent by component B would be handled by component C before the second message sent by component A.

Now first of all, just looking at the above spec, the problem should be much easier to discern than by looking at the code. That is because the code is mostly concerned with business logic that is unrelated to sending and receiving messages.

Just writing the above spec, you’d probably start asking yourself why ReceiveQueueB is only enabled when received_third = FALSE.

Secondly, we can actually run this with the model checker, which will also show that the above spec is not logically correct.

For that we only need to choose a value for the QLen constant, which represents the maximum length of the queues, and can be set to, say, 1.
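In TLC this amounts to a model configuration along these lines (the file name and the name Spec are assumptions):

```
\* message_passing_bug.cfg
SPECIFICATION Spec
CONSTANT QLen = 1
```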

The deadlock reached when running the model checker

The problem shows up as a deadlock, in this case at state number 4. At that point, a state has been reached where “Third” has already been received via queue_c, while “Second” is now stuck in queue_b, because receiving that message requires “Third” not to have been received yet.
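The trace itself is shown as an image in the original post, but one interleaving that produces this deadlock can be reconstructed from the actions described above (a paraphrase, not verbatim TLC output):

```
State 1: Init           queues empty, all flags FALSE
State 2: SendMessages   queue_a = <<"First">>, queue_c = <<"Third">>
State 3: ReceiveQueueC  queue_c = <<>>, received_third = TRUE
State 4: ReceiveQueueA  queue_a = <<>>, queue_b = <<"Second">>
Deadlock: ReceiveQueueB requires received_third = FALSE,
          and no other action is enabled.
```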

So how do we fix this?

Modelling the fix

The spec for the fix is very similar to the faulty one. The differences are:

  1. We get rid of queue_c, keeping only two queues.
  2. When receiving “First” on queue_a, we append both “Second” and “Third” to queue_b.
  3. QLen is assumed to be above 1, and in the model we run it with the value 2, so that both messages can be appended to it as described above.
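Sketched under the same assumed naming as before, the fix boils down to dropping queue_c and having ReceiveQueueA append both messages to queue_b; the guards on the head of the queue are my assumption of how the two receive actions are kept apart:

```tla
\* Requires QLen >= 2, so that both messages fit in queue_b.
ReceiveQueueA ==
    /\ Len(queue_a) > 0
    /\ Len(queue_b) + 2 <= QLen
    /\ queue_a' = Tail(queue_a)
    /\ queue_b' = Append(Append(queue_b, "Second"), "Third")
    /\ UNCHANGED <<sent, received_second, received_third>>

\* FIFO order on the single queue now guarantees "Second" is
\* dequeued first, so the received_third guard can never block.
ReceiveQueueB ==
    /\ Len(queue_b) > 0
    /\ Head(queue_b) = "Second"
    /\ received_third = FALSE
    /\ queue_b' = Tail(queue_b)
    /\ received_second' = TRUE
    /\ UNCHANGED <<queue_a, sent, received_third>>

\* Still receives "Third", but now from queue_b.
ReceiveQueueC ==
    /\ Len(queue_b) > 0
    /\ Head(queue_b) = "Third"
    /\ queue_b' = Tail(queue_b)
    /\ received_third' = TRUE
    /\ UNCHANGED <<queue_a, sent, received_second>>
```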

Et voilà!

Wrapping up

What is the moral of the story? When just writing code, no matter how much experience you have with concurrency, you will overlook some subtle, and some not so subtle, concurrency issues.

That is mainly because the concurrency in the code is surrounded by additional business logic. It’s simply a question of not seeing the forest for the trees anymore…

TLA+ really shines as a tool to design the concurrent aspects of your system and to gain clarity about what you are trying to achieve, mainly because when writing a spec you focus only on the concurrency.

Thank you for reading.
