My first TLA+ spec

Gregory Terzian
May 19, 2022

A case of spec-driven debugging.

I recently ran into a concurrency bug. It took me a bit of reading code and thinking about the system, but then I finally found it. Instead of “just fixing it” — I had been reading “the book” for a few weeks — I decided to finally take a shot at TLA+ to model both the failure and the fix. This is the story of that first TLA+ spec.

First of all, why TLA+?

These days, I write mostly in Rust, and work on concurrent and distributed systems.

I became aware of TLA+ in late 2019, briefly looking into it at the time. For some reason I didn’t quite get into it — perhaps I was thinking too much about “how to write concurrent code”, as opposed to “what the system is supposed to do” — and I kept it in the back of my head as something to look into later.

Enter the 2022 Shanghai lock-down. Suddenly, I had a bit of extra time. I also happened to have printed out the first 7 chapters of the book just before the start of the lock-down. And this time, I really got into it.

So, what about that first spec?

The setup

Imagine a system working a bit like a pipeline, consisting of three steps:

  1. The input: where one component handles some incoming data, does some initial processing, and then sends an event on a queue.
  2. The processor: a component that receives the event sent at step 1, does the actual heavy lifting of the processing, and then sends the result on yet another queue.
  3. The output: where the result from step 2 is further handled downstream.

Initially, the data would flow through the system as it was processed. In Rust parlance, it was moved from one component to another. Later, the system was refactored to introduce a form of shared data. From then on, events flowing through the system would not include the objects themselves, but only their identifiers — enabling components to look up the actual object in a piece of data shared among them.

The bug

In production, Step 2 started to fail occasionally with “missing object” errors, because it would receive an event from Step 1, yet the matching object in the shared data was nowhere to be found.

The problem was the following sequence of steps:

A. Step 1 processes input, adds it to the shared data, and sends an event containing an identifier to Step 2.

B. Step 2 prunes the shared data, removing the object that was just added at A.

C. Step 2 then receives the event from Step 1, and cannot find the object in the shared data.

The problem was a race condition between Step 1 adding the object to the shared data and sending an event to Step 2, and Step 2 pruning the object before handling that event. This is a classic problem of having multiple writers to a piece of shared data; I had written about this type of problem two years earlier, yet that didn’t prevent me from overlooking this edge case in my current design. Instead of fixing the code right then and there, I decided to make use of my newfound knowledge of TLA+.

Modelling the failure
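The spec is sketched piece by piece below, starting with its skeleton: the two queues, the shared storage, and the Step 2 component’s internal bookkeeping. This is a minimal sketch; the constants Ids and MaxQueue, the processor variable, and the "Empty"/"none"/"done" status strings are naming assumptions of the sketch.

```
---- MODULE pipeline ----
EXTENDS Naturals, Sequences

CONSTANTS Ids,      \* the set of object identifiers
          MaxQueue  \* a bound on queue length, keeping the model finite

VARIABLES incoming,       \* queue of events from Step 1 to Step 2
          outgoing,       \* queue of events from Step 2 to Step 3
          shared_storage, \* the shared data: Ids -> {"Occupied", "Empty"}
          processor       \* Step 2's internal data: Ids -> {"none", "waiting", "done"}

vars == <<incoming, outgoing, shared_storage, processor>>

Init == /\ incoming = <<>>
        /\ outgoing = <<>>
        /\ shared_storage = [id \in Ids |-> "Empty"]
        /\ processor = [id \in Ids |-> "none"]
```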

SendIncoming(id) models the Step 1 component, where an event is put on a queue and the object is added to shared_storage.
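A sketch of that action; bounding the queue with MaxQueue is an assumption that keeps the model finite:

```
\* Step 1: add the object to shared storage, then enqueue its identifier.
SendIncoming(id) == /\ Len(incoming) < MaxQueue
                    /\ shared_storage' = [shared_storage EXCEPT ![id] = "Occupied"]
                    /\ incoming' = Append(incoming, id)
                    /\ UNCHANGED <<outgoing, processor>>
```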

ReceiveIncoming then models the Step 2 component receiving the event, and adding it to its internal data with status "waiting". The shared_storage[new] = "Occupied" statement models the requirement that the object whose identifier is received via the event is present in the shared storage; this will be the source of failure for this model.
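A sketch, with new bound to the head of the queue:

```
\* Step 2: dequeue an identifier; the matching object must be in shared storage.
ReceiveIncoming == /\ incoming # <<>>
                   /\ LET new == Head(incoming)
                      IN /\ shared_storage[new] = "Occupied"
                         /\ processor' = [processor EXCEPT ![new] = "waiting"]
                   /\ incoming' = Tail(incoming)
                   /\ UNCHANGED <<outgoing, shared_storage>>
```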

ProcessWaiting(id) models the Step 2 component as processing the object. This is done in a separate step from receiving it because in the actual system objects are not always immediately processed — sometimes they need to wait for some additional data. This step also adds the identifier to the “outgoing” queue, which is a way of communicating with the Step 3 component.
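A sketch, using "done" as an assumed status for processed objects, and bounding the outgoing queue as well to keep the model finite:

```
\* Step 2: process a waiting object and hand its identifier to Step 3.
ProcessWaiting(id) == /\ processor[id] = "waiting"
                      /\ Len(outgoing) < MaxQueue
                      /\ processor' = [processor EXCEPT ![id] = "done"]
                      /\ outgoing' = Append(outgoing, id)
                      /\ UNCHANGED <<incoming, shared_storage>>
```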

PruneObject(id) models the Step 2 component pruning the shared storage. This step does not prune objects that will be needed by the Step 3 component — it will only do the pruning if the “outgoing” queue is empty.
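A sketch:

```
\* Step 2: prune an occupied object from shared storage, but only while
\* Step 3 has nothing pending on the outgoing queue.
PruneObject(id) == /\ outgoing = <<>>
                   /\ shared_storage[id] = "Occupied"
                   /\ shared_storage' = [shared_storage EXCEPT ![id] = "Empty"]
                   /\ UNCHANGED <<incoming, outgoing, processor>>
```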

ReadOutgoing models the Step 3 component, reading the identifier from the queue and looking up the object in shared storage.
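A sketch, followed by the next-state relation tying the five actions together:

```
\* Step 3: dequeue an identifier and look the object up in shared storage.
ReadOutgoing == /\ outgoing # <<>>
                /\ shared_storage[Head(outgoing)] = "Occupied"
                /\ outgoing' = Tail(outgoing)
                /\ UNCHANGED <<incoming, shared_storage, processor>>

Next == \/ \E id \in Ids : SendIncoming(id)
        \/ ReceiveIncoming
        \/ \E id \in Ids : ProcessWaiting(id)
        \/ \E id \in Ids : PruneObject(id)
        \/ ReadOutgoing

Spec == Init /\ [][Next]_vars
```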

In TLC, the TLA+ model checker, the spec fails with a deadlock: eventually the incoming queue fills up, while the shared_storage[new] = "Occupied" requirement of the ReceiveIncoming step cannot be met. In the real system, the Step 2 component would receive the event, but be unable to find the object in shared storage.
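A small configuration is enough to surface this failure, since TLC checks for deadlock by default; the model values i1 and i2 and the bound of 2 are arbitrary small choices:

```
\* pipeline.cfg
SPECIFICATION Spec
CONSTANTS Ids = {i1, i2}
          MaxQueue = 2
```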

Modelling the fix

The fix consists of avoiding more than one concurrent writer to shared_storage.

The SendIncoming(id) step should only put the identifier on the queue, and only the ReceiveIncoming step should add the object to, and eventually prune from, the shared storage.
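A sketch of just the two changed actions; the rest of the spec stays as above:

```
\* Fixed Step 1: only enqueue the identifier; leave shared storage alone.
SendIncoming(id) == /\ Len(incoming) < MaxQueue
                    /\ incoming' = Append(incoming, id)
                    /\ UNCHANGED <<outgoing, shared_storage, processor>>

\* Fixed Step 2: now the single writing component, adding the object on receipt.
ReceiveIncoming == /\ incoming # <<>>
                   /\ LET new == Head(incoming)
                      IN /\ shared_storage' = [shared_storage EXCEPT ![new] = "Occupied"]
                         /\ processor' = [processor EXCEPT ![new] = "waiting"]
                   /\ incoming' = Tail(incoming)
                   /\ UNCHANGED outgoing
```

With this version of the sketch, ReceiveIncoming is enabled whenever the incoming queue is non-empty, so TLC no longer finds the deadlock: every write to shared_storage now happens from within the Step 2 component.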

