TLA+ for startups (part 2)

The elevator pitch

Published in

Koodoo

14 min readDec 5, 2019

This is the second in a series of articles on why the use of formal specification languages — and specifically the language TLA+ — isn’t a completely mad idea for startups. In fact, I made the case in the first article that they could actually prove to be a match made in heaven.

You may agree or disagree with my claim at your own discretion. But I will plough on and start working through a problem using TLA+, and we’ll see if it helps or hinders.

Caveat emptor

Let’s clear one thing up from the start: I’m not going to attempt to teach you about the theory of TLA+ or how to use it. I’m also not going to make any further attempts to sell its merits to you, beyond what I wrote in the first article.

At the time of writing, I have learned what I consider to be the minimum amount necessary to roll my sleeves up and make a start. I have read Hillel Wayne’s book, Practical TLA+, I have worked through Leslie Lamport’s Video Tutorials, I have installed the relevant tools, and I have watched and read a few interesting blog posts and conference talks.

I have a background in Computer Science and a reasonable head for mathematics, but I think any logically-minded person could get to the same stage as I am currently at, with a little application.

From this point on, I’ll dive straight into describing the problem domain and attempting to model it in TLA+. I’ll talk you through my thought processes as I go along, and will add the occasional explainer for certain important constructs, but otherwise will assume you know what they mean.

The Elevator Pitch

Our problem domain will be elevator control software. Over on this side of the pond we call them lifts.

Why do I think this is an interesting problem to model?

Everyone reading this, with the possible exception of claustrophobes, will have used an elevator. Even the claustrophobes will understand the basics of how an elevator operates.
At its simplest, the control software for elevators should be pretty straightforward to model, so we can start with some easy but valid examples (e.g. 1 car, 2 storeys).
However, the problem gets exponentially more complex as you add more cars, more storeys and more safety features. This should allow us to validate that TLA+ is indeed capable of modelling these more complex scenarios without brain-melting amounts of extra effort.
They are a nice example of a finite state machine, which TLA+ is ostensibly good at modelling.
We can make suitable abstractions without compromising the usefulness of the solution. For example, we can model the logic about how calls to the elevator are received and serviced, without having to worry about the mechanics of how the elevator actually moves — gears, pulleys, power, door mechanisms, etc. We can reasonably assume that someone else has got all that covered.

Agreed? OK, let’s get stuck in.

The entry-level elevator

Leaving aside the simplest possible elevators, such as manually operated ones or stairlifts, the entry-level elevator that would benefit from control software would be one that:

Has a single car
Travels between just two floors/storeys (I’ll use the words interchangeably, sorry!)
You can press a button from either floor to call the car (and even if it’s in use, it remembers your call request)
Once inside the car, you just press a ‘go’ button and don’t have to select the destination floor
Doesn’t have any pauses built into its behaviours (allowing passengers time to enter and exit)

We might visually represent it like this:

(American readers will be happy with the numbering above, but may not realise we Europeans call the storey at ground level the “ground floor” or simply “0”. I prefer the European convention, but we’ll go with the American convention for one simple reason: TLA+ sequences are 1-based!).

You might at this point wish to politely point out that elevators without doors are a little dangerous. But they wouldn’t be the first — I spent many a happy afternoon hopping on and off the paternoster elevator during my university years in Sheffield. For now, let’s accept it as a valid simplification.

Let’s make a start

Let’s start writing our TLA+ spec. We’ll start by trying to define our constants, variables and initial state. Let’s say the set of known Storeys is a subset of the set of Integers, with the car’s initial location being at an arbitrary storey.

We’ll prefix our state formulas and actions with EC for “Elevator Controller”.

CONSTANTS StoreysVARIABLES carLocationASSUME Storeys \subseteq IntECInit == carLocation = CHOOSE s \in Storeys : TRUE

Remember that the CHOOSE operator indicates that an arbitrary value could be selected for carLocation from the set of available storeys. This is not the same thing as saying it WILL select one of those values or telling it HOW to select one— we are building a specification to check, not code to execute.

UPDATE: Following helpful feedback from the TLA+ community, it was pointed out that the use of the CHOOSE operator here could mask an error, as it may leave some states unexplored (it will fix the carLocation to an arbitrarily chosen value). Instead, we can simply use carLocation \in Storeys to explore all possible starting states.

Are we happy with that? Well, for one thing, it assumes the system is starting up with the car positioned at a storey and not between two storeys. Probably a reasonable assumption for the time being. Perhaps we should’ve called the carLocation variable carLastHaltedLocation, but let’s keep it simple for now!

What else can we assume about our starting state? We can assume something about the car’s movement. For now we will define the car’s potential movement states as “idle”, “going up” or “going down”, which we will represent numerically as 0 for idle, 1 for going up and -1 for going down.

The changed bits of the spec now look like this:

VARIABLES carLocation, carTravelStatusECInit ==
    /\ carLocation \in Storeys
    /\ carTravelStatus = 0

We can also assume something about the status of the two ‘call’ buttons on the walls and the ‘go’ button inside the car, namely that they are all in a ‘clear’ state.

VARIABLES carLocation, carTravelStatus, callRequest, goRequestECInit ==
    /\ carLocation \in Storeys
    /\ carTravelStatus = 0
    /\ callRequest = [s \in Storeys |-> 0]
    /\ goRequest = 0

I’m not going to explain every concept or construct as we go along, but this one, or a variant of it, is something you will see a lot of in TLA+ specs:

callRequest = [s \in Storeys |-> 0]

callRequest is here defined as a function where each value of set Storeys maps to the value of 0. As the set Storeys currently only contains the values 1 and 2 for our simple elevator, this might be written as follows: callRequest[1] = 0 and callRequest[2] = 0.

This is an example of a very elegant operator in TLA+, which enables you to describe something fairly complex about the properties of the system you are modelling without having to worry about how you might implement it in code.

What’s next?

We’ve defined a fairly rudimentary starting state. Now we will go about identifying some potential next states (known as “actions” in TLA+).

Car is called

The first ‘next state’ action we will describe is that the elevator is called from the other floor. Visually we could represent this as follows (a button is lit up after being pressed):

Here’s how we’ll write this in TLA+:

ECNextCall(s) ==
    /\ callRequest[s] = 0
    /\ callRequest' = [callRequest EXCEPT ![s] = 1]
    /\ UNCHANGED << carLocation, carTravelStatus, goRequest >>

Let’s break this down into plain English as best we can.

callRequests[s] = 0 means that, for the storey our elevator is being called from, the callRequest record has a ‘before’ value of 0 (i.e. it wasn’t already called).

callRequest’ = [callRequest EXCEPT ![s] = 1] means that the elevator hasn’t been called from any other storey, only s, so the value of the callRequest[s] will be 1.

The UNCHANGED line means that none of our other variables are modified.

So essentially what we have described here is that one possible next state is that a call request is made, turning the 0 to 1 in the relevant callRequest record, and nothing else changes in the process.

Service a call

Another ‘next state’ action we will describe is that the elevator proceeds to the storey from which it has been called:

In our specification, we now need to reflect the fact that the car is moving in this state:

ECNextServiceCall(s) ==
    /\ carLocation # s
    /\ carTravelStatus = 0 
    /\ carTravelStatus' = IF s > carLocation THEN 1 ELSE -1
    /\ UNCHANGED << callRequest, goRequest, carLocation >>

What does this mean?

carLocation # s means that the car’s current location is not the same as the storey it was called from.

The two conditions for carTravelStatus mean that the car was not already moving, and that it either goes up or down depending on whether the call came from the upper floor or the lower floor.

The UNCHANGED bit means that the callRequest remains in place, and carLocation remains unchanged, until the elevator gets there (and goRequest is also unaffected).

Car arrives after call

Our next potential action will describe the car arriving at the requested floor, as shown by:

We will specify this in TLA+ as:

ECNextArriveCall(s) =
    /\ carLocation # s
    /\ carTravelStatus # 0
    /\ callRequest[s] = 1
    /\ carLocation' = s
    /\ carTravelStatus' = 0
    /\ callRequest' = [callRequest EXCEPT ![s] = 0]
    /\ UNCHANGED << goRequest >>

Breaking this down:

carLocation before entering this state was NOT the storey being arrived at (s), but after entering this state it is.

carTravelStatus # 0 describes that the car was in motion, but upon arriving carTravelStatus’ = 0 means it then stops moving.

callRequest was 1 for the target storey before the car arrives, but becomes 0 after the car arrives; all other callRequest records remain unchanged.

Go request

So people can request the car from another storey, but passengers can’t yet tell it to go once they get in! So let’s visualise that:

We will specify this as follows:

ECNextGoRequest ==
    /\ goRequest = 0
    /\ carTravelStatus = 0
    /\ goRequest' = 1
    /\ carTravelStatus' = IF carLocation = 1 THEN 1 ELSE -1
    /\ UNCHANGED << carLocation, callRequest >>

So when the passenger hits the Go button, if it wasn’t pressed already, the car starts moving towards its destination.

Oh wait. There might be a flaw in this. Just by starting to write out the action formula, and identifying the preconditions (unprimed variables) and postconditions (primed variables) of the action, it made me realise something: what if I got in the car, and before I could hit the Go button, someone else on the other floor hit the Call button? Should I then be able to hit Go anyway and see it light up? I think I should.

The point here is not that we need to find the perfect solution right away, but the act of attempting to write the specification has forced me to think about edge cases or “non happy path” behaviours. It’s a perfectly valid behaviour for an elevator — and one that is no doubt happening all over the world right now as we speak — for the car to start moving towards another storey before the passenger has managed to select their preferred destination. TLA+ helps us to reason about these behaviours in a formal way.

For now, let’s split out the Go request from the servicing of the request:

ECNextGoRequest ==
    /\ goRequest = 0
    /\ goRequest' = 1
    /\ UNCHANGED << carTravelStatus, carLocation, callRequest >>

Translated, this now means the goRequest moves from a 0 to 1 when entering this state, and that’s it. We’re assuming nothing else about the location or travel status of the car, or any call requests made from the storeys.

Service the Go request

So now we can have an unfulfilled Go request, we need an action where that request is serviced — i.e. the elevator heads to its requested destination.

ECNextServiceGoRequest ==
    /\ goRequest = 1
    /\ carTravelStatus = 0
    /\ carTravelStatus' = IF carLocation = 1 THEN 1 ELSE -1
    /\ UNCHANGED << goRequest, carLocation, callRequest >>

That is: given a go request and a non-moving car, in this state the elevator will start moving up or down, depending on the current location.

Car arrives after Go request

Similar to the call request, we will specify the action where the car arrives after fulfilling its go request. Note that in this action and the previously described action, we did not specify (s) behind the action name. This is because it does not need to be parameterised for the storey — there’s only one go button, whereas there are n call buttons (where n=2 in our simple elevator system).

ECNextArriveGoRequest ==
    /\ carTravelStatus # 0
    /\ goRequest = 1
    /\ carLocation' = IF carLocation = 1 THEN 2 ELSE 1
    /\ carTravelStatus' = 0
    /\ goRequest' = 0
    /\ UNCHANGED << callRequest >>

As the elevator arrives from servicing a go request, the carLocation will flip from either storey 1 to storey 2, or vice versa.

The carTravelStatus will change from moving (not 0) to not moving (0).

The goRequest will be cancelled once the car arrives; callRequests remain unchanged.

Putting it all together

We’ve so far specified an elevator system that has one car, travelling between two storeys, which can be called by waiting passengers, will move to service such a call, and will cancel the call request once it has arrived.

If a passenger enters the elevator car and hits the Go button, the car will move to the other storey, and when it arrives, the ‘go’ request will be cancelled.

The full next-state formula

Given all the individual actions described above, our full next-state formula can be defined as follows:

ECNext ==
    \/ \E s \in Storeys :
        \/ ECNextCall(s)
        \/ ECNextServiceCall(s)
        \/ ECNextArriveCall(s)
    \/ ECNextGoRequest
    \/ ECNextServiceGoRequest
    \/ ECNextArriveGoRequestvars == << carLocation, carTravelStatus, callRequest, goRequest >>
Spec == ECInit /\ [][ECNext]_vars

In case you weren’t already aware, the indentation in TLA+ is significant. The above means that our universal next state (ECNext) is a logical disjunction between the three possible actions parameterised by a storey identifier (elevator is called) and three possible actions without a storey identifier (elevator is asked to go by a passenger inside it).

The construct \E s \in Storeys : can be read as “there exists a value s in the set of all Storeys where…”. The TLC model checker will interpret this construct by attempting to fulfil the next state using any valid value from the set Storeys.

Invariants

We also want to describe in a formula all of the things that should remain true in all states of our system. We’ll start simple, as follows:

ECTypeOK ==
    /\ carLocation \in Storeys
    /\ carTravelStatus \in {-1, 0, 1}
    /\ callRequest \in [Storeys -> {0, 1}]
    /\ goRequest \in {0, 1}

The above is pretty uncontroversial. We’re checking that the car is always at a valid floor, that it’s always either stationary, going up or going down, that the callRequests are from the valid combination of call requests for our known floors, and that the Go request is either on or off.

These aren’t particularly taxing — we’ll have a go at making those a bit more meaningful later.

Does it work?

We’re ready to give our specification a spin. I won’t share the complete source code, as I think there’s more value in you putting it together yourself and understanding each bit at a time :)

Here’s how your model might look:

When we run this through the model checker, it works! We get 90 states and 32 distinct states. We have a working elevator!

What have we learned?

This was a pretty simple elevator, and only our first iteration.

But despite this, I think that just going through the process of identifying the different significant components of our system, and considering how that translates to variables and state changes (actions), helped me to conceptualise my system better.

We’ve made several important simplifications, such as not modelling doors, and keeping to one car and two storeys. By identifying those simplifications and attempting to codify them in a TLA+ specification, I feel I have a better understanding of the dimensions along which my system is going to get more complex.

Although I didn’t talk you through every step of my thought process, I found the construction of the next-state formulas (“actions”) to be particularly useful. In my first attempt, I started writing the ECNextGoRequest state formula with a behaviour in mind that as soon as the elevator passenger hit the Go button, it would set off for its destination. However, just by starting to write down the conditions of this action formula, I realised that there was a perfectly valid edge case behaviour I’d missed, and this led to the addition of the ECNextServiceGoRequest action.

The very act of trying to describe your system in unambiguous terms is itself a very thought-clarifying process. This is why many advocates of TLA+ consider the journey of formal system specification to be just as important as the destination.

There were other micro-decisions like this that I’ve not documented in this article which helped me conceptualise my system.

There were also decisions I’m not happy with yet: for example, the formulas for servicing/arriving for calls are very similar in nature to those for Go requests, and I wonder if I should refactor them... but then I am also aware that I’m not trying to write clean code here, I’m trying to model a system’s intended behaviours with as much clarity as possible. So the jury’s out on that one! I’m sure the dilemma will evolve as we add complexity to the system.

A couple of gotchas

There were a couple of things I noticed were wrong as I was writing my specification. Did you notice them too?

I’ll let you try to find them. If you spotted them straight away, then well done, you are clearly able to reason about multivariate state machines better than most! But even if you are feeling smug about that, I hope you can also see that spotting anomalous behaviours in a system will only get exponentially harder as we increase the complexity. After all, this is a really, really simple elevator!

If you didn’t spot them, sorry but I’m not going to reveal them yet. I want to see if the next iteration of our specification helps us to tease out these anomalies.

The model checker didn’t spot these gotchas because our model is currently very naive. Our invariant check formula only checks for things that really ought to be true unless something went badly wrong (e.g. we somehow managed to send our elevator to floor 3 even though we don’t have one).

It doesn’t check for more subtle behaviours that might result in our elevator behaving erratically or counterintuitively for the passengers. For these we need more sophisticated checks — and this is where the T in TLA+ comes in. We need temporal checks (i.e. things that we want to be true across an end-to-end behaviour, rather than true simply for all reachable states).

This will be the subject of the next blog post in this series.