A TLA+ specification for Gryadka

A few months ago, there was a post on HN about something called Gryadka. It claims to be a Paxos variant that provides a rewritable compare-and-swap (CAS) register. The best description of the algorithm I could find was in this post, but even that is somewhat tricky to decipher.

When dealing with non-trivial distributed systems, I find it best to write down the algorithm as a TLA+ specification. This is useful in order to specify the algorithm in unambiguous mathematical terms, instead of prose. Reasoning about distributed systems is hard — I wasn’t able to convince myself that it obviously works, even understanding how Paxos itself works. Writing a TLA+ spec means you can model check it to get confidence that it actually works. I hope this will be useful as a practical example of how to write TLA+ specifications.

Without further ado, here is the full specification: https://gist.github.com/grogers0/c7e87f9dfe58c6070b19db9d3c073b72

Here you can see the constants being defined. In TLA+, constants are essentially parameters to the specification. When model checking, you can vary the constants to e.g. change the size of the simulation. Assumptions are clauses that are checked once, and if the assumption fails, TLC will give you an error and refuse to run. I typically use them as assertions about the constants — you can see various type checking assertions as well as assertions to ensure that all quorums intersect by at least one element.

Here I define the variables. In TLA+, each successive action changes the variables to new states. Here, ops and history are artifacts of the way I’m creating the specification. In order to ensure certain invariants, the actual order of execution (history) is a legal ordering according to the timestamps of individual operations (ops).

Here is the type-checking invariant for the variables (TypeOK). This is a common practice in order to document the types of each variable. For instance, you can see each possible message type and that the per-acceptor variables are functions from the acceptor to some value/timestamp.

I also define the initial state of each variable here. Particular to note is that while most variables are initialized to some base starting state, ops is initialized to the actual operations that will be attempted. It might seem strange to determine which operations will happen ahead of time, but it simplifies things quite a bit. Without getting too philosophical, at any state of execution of the actual system, it would be impossible to tell whether the executed operations were predetermined or not. Because TLC checks every possible starting state, every sequence of operations that could possibly happen is checked.

Another important thing to note is that reads aren’t explicitly modelled here — all operations are CAS. In actuality, reads in Gryadka are handled the exact same way as CAS(x, x), where the old value x is discovered during the prepare phase and the new value is the same value x. In this specification, the value being read is assumed to be known ahead of time, similar to how the old value of a CAS operation is chosen ahead of time. Treating reads as CAS operations is an inefficiency in the algorithm — it means that any read prevents all other concurrent reads and writes from completing. An ideal system would handle reads in a wait-free manner, while still maintaining obstruction-free CAS operations. I think this might be possible with an extension to the protocol, but I haven’t explored it in depth (so it probably isn’t — distributed systems are hard).

This is the next state relationship. This is the last piece of boilerplate that you’ll need when writing a specification. Every possible next action that could happen is listed. Either the proposer prepares a new operation, accepts an operation that’s received a quorum of “promise” responses, or acknowledges to the client an operation that’s received a quorum of “accepted” responses. Processing messages happens similarly — for any message in the buffer, try to process it.

I’ll explain what each of these actions do in more detail below, but remember that each of these represents one single atomic transition between states.

Along with a helper that checks which messages have already been sent, this is the prepare phase. The precondition is that we haven’t already sent any prepare messages for this timestamp. In practice, this means that each timestamp is unique and attempted only once. This is a more restrictive condition than is strictly needed — it isn’t even possible to implement the algorithm the way it is shown as we would need to see messages that other proposers sent. That’s okay! The implementation doesn’t have to exactly match the specification. It is usually desirable to make the specification as simple as possible. Lamport uses the same trick in his Paxos specification. Choosing the level of abstraction is a hard part of writing specifications. This specification also says nothing about durability or recovering from crashes. Creating such a specification is possible but would be more complicated.

The prepare phase here is identical to Paxos, except that we also keep a record in the history to be able to check later that our register is linearizable.

Here again when processing a prepare message, the Gryadka algorithm is the same as a plain-old Paxos write-once register. The precondition is that the acceptor’s current prepared timestamp must be lower than the received timestamp. Remember that if any precondition is false, none of these actions are taken. One other thing to note is that a message is sent back to the proposer, but the received message that initiated this action isn’t consumed. That simulates network duplication — we rely on the precondition for correctness.

After receiving a quorum of “promise” messages that satisfy the old value of our operation, the proposer sends out the “accept” messages. This is the difference between Gryadka and a write-once Paxos register. The value to be accepted is the new value of the CAS operation, while in Paxos we would send the PromisedValue as the value to be accepted (or if that was some “null” value, we would accept any value we desire). This is a tiny change, but it completely changes the algorithm. In my opinion, this is why distributed systems are hard to reason about — the effects of one small change have a massive effect on the overall system.

One thing to note here is that PromisedValue is always a value because we initialized the system with an initial value at all nodes. Incorporating a “null” value would also be possible but would complicate the specification here.

Processing the accept message at a given acceptor is straightforward, the timestamp of the message must match the timestamp the acceptor is prepared for. If it does match, the acceptor’s state is updated and an “accepted” message is sent back to the proposer.

After receiving a quorum of “accepted” messages, the proposer can respond back to the client that the value has been changed. This finalizes the operation, so the successful response is recorded in the history. That’s it! The core of the algorithm is only a few dozen lines of math and much easier to digest than the python code in the original post.

One thing you might notice is that this specification doesn’t include any sort of negative acknowledgement. Any operation that never completes successfully — either because of message delays, preconditions not being met, or any other reason — is simply never responded to. Again, that’s a deliberate decision to simplify the specification. You probably wouldn’t implement a system this way in the real world. We could extend the specification to simulate NACKs, but it doesn’t change the essence of the algorithm.

So we have our specification, and now we want to do something useful with it. We want to ensure some global invariant of the system — something that holds true in all reachable states. In the case of a CAS register, that means checking that the history is linearizable.

This looks really complicated, but let’s break it down into small pieces. First, some helpers — FiniteSeq and SeqAsSet. The Seq operator in the built-in Sequences module can’t be used in TLC because the set it creates is not enumerable, so instead we need to use the finite version of that. TLA+ is really good at working with sets, but AFAIK there isn’t a way to extract the values of a sequence into a set, so SeqAsSet will come in handy.

This invariant means that there must exist some total order of the timestamps of every operation so that all of the following properties hold:

  • Every operation in the original history that was responded to must be included in the total order.
  • For every invoke in the original history that occurs after a given response, the order of operations in the total order is the same as in the original history (subject to the invoke actually being responded to).
  • Any operation that happens directly after another must have the prior operation’s new value match the subsequent operation’s old value (i.e. the precondition for the CAS is met).
  • The very first successful operation in the total order must have its old value match the initial value.

In order to run the model in TLC, we tell it which invariants to check (Inv), uncheck “deadlock” because our specification intentionally bounded size, and set the values of the constants:

Timestamps <- 1..3
Quorums <- {q \in SUBSET Acceptors : Cardinality(q) > Cardinality(Acceptors) \div 2}
Values <- {"V1", "V2"}
Acceptors <- {"A1", "A2", "A3"}

It’s common to use fairly small values — even this model runs for about 7 minutes on my laptop. One nice thing about model checking is that even very small simulations tend to uncover bugs. This is one downside of model checking as opposed to proving an inductive invariant — you can never be completely sure that a bigger model with more states wouldn’t uncover a strange bug.

Running this, we get no errors — great! One thing to double check whenever you are writing a specification is that the coverage results don’t show any locations with Count=0. This would mean we wrote code in the specification that never even got used — very likely a bug in the specification.

One last thing to check is to play around with some parts of the specification and make sure to get expected failures of the invariant. One example is to change the value chosen in the PromisedValue operator from being the value corresponding to the highest timestamp, to being an arbitrary choice. This should break linearizability because the arbitrary value wouldn’t correspond to the latest committed value (which would have the highest timestamp). When we run that in TLC, as expected, a counterexample is found:

  1. Initial value = “V1”
  2. Prepare(ts=1) # oldval=“V1”, newval=“V2”
  3. RecvPrepare(a=“A1”, ts=1)
  4. RecvPrepare(a=“A3”, ts=1)
  5. Accept(ts=1)
  6. RecvAccept(a=“A1”, ts=1, v=“V2”)
  7. RecvAccept(a=“A3”, ts=1, v=“V2”)
  8. Accepted(ts=1)
  9. Prepare(ts=2) # oldval=“V1”, newval=“V1”
  10. RecvPrepare(a=“A2”, ts=2)
  11. RecvPrepare(a=“A3”, ts=2)
  12. Accept(ts=2)
  13. RecvAccept(a=“A2”, ts=1, v=“V1”)
  14. RecvAccept(a=“A3”, ts=1, v=“V1”)
  15. Accepted(ts=2)

Oops! We successfully changed the value of the register to “V2”, but then successfully commit a CAS(“V1”, “V1”), which should have failed its precondition in Accept. This trace is pretty easy to digest — the two operations aren’t even concurrent! TLC finds the counterexample with the smallest number of steps possible (in breadth-first mode, the default).

Thanks for reading!