Understand Paxos with Rust, Automerge, and TLA+ — Part 1: The Synod

Gregory Terzian
12 min readSep 25, 2023

What a computing device does next depends on its current state, not on what steps it took in the past. Leslie Lamport, in Teaching Concurrency.

Paxos is a consensus algorithm discovered by Leslie Lamport in the early 90s. As often happens with relevant new ideas in a field, it initially elicited the “nothing new” response: publication was delayed by about a decade. Entitled “The Part-Time Parliament”, the paper uses the metaphor of a parliament in ancient Greece — one in which lawmakers are too busy with other pursuits to attend sessions regularly — to present a solution to the (non-byzantine) consensus problem.

For some reason — perhaps just because it tends to be repeated — the algorithm comes with the reputation of being hard to understand. When I first started looking seriously into distributed systems, I agreed. Raft — another consensus algorithm published in 2014 by Diego Ongaro — appeared to me as more understandable. However, three years later, I am not so sure anymore.

What changed is that I made efforts to learn TLA+, and have become increasingly aware of the need to think outside the box of code(still a work in progress). And when I told myself three years ago that I “understood Raft” — having read the paper, some blogs, and toyed with an implementation — I actually had no idea of what I was talking about. My understanding, if there was any at all, was far shallower than I thought.

So, this blog post will try to describe what I think is necessary to understand Paxos, and for that matter Raft and any other concurrent algorithm as well. This will be done “diary style”, as I am writing my own TLA+ specification of Paxos, and implementing it in Rust.

The full spec and code is available at https://github.com/gterzian/understand-paxos/.

Your behavior is impossible

There are, as far as I am aware, two different ways to reason about a concurrent algorithm:

  1. Behavioral reasoning: understanding the algorithm from the perspective of a series of steps.
  2. States and invariants: understanding the algorithm from the perspective of a given state, and the invariants that should be maintained at each step that can change that state.

A behavioral proof consists of assuming an illegal state, and then finding a contradiction by running through a series of steps. Below is an example of a behavioral proof (you can skim over it).

Excerpt from page 109 of Chapter 4 of this excellent course on concurrency.

When I read the above, as I was going through these excellent course material, my hearth sank: it’s a toil to read. Fortunately, the author of the course goes on to state the below:

The author then gives an excellent description of the second type of reasoning about concurrency.

What is this inductive invariant that the author refers to? There are two types of invariant:

  1. The “boring” invariant that expresses the correctness of your algorithm. For example, mutual exclusion can be expressed as the invariant that only a single process can be in its critical section at a time.
  2. The more interesting inductive invariant that expresses why your algorithm is correct — why can only a single process be in its critical section at a time?

The difference between the two invariant can be explained as follow: in a given algorithm, one can come up with a state A in which the boring invariant holds, followed by state B in which it does not. This does not mean that the algorithm is incorrect, because it so happens that state A is not reachable. On the other hand, the “inductive” part of an invariant means that if it holds in any state, it must also hold in any subsequent state. The inductive invariant explains why state A is not reachable.

The key that unlocks understanding of an algorithm is finding the inductive invariant that implies(is stronger than) the boring invariant that is the safety property of your algorithm. And, it turns out, looking for that inductive invariant is fun.

A practical guide to finding, and proving, an inductive invariant has been provided by Lamport in “Proving Safety Properties.”

Raft is not so simple

Good software design consists of breaking a problem into independent parts, and reducing the state space as much as possible. So, on the face of it, the design choices of Raft — problem decomposition and simplification of the state space — seem to make sense. But, good software design is not the same thing as good concurrent algorithm design.

When thinking behaviorally — through a series of steps — reducing the state space(the potential steps), and breaking the algorithm into parts(each with its own behavior) may give you the impression of simplicity. But, behind this impression, an incomplete understanding of the problem often hides.

When following the more robust invariant-based approach, the design goals of Raft make less sense: if you are not thinking in terms of steps, why reduce the state space? And if an invariant is best expressed as a global concept, why try to break it into pieces?

In the Raft user study — where students where first taught and then quizzed about both Raft and Paxos — all questions and answers come in the behavioral flavor: it may well be that Raft did a good job of making itself more understandable from a behavioral perspective. But how many participants can state the inductive invariant that makes Raft tick? If taught to think differently, would students perhaps find Paxos to be simpler? This leads me to conclude that the simplicity of Raft over Paxos is an illusion, and that it is rather programmers that need to make an effort and learn to think differently about concurrent systems.

So, lets take a first step and try to understand Paxos: by focusing on states and invariants.

Paxos Revisited — The Synod

Using Paxos to build a replicated state-machine relies on a core algorithm: the single-decree synod algorithm that enables participants to reach consensus on a single value. This core can then be expanded to reach consensus on a sequence of values, which can be used as the log of commands for a replicated state machine.

We will first start with the basic algorithm, and cover it’s expansion in a subsequent article. The algorithm we will cover is The Basic Protocol described in 2.3 of the original Paxos paper.

Participants in the protocol each own a ledger, whose goal is to record a single value on which consensus has been reached. The “back of the legder” is also used to store certain important information on the state of the algorithm; a “sheet of paper” is used for less important information that can be lost. Participants communicate by sending each other messages, using rather unreliable messengers resulting in potential loss, re-ordering, and duplication of messages. But the messengers are trustworthy: they do not temper with messages.

Each participant will try to get a value chosen by others, while also accepting values proposed by others. Eventually, consensus can be reached on only a single value. This can be informally described by breaking the proposals into separate phases:

  1. A participant first tries to get a quorum of participants to commit to a given ballot, identified by a unique number.
  2. Once a quorum has been gathered, the participant will choose a value for the quorum to vote on. The choice is made by taking the previous vote with the highest number in the quorum, or, if there were no previous votes, by taking any value. After making this choice, the participant will ask each member of the quorum to vote on the chosen value.
  3. When a participant receives a proposal issued by 2 above, it will vote on it, unless it has already committed to a higher numbered ballot. A successful vote is communicated back to the participant who proposed it. Note that a participant can still commit to a new ballot after having voted, but that then the previous vote will be taken account in the choice of value to be proposed for that new ballot(2 above).
  4. The proposing participant will only write the value to its ledger if all members of the quorum reply with a positive vote.
  5. Once a participant writes a value in its ledger, it communicates this to all other participants, who will write it in their ledger as well.

Following this informal, and rather ambiguous, description, we will get straight to the point by writing a TLA+ specification, defining a safety property for the spec, and making an attempt to find an inductive invariant implying the safety property.

The TLA+ Spec

The spec consists of several actions, representing the atomic steps participants can take to go from one state to the next. As an example, below is the step that represents a participant committing to a given ballot, and only doing so if the number of that new ballot is higher than the last one committed to.

Note that we use a set of messages to represent the fact that messages can be received in any order, and can also be duplicated.

We also define an action representing a participant losing all it’s data, except that “written on the back of the ledger”, as well losing all messages addressed to this participant.

Finally, all actions are put together as the next-state action.

We will not go through all the details of each action, instead focusing on the invariants of the spec.

The first one is the invariant asserting type-correctness. It is perhaps the most boring of them all, but it is important.

Note that this invariant is in fact inductive, but it does not imply the safety property of our spec, so it is not the inductive invariant we are looking for(although it forms part of it: when checking the inductive invariant we will find, we will always conjunct this one with it).

The safety property is expressed as the invariant that if any participant has written a value to its ledger, then all others must either not have written a value(yet), or have written the same value. I call this one the CoherenceInv.

This invariant is not inductive: one can come up with a state in which it would initially hold, followed by another step in the Next action in which it would not. So, the hard part is finding the invariant that is inductive and implies CoherenceInv. I call it PaxosInv.

Now there’s a lot there, so let’s explain how we got here.

The first thing I did was thinking: what is necessary for a given participant to write a value to its ledger? The answer is: it requires a quorum to have voted for that value. This is expressed by ledger[p] \in Number => majorityVoted, which reads as: “if a participant has written a value to its legder, it implies that the value was voted on by a majority”.

Remember that “implies” means “is stronger than”, which means that even if a participant has not written a value to its ledger(yet), it can still have been voted on by a majority. But the opposite, that a value was written into a ledger but not voted on by a majority, is not.

To this, I’ve added the conjunction that says ledger[p] \in Number => higherVoteSameValue. This is true because even after a participant has written a value into its ledger, another higher ballot could still take place, however that ballot will be for the same value. Why? Because any other majority quorum will end-up voting for that value, since it will include at least one member of the quorum that voted for that value and it will be the highest vote among the new quorum.

That “the value of the highest vote among the quorum is always voted by on the quorum, if any value is voted on” is expressed by the second part of the invariant.

This essentially says: “for a given ballot(regardless of whether it forms a quorum by the way), from the perspective of a participant p, for all other participants pp in the same ballot, if p has a higher numbered previous vote than pp, than no other participants in the ballot will vote on pp’s value, whereas if pp’s previous vote is higher, than no other will vote on p’s.

Now, am I hundred percent sure that this is the invariant I am looking for? The answer is: no. I have used the three “quick and dirty” methods for checking it using TLC as described in Proving Safety Properties(expressed through the three difference theorems in the module), but I have not used TLAPS to write a “rigorous informal proof”, and I have definitely not written any “formal, machine-checked proof”. Those are fights for another day. What I do know for sure, is that, by focusing on trying to find an inductive invariant, I am in the right direction.

I’ve also noticed that, when trying to find this inductive invariant using TLC, each time TLC fails, you really only need to look at the last state. Only when there is still a bug in the spec itself do you need to trace back and read through a sequence of states to understand what went wrong(how on earth did it get into that state?) But once those bugs have been ironed-out, it really is a matter of only looking at the state in which the pseudo-inductive invariant fails. That last state contains all you need to know to make a change and try to make the invariant pass. Working like this is fun; much more fun than running scenarios in your head about previous steps taken by the algorithm.

Now the time has come to implement our single-decree synod using Rust.

The Rust Implementation

Regarding the Rust implementation, I’m tempted to say: “nothing to see here,” but we’ll go through some features of it quickly anyway.

We’re using Automerge wrapped by Automerge-repo, a setup introduced in a previous article. The main advantage of this is that we don’t need to mess with a network layer: instead of sending and receiving messages, we work directly with an automerge document, and watch for changes to it. Each participant owns a piece of the document: sending a message is replaced by making changes to it, and the equivalent of receiving a message happens when other participants notice that change.

Each participant process runs three tasks:

  1. A proposer task, that starts new ballots and tries to get them voted, writing the value to its ledger if successful.
  2. An acceptor task, that watches the document for new ballots and votes for them under the right conditions.
  3. A learner task, that watches the ledgers for changes, and writes incoming changes to its own.

As a quick example, below is the loop of the learner task.

Once again, I found modelling peer communication through an Automerge document to be straightforward.

One thing to note is that Paxos requires the values written “on the back of the ledger” to be persisted to stable storage before sending out any message with regards to those values. I then realized that in Automerge-repo, sending out sync messages and saving the document is done concurrently, hence if a crash occurred before a completed save, but after sync messages had been sent out, then the safety of the algorithm would be broken.

For now I solved this with an implementation of Storage that blocks the repo backend until saving is completed, and by making a change to the backend that ensures that for a given change, saving is done before sending out any network messages. Blocking the backend is not good for throughput, and doing this properly is under discussion.

Update: the above is nonsense, since the storage implementation uses a temporary directory that will be deleted once the application exits(I assume that includes crashing). For now I will not make the example crash-resistant, instead in Part 3 I switched to simulating crashes.

So, I hope you now have a good understanding of the single-decree synod algorithm of Paxos, and that you do not find it too complicated. I also hope that I’ve given you a taste of what it’s like to work with inductive invariants, and have convinced you that it’s worth a try for your next concurrent design.

In a follow-up article we’ll look into expanding this work into a replicated state machine built on top of Paxos. Thanks for reading, and stay tuned.

--

--

Gregory Terzian

I write in .js, .py, .rs, .tla, and English. Always for people to read