Understand Paxos with Rust, Automerge, and TLA+ — Part 3: Multi-Decree

Gregory Terzian
5 min read · Oct 18, 2023


In the first part of this series, we covered the single-decree synod algorithm of Paxos. In the second part, we covered a mechanism for leader election. Now, in part 3, we will specify, using TLA+, and implement, using Rust, a replicated state machine using multi-decree Paxos paired with leader election.

Full spec and code available at https://github.com/gterzian/understand-paxos

The Spec

This time, writing the spec was the easy part. As it turns out, expanding the single-decree synod — where participants reach agreement on a single value — into an algorithm that covers multiple decrees — where participants reach agreement on a sequence of values using numbered instances of the synod — is straightforward.

Where data structures — read: functions — were indexed by — read: had a domain of — participants, they are now indexed by participants and by instances of the algorithm.

This is highlighted by the next state relation below.

https://github.com/gterzian/understand-paxos/blob/48961da3d5c85b57fd5b73d5e4a8274d2b56c160/multi_decree/MultiDecreeParliement.tla#L257

The first three steps of the synod algorithm — ChooseBallotNumber, HandleNextBallot, and HandleLastVote — are each performed per participant: once per change of leadership. The other steps are performed per participant and per instance of the algorithm. This means that any number of instances can be ongoing at the same time and reach agreement in any order, which, combined with leadership changes, can result in temporary gaps in the sequence of values agreed upon (a later instance reaches agreement before earlier ones, followed by a leadership change).

To avoid confusing clients of the system (more on that below, as part of the implementation), such gaps are filled with the “olive day decree”: a no-op decree. This mechanism can be seen below, in the definition of the BeginBallot operator (relevant parts highlighted).

https://github.com/gterzian/understand-paxos/blob/48961da3d5c85b57fd5b73d5e4a8274d2b56c160/multi_decree/MultiDecreeParliement.tla#L157
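To make the gap-filling idea concrete, here is a minimal Rust sketch — not the repo's actual code; the `Command` type and `fill_gaps` helper are hypothetical stand-ins. Any undecided instance below the highest proposed one gets a no-op decree:

```rust
use std::collections::BTreeMap;

// Hypothetical command type; `NoOp` plays the role of the "olive day decree".
#[derive(Debug, Clone, PartialEq)]
enum Command {
    NoOp,
    Increment,
}

// Fill every undecided instance below `up_to` with a no-op decree, so that
// the agreed-upon sequence of values contains no gaps.
fn fill_gaps(log: &mut BTreeMap<u64, Command>, up_to: u64) {
    for instance in 0..up_to {
        log.entry(instance).or_insert(Command::NoOp);
    }
}

fn main() {
    let mut log = BTreeMap::new();
    // Instance 2 reached agreement before instances 0 and 1.
    log.insert(2, Command::Increment);
    fill_gaps(&mut log, 2);
    assert_eq!(log.get(&0), Some(&Command::NoOp));
    assert_eq!(log.get(&1), Some(&Command::NoOp));
    assert_eq!(log.get(&2), Some(&Command::Increment));
}
```

In the spec, the same effect is achieved inside BeginBallot; the sketch only shows the shape of the bookkeeping.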

And that’s it; let’s move on to the implementation.

The Rust implementation

Once again, we’ll use Automerge wrapped by Automerge-repo, allowing us to focus on a kind of “syncable semi-shared state” instead of network messages. That state is shown below.

https://github.com/gterzian/understand-paxos/blob/48961da3d5c85b57fd5b73d5e4a8274d2b56c160/multi_decree/src/main.rs#L595

Each participant runs several tasks:

  1. run_proposal_algorithm deals with leader election, and with ChooseBallotNumber, HandleLastVote, BeginBallot (combined with HandleLastVote), and HandleVoted. It also executes the state machine.
  2. run_acceptor_algorithm deals with HandleNextBallot and HandleBeginBallot.
  3. run_learner_algorithm deals with HandleSuccess, and runs some safety checks.
  4. run_heartbeat_algorithm runs the heartbeat used in leader election.
  5. run_crash_algorithm will, if the --crash flag is passed in for that participant, simulate the participant crashing and later recovering.
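As a rough sketch of this task structure — the repo's actual tasks are async and operate on shared Automerge documents; the empty function bodies here are placeholders — each role runs concurrently and independently within a participant:

```rust
use std::thread;

// Placeholder bodies; the real tasks take handles to shared Automerge
// documents and loop over incoming changes.
fn run_proposal_algorithm() { /* leader election + proposer steps */ }
fn run_acceptor_algorithm() { /* HandleNextBallot, HandleBeginBallot */ }
fn run_learner_algorithm() { /* HandleSuccess + safety checks */ }
fn run_heartbeat_algorithm() { /* heartbeat used in leader election */ }

fn main() {
    // Each participant runs its roles as independent, concurrent tasks.
    let handles = vec![
        thread::spawn(run_proposal_algorithm),
        thread::spawn(run_acceptor_algorithm),
        thread::spawn(run_learner_algorithm),
        thread::spawn(run_heartbeat_algorithm),
    ];
    for handle in handles {
        handle.join().unwrap();
    }
}
```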

The two tasks below act as clients of the system.

  1. request_increment will continuously request new increments from other participants (only getting an answer back if that participant is the leader).
  2. request_read will continuously request a read from other participants (only getting an answer back if that participant is the leader).

The leader will queue up client commands — either read or increment requests — fill any gaps with no-op commands, and then dequeue commands and assign them an instance of the consensus algorithm.

Then, whenever agreement has been reached on an instance of the algorithm, it will write the value to its log, execute the state machine, and send responses to clients.
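A minimal sketch of that leader-side bookkeeping — the `Leader` struct and `assign_instances` method are hypothetical, not the repo's types — dequeuing commands and assigning each one the next instance number:

```rust
use std::collections::{BTreeMap, VecDeque};

// Hypothetical command type covering the client requests described above.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    NoOp,
    Increment,
    Read,
}

// Hypothetical leader-side state: queued client commands, the next free
// instance number, and the commands proposed so far, by instance.
struct Leader {
    pending: VecDeque<Command>,
    next_instance: u64,
    proposed: BTreeMap<u64, Command>,
}

impl Leader {
    // Dequeue each pending client command and assign it the next free
    // instance of the consensus algorithm.
    fn assign_instances(&mut self) {
        while let Some(cmd) = self.pending.pop_front() {
            self.proposed.insert(self.next_instance, cmd);
            self.next_instance += 1;
        }
    }
}

fn main() {
    let mut leader = Leader {
        pending: VecDeque::from(vec![Command::Increment, Command::Read]),
        next_instance: 0,
        proposed: BTreeMap::new(),
    };
    leader.assign_instances();
    assert_eq!(leader.proposed.get(&0), Some(&Command::Increment));
    assert_eq!(leader.proposed.get(&1), Some(&Command::Read));
    assert_eq!(leader.next_instance, 2);
}
```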

Executing the state machine means executing all commands from the first to the last (truncating logs and keeping past execution states is left as an exercise for the reader).

https://github.com/gterzian/understand-paxos/blob/12ff08307ce034c56b510a55ff4d8e51982088a9/multi_decree/src/main.rs#L115
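In sketch form — again with a hypothetical `Command` type, not the repo's actual code — executing the state machine is a replay of the contiguous prefix of the log, stopping at the first instance that has not yet reached agreement:

```rust
use std::collections::BTreeMap;

// Hypothetical command type for the replicated counter.
#[derive(Debug, Clone)]
enum Command {
    NoOp,
    Increment,
}

// Replay the log from the first command onward: increments bump the counter,
// no-ops leave it unchanged. Stop at the first instance that has not yet
// reached agreement, since later commands cannot safely be executed yet.
fn execute(log: &BTreeMap<u64, Command>) -> u64 {
    let mut value = 0;
    for instance in 0.. {
        match log.get(&instance) {
            Some(Command::Increment) => value += 1,
            Some(Command::NoOp) => {}
            None => break,
        }
    }
    value
}

fn main() {
    let mut log = BTreeMap::new();
    log.insert(0u64, Command::Increment);
    log.insert(1, Command::NoOp);
    log.insert(2, Command::Increment);
    // Instance 4 decided early; instance 3 is still a gap, so 4 is not executed.
    log.insert(4, Command::Increment);
    assert_eq!(execute(&log), 2);
}
```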

Why are gaps filled with no-op commands? Imagine a client requesting an increment first, and a read second. Say the increment is assigned to the 150th instance, and the read is then used to fill a gap at the 149th (remember, instances can reach agreement in any order). The client could then see a lower value on the read, after having received a higher value in response to the increment request.

Such a scenario would still be a successful consensus — all participants would agree on the same sequence of commands — but for that particular client it could be rather confusing: so gaps are filled with no-op commands. Incidentally, our readers and incrementers always check the value they receive back:

  1. In the case of an increment, it must be strictly higher than the last value seen.
  2. In the case of a read, it must be equal to or higher than the last value seen.
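Those client-side checks can be sketched like this — the `Client` type here is hypothetical; in the repo the checks live inside the request_increment and request_read tasks:

```rust
// Hypothetical client-side state: the last counter value seen.
struct Client {
    last_seen: u64,
}

impl Client {
    // An increment response must be strictly higher than the last value seen.
    fn check_increment(&mut self, value: u64) {
        assert!(value > self.last_seen, "increment went backwards");
        self.last_seen = value;
    }

    // A read response must be equal to or higher than the last value seen.
    fn check_read(&mut self, value: u64) {
        assert!(value >= self.last_seen, "read went backwards");
        self.last_seen = value;
    }
}

fn main() {
    let mut client = Client { last_seen: 0 };
    client.check_increment(5); // strictly higher: ok.
    client.check_read(5); // equal: ok for a read, not for an increment.
    assert_eq!(client.last_seen, 5);
}
```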

Another thing to note: I had to adapt the leader election mechanism for it to work as I wanted (the one described in part 2 would see the leadership switch each time a higher-ranking participant restarted). The core of the algorithm can be seen below.

https://github.com/gterzian/understand-paxos/blob/c2a3153ce9fee8f081bbfdb3dc6bd7e58c71640d/multi_decree/src/main.rs

Fun fact: as I was experimenting with various leadership algorithms, and seeing some erratic behavior, safety was preserved at all times.

What is simplicity?

Going through the exercise of writing this series, the hardest part was, hands-down, understanding the single-decree synod. Somehow, cramming that intuition into your brain takes effort. But, once you do understand it — and for that nothing beats writing your own spec, without looking at existing ones, and trying to find that inductive invariant by yourself — it feels like having learned to ride a bike: you’ll never forget it.

Expanding the synod to cover a sequence of values is, in comparison, a piece of cake. Same goes for leader election, and for implementing a (primitive kind of) state machine. In all of these efforts, you can either rely on the synod algorithm to cover your bases, or on the fact that you’re dealing with a separate problem.

This, to me, is simplicity: a core that may take some effort to understand, with the rest either building on top of it, or being separate from it. Simplicity also allows for variation: since the original paper was published, Paxos has been expanded into Disk Paxos, Cheap Paxos, Fast Paxos, Vertical Paxos, Stoppable Paxos, Byzantine Paxos and its leaderless variant, and Generalized Paxos (next on my list to print out for morning-coffee mental gymnastics).

And so I conclude: Paxos is indeed simple, but understanding it takes some effort.


Gregory Terzian

I write in .js, .py, .rs, .tla, and English. Always for people to read