Understand Paxos with Rust, Automerge, and TLA+ — Part 2: Election

Gregory Terzian
8 min readOct 2, 2023

--

In Part 1, we went through the core single-decree synod algorithm of Paxos. Some readers may have noticed one issue with it: competing proposers could keep trying to get acceptors to commit to ever higher ballot numbers, thereby preventing each other from making progress.

Indeed, Paxos guarantees safety — only a single value will ever be chosen — it does not guarantee liveness. To ensure liveness, we need to add a little feature to any practical implementation: leader election.

Participants waiting for the election to start. Image copyright: copyright is believed to belong to the distributor of the Johnie To film, Election(2005), Tartan Films (United States) Optimum Releasing (UK) China Star Entertainment Group (China), the publisher of the film or the graphic artist

To Paxos or not to Paxos

Reading Paxos Made Simple — the paper with the notorious ambiguity — I stumbled upon the below statement:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

My initial reaction was: “how?” Which was followed by the idea that “the algorithtm chooses” means “Paxos chooses.” Eureka! Since Paxos can be used to reach consensus on a value, we can use it to reach consensus on the leader.

While this may be one possible way — if you squint you can see that Raft leader election is in fact Paxos — it comes with the same liveness problem mentioned earlier. While a change of leadership should be rare, each occurrence could leave your implementation unable to make progress — hardly a comforting thought when you’re leaving the office for the weekend.

In the original Paxos paper, Lamport hints at a more robust way to do election, and we’ll take our inspiration from that. But first, an explanatory note. When Lamport is merely hinting at how to do election, that does not mean he left out an important part of the Paxos algorihtm; rather, it is because of two things:

  1. Leader election is, compared to consensus itself, a trivial problem to solve.
  2. Leader election is separate from Paxos.

This is what makes Paxos simple, and the original paper a joy to read: the unimportant parts are left out.

Finally, the safety of Paxos is guaranteed no matter what election algorithm you choose to pair it with. Since the downside is limited, this calls for aggressive experimentation, and a bit of fun: let’s design, using TLA+, and implement, using Rust, our own election algorithm.

Full spec and code available at https://github.com/gterzian/understand-paxos

Specifying election

Our election will be based on two premises:

  1. Rank, by participant name.
  2. Whoever happens to be leader and actively participating.

The second part is meant to prevent a leadership change when a higher-ranking participant (re-)joins the implementation.

Update: the above is nonsense. The protocol described here does not fulfill the second requirement. I’ve become aware of this in Part 3, and fixed it there.

Let’s go straight to the point, and specify our algorithm with TLA+. Since the spec is small, and it’s not based on a precise description found elsewhere, we will go through each action.

First, the Init action, which defines system “startup”.

This also introduces the variables used in the spec:

  1. epochs is a way to track time, using a kind of Lamport timestamp.
  2. leaders is a way to track whether a participant considers itself to be leader.
  3. confirmed is something I added just to make some assertions about the spec. This time I actually struggled coming-up with the “boring” invariant, instead I found it more natural to find one in the inductive style. So I added confirmed just to enable a boring invariant.

Note that both epochs and leaders are doubly-scoped by participant, which means they are local views of what participants believe. But, we can still make global assertions by making comparisons across participants.

The second action shows a participant becoming a leader.

The decision to become leader is made using only the local view of the participant, and the epoch is incremented.

The third action shows a participant syncing with others: the equivalent of receiving a message containing other participants views at a given time. This matches nicely with the way we will implement the algorithm using Automerge wrapped by Automerge-repo.

In this action, the participant uses the global data to:

  1. Set it’s epoch to the max of all epochs.
  2. Determine, if it was a leader to begin with, whether it still is a valid leader.

A side-effect of this is setting confirmed to be the set of all participants that the participant believes to be leaders following the synchronization(we’re using the primed leaders). As we shall see, this is used purely for making correctness assertions.

The following two actions provide us with: a heartbeat mechanism, where the participant increments its epoch, and with the crash of a participant, resulting in loss of all local data.

Finally, we put all actions together into the Next action.

Now we need to make assertions about the correctness of the spec. The boring invariant is the below.

This asserts that the set of confirmed leaders should never contain more than one participant: there can be multiple participants who believes themselves locally to be leaders, however following communication with each other, only one “true” leader should remain. This invariant asserts the that. The more important part is the why, asserted below by the inductive invariant I found(once again I am only somewhat confident it is inductive, and that it implies the safety property above).

In English, it means: for a given participant, if the participant believes itself to be leader:

  1. For all other participants, if the two have synced and the other also believes itself to be leader, then the “other” is in fact the same one and only.
  2. For all other participants that are truly “others”, syncing would imply “accepting an operation”, which in this case means stepping down if the other believed itself to be leader(Maybe I should rename this to WouldStepDown).

As a result, there can be only one “true” leader.

There can be only one. Image copyright: copyright is believed to belong to the distributor of the Johnie To film, Election(2005), Tartan Films (United States) Optimum Releasing (UK) China Star Entertainment Group (China), the publisher of the film or the graphic artist

The Rust Implementation

Once again, we’ll use Automerge wrapped by Automerge-repo allowing us to focus on a kind of “syncable semi-shared state” instead of network messages. This state is shown below.

Compared to the spec, we work directly on a map of participants: an automerge document is by default a local view of global data, so we can dispense with the “map of map” found in the spec.

This highlights a key feature of specifying, and implementing, systems: the need for human judgement. When writing a spec, you have to decide that the spec reflects the important properties of the implementation under consideration. And when implementing the spec, you have to decide that the code reflects the key properties of the spec.

It would be nice to be able to express this judgement formally: this piece of Rust code(itself a specification of a physical system of silicon) implements this TLA+ spec under an appropriate refinement mapping. It would be even nicer if a first draft of that could be generated automatically. But even then, human judgement would be required on whether the system satisfies certain needs, and what those needs are in the first place(what problem are you solving?)

Back in our implementation, each participant runs two tasks:

  1. A heartbeat
  2. The leadership algorithm.

The heartbeat increments the epoch of the participant, as seen below.

The leadership algorithm is a combination of the BecomeLeader and UpdateView actions found in the spec, reflecting the fact mentioned above that the implementation operates by default on a local view.

The interesting bit is that I wanted to add some sort of assertion of correctness to the algorithm, and as I struggled with it I finally came-up with what you can see below.

If running the algorithm determines that the local participant has been newly elected, we re-run the algorithm — based on the updated state — for each other participant, and assert that any other leader would step down, and that no other follower would consider itself to have been elected.

Now off-course, by the time the other participants receive this update on their end, their local state may have changed as well: they can then run that same algorithm and determine that others, based on the current local state, would accept their update.

Locally, multiple participants may consider themselves to be leader, but communication will see them converge, for a given state, to a single leader. And since a leader wanting to participate in Paxos needs to communicate with others, the “true” leader will emerge directly from that communication.

I also have a feeling that running that little simulation above will useful to determine whether an incoming Paxos message should be taken into account or ignored. We shall see, as we move into our third, and final, part: the multi-decree parliament.

Although I write “final”, many appendices may follow, dealing with such mundane matters as client interaction, log compaction, or distributed locking.

Thank you for reading, and stay tuned.

--

--

Gregory Terzian

I write in .js, .py, .rs, .tla, and English. Always for people to read