The Paxos algorithm
For one of my personal development projects I’ve been looking at implementing the Paxos algorithm.
Paxos is designed to allow a consensus value to be reached over an unreliable network. In such a network, each node might have a slightly different understanding of the overall picture. Paxos provides a way to be confident of an overall ‘bigger picture’, even if individual nodes have different data.
The principle behind Paxos is that each node can suggest a value to be the new consensus. A series of messages is sent in order to establish agreement between the suggesting node and the other nodes on the network. If a majority agree, then that value is considered the consensus value.
Paxos is quite complicated, so let’s see if we can use a more concrete example to understand how it works. We’re going to use Paxos to get a network of nodes to agree on the first sentence of ‘Alice in Wonderland’, by writing it one word at a time.
Let’s say we have a network of 10 nodes. The first node sends a suggestion message to all the other nodes with the first word, ‘Alice’. All the nodes agree, and send an acknowledgement back. The first node counts the acknowledgements, until it has received at least 6 — the majority. Once that happens, it sends a ‘consensus reached message’ telling the entire network that consensus value has been reached. All the nodes then update their view of the consensus to be ‘Alice’.
This is the simplest case, where every node on the network is functioning. But what happens when some of the nodes are down during this message cycle? Let’s see:
- Node 2 broadcasts a suggestion with the next word, ‘was’
- 3 of the nodes acknowledge the suggestion
- The remaining nodes are unavailable for whatever reason
- Node 2 is now waiting for a consensus to be reached…
Obviously we don’t want nodes to wait forever in such a scenario. Therefore Paxos doesn’t introduce such a constraint.
Suggestion messages are give a priority according to how recent they are. If another node decides to re-suggest the word ‘was’, perhaps at a time when more nodes are available, a consensus could be reached. Node 2 would see that the newer suggestion takes priority, and forget about attempting to reach consensus for its suggestion.
The final scenario I’ll discuss here is how conflicts are resolved. If two nodes suggest a contradicting value, then which one wins? Let’s see how this plays out:
Node 1 will suggest the word ‘thinking’, and Node 2 will suggest ‘beginning’. The suggestion messages are both broadcast at the same time, and some nodes agree to each suggestion. Node 1 and 2 are both waiting to see if they will get majority agreement for their values.
The key to resolving such an issue is actually quite simple — a majority of nodes must agree. Since messages are given priority based on how recent they are, it’s possible for a node to agree to multiple suggestions. A node only updates its state once it receives a ‘consensus reached’ message. If the ‘consensus reached’ message is associated with an old suggestion, then it will just ignore it. (even if the competing suggestions have the the exact same timestamp, another arbitrary measure of priority can be used, such as the nodes’ ID on the network).
Of course, it’s possible that no consensus is reached due to conflict, or unreliability in the network. That’s actually perfectly fine — we only want the value to update once consensus is reached. Nodes just have to continue suggesting their values until the network health improves.
This is a very simple rundown of the algorithm, and there are many edge cases not described here. However, I hope it’s enough to pique your interest in the algorithm. For a more detailed explanation I recommend this article. I followed the spec here in order to write my implementation.
My main insight around Paxos is how just a few simple rules can produce quite compelling results. Even though I didn’t understand it at first, by following the specification and almost blindly implementing a few basic rules of the algorithm I was able to get a better insight into how it works. Simply having written some code enabled me to get a better understanding of the fundamentals. As my work progressed, I was then able to identify and fix bugs that cropped up, which gave me more confidence in my understanding.
You can find my implementation of the algorithm here.