Paxos made simple
Paxos, when presented with diagrams and pictures, is very simple. Or is it? (Lamport’s abstract for this paper is just one line: “The Paxos algorithm, when presented in plain English, is very simple.”)
Distributed systems are implemented for high availability and scalability using many commodity machines. Such machines are not reliable entities: they come up and go down quite often. In such systems, there is often a need to agree upon “something”, i.e. to have consensus. What “something” is depends on the context. For example:
In a digitized system (like a bank) operating in many branches and processing many transactions, one needs a consistent view of all the accounts. At any given point, there is a consistent state of the bank, which in turn translates to the correct amounts in all bank accounts. One way to achieve this is to maintain an ordered log of operations across a number of server nodes (one server node is not sufficient because it can go down, impacting availability). The consistent view of the ordered log on these server nodes ensures a consistent view of all the accounts in the system: the bank becomes a state machine that takes as input the current state of all the accounts and an ordered log of operations, and adjusts the accounts to reflect the newest transactions. From here on, we will only worry about the server nodes that need to keep this ordered log consistent.
So each server node keeps its own log, and nodes communicate with each other to establish the order as users deposit or withdraw money from their accounts. With this, one client can connect to one server and another client to a different server, and both still get the same view of the bank’s current state, i.e. the account values. Also, keep in mind that failures are non-Byzantine, i.e. the bank’s server nodes are not incentivized to collude and maliciously alter the log. (As a side note, the Bitcoin network needs to solve a similar problem, but its failures can be Byzantine due to the trust-less nature of the nodes.)
Now let’s see how all nodes can agree on a consistent view of the log. A simpler problem to solve first is how nodes can agree on a single unalterable value, i.e. “Transaction X did Y”. (Here X is the index into the log; let’s ignore that for now.) Y is the value that the network of nodes needs to agree on and commit at index X in the log on all nodes. Let’s call this basic Paxos. Once we have established it, we can use it as a building block to achieve the distributed ordered log.
Consensus problem definition — Basic Paxos
- Even if only a single value is proposed, the network eventually recognizes it and chooses it. (C1)
- Once a value is chosen by the network, it cannot be overwritten. (C2)
- Nodes don’t hear that a value has been chosen unless it has really been chosen by the network. (C3)
Components and states of a Proposal Value
In this algorithm, there are two main components: Proposers (P) and Acceptors (A). There is also a Listener (L), but it is less interesting for the majority of the discussion. Each node acts as all three. Proposers propose values based on client requests, like depositing some money. Acceptors accept values, and when there is a majority/quorum, the proposed value is chosen and committed to the log. A proposed value thus goes through the following states. Chosen is the final state of a value, and there can be a lot of back and forth between P and A before the value gets chosen. See figure 1 below.
Nomenclature on diagrams:
- N1, N2 are server nodes.
- Figures represent what each server is doing, from left to right, as time elapses.
- A dot generally represents a proposal of a value by the given node at the given time on the time scale.
- Propose/prepare and Accept/Choose are synonyms.
Figure 2 explains how a lack of majority can occur in a network. Since there is no majority, a chosen value doesn’t emerge.
Role of Proposer(P) and Acceptor(A)
Proposers propose values and acceptors accept them. Let’s define initial RPCs to accomplish consensus.
- P calls a prepare(v) RPC on all As. Some As will return some values, but let’s ignore those for the time being.
- Once a majority of As return, P can call accept(v) on those As (depending on the return values from step 1). It is possible that the return values don’t lead to a consensus, in which case P has to do a new round of prepare(Vnew).
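The two RPCs above can be sketched as message types plus a quorum check. This is a minimal illustration; the names (Prepare, Promise, Accept, has_quorum) are mine, not from the paper, and the returned values are ignored here just as in the text:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Prepare:   # P -> A: "I intend to propose this value"
    value: Any

@dataclass
class Promise:   # A -> P: reply, possibly carrying an earlier value
    ok: bool
    prior_value: Optional[Any] = None

@dataclass
class Accept:    # P -> A: "please accept this value"
    value: Any

def has_quorum(replies, cluster_size):
    """A proposer may proceed once a majority of acceptors replied ok."""
    return sum(1 for r in replies if r.ok) > cluster_size // 2
```

With 5 nodes, three ok replies form a quorum; two do not.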
Evolving the algorithm
Based on the problem definition in the section titled “Consensus problem definition”, the algorithm needs to have the following properties:
Any A has to accept the first value it receives, to satisfy C1. This works out fine when there is only one prepare(v) call. If more than one value is being proposed, it can cause the issue mentioned in Figure 2 above: lack of majority.
So to remedy this and land on some consensus, A must be able to accept more than one value. But that leads to a previously chosen value being overwritten, which violates C2 as defined in the problem definition. See the figure “value-overwritten”: in it, N5 proposes v2 after a value v1 was already agreed upon by the majority.
So to remedy this, in the same figure, when N5 is about to propose v2, it can first check whether a value was previously chosen. If there was, it will propose(v1) and not v2. This is the two-phase protocol: check first, then propose.
This solution still has issues, depending on when N5’s proposal begins. It can begin before any values are accepted or chosen, and the race can surface on either the proposer or the acceptor side. See the figure “acceptor-rejection”: in it, N3 rejects v2, because a later proposal came in trying to overwrite an already agreed-upon consensus.
One more issue with the timing of N5’s second proposal is shown in the figure “older-rejection”. Here v2 lands on the quorum first, and N1:propose(v1) lands later. When N5 began, there was no consensus yet, so it could not have known about v1 in order to propose it instead of v2.
To remedy that issue, node N3 needs to reject the older proposal, i.e. the proposal that began earlier: N1:v1. This can be done by giving each proposal a globally unique sequence number:
- <counter><node-id>, i.e. the lower bits hold the node id and the higher bits hold a counter that increases monotonically.
- Or assign disjoint sets of unique ids to each server node, which they then increase monotonically.
- Servers need to persist these values to disk so that they can recover from a crash and always issue a higher proposal number from their end.
- Acceptors must remember on disk the highest-numbered proposal and the value they have accepted.
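The first numbering scheme can be sketched as bit-packing. The 16-bit node-id width is an assumption for illustration, not something the scheme prescribes:

```python
# Globally unique, monotonically increasing proposal numbers using the
# <counter><node-id> layout: higher bits hold the counter, lower bits
# hold the node id. NODE_ID_BITS is an assumed width.
NODE_ID_BITS = 16

def make_proposal_number(counter: int, node_id: int) -> int:
    assert node_id < (1 << NODE_ID_BITS), "node id must fit in the low bits"
    return (counter << NODE_ID_BITS) | node_id

# Two nodes using the same counter still produce distinct numbers,
# and a higher counter always wins regardless of node id.
n1 = make_proposal_number(counter=7, node_id=1)
n2 = make_proposal_number(counter=7, node_id=2)
n3 = make_proposal_number(counter=8, node_id=1)
```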
Formalizing the algorithm
There are two important items to track: the proposal number itself and the value of the proposal. Please keep in mind as you read through that these are two different entities.
1. The proposer picks a proposal number n and broadcasts it to all nodes using the prepare method.
2. An acceptor looks at it and promises not to accept any lower proposal, i.e. one with a number less than n. If the acceptor has already seen a higher proposal number, it rejects this one.
3. Also, once an acceptor has accepted a proposal value, it will only accept that same value again, under a higher proposal number.
4. If a majority of acceptors responded, the proposer uses the value v of the highest-numbered proposal returned in the responses (or its own value, if no acceptor reported one). If a majority didn’t respond, the algorithm goes back to step 1.
5. The proposer now creates an accept message with the number n and the value v and sends it to the acceptors that responded in step 2.
6. Acceptors accept this proposal number and value if they haven’t already responded to a higher proposal number.
- The two-phase approach of “prepare and accept” ensures that the only value that can win is the one that gets both prepared and accepted. As long as some P can get both its prepare and its accept message onto a majority of nodes, basic Paxos will return that value.
- Stalemate: the issue with this approach is that it can still run into no-quorum scenarios. This happens if a sequence like the following, sent by N1 and N5, lands on node N3: N1:Prepare(1), N5:Prepare(2), N1:Accept(1), N1:Prepare(3), N5:Accept(2). Here N5:Prepare(2) overrides N1:Prepare(1) and cuts off N1:Accept(1). Before N5 can execute N5:Accept(2), N1 comes back with the higher-numbered proposal N1:Prepare(3) and cuts off N5:Accept(2).
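The acceptor-side rules above can be sketched as a small class. This is an illustrative simplification (class and field names are mine; the disk persistence mentioned earlier and real RPC plumbing are omitted):

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None  # value accepted with accepted_n

    def prepare(self, n):
        """Promise not to honor anything numbered below n; report any
        previously accepted (number, value) so the proposer can adopt it."""
        if n > self.promised_n:
            self.promised_n = n
            return (True, self.accepted_n, self.accepted_value)
        return (False, self.accepted_n, self.accepted_value)

    def accept(self, n, value):
        """Accept unless a higher-numbered prepare has been promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False
```

Replaying the message sequence from the stalemate bullet against one acceptor shows both accepts being rejected: prepare(2) cuts off accept(1), and prepare(3) cuts off accept(2).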
How to fix the Stalemate
The stalemate issue in the last section seems like a big deal, but it can be solved fairly easily. Nodes can wait for a random delay before executing the prepare call, which helps some proposer get both its prepare and accept calls in. Doing this for every message might be a bit much, though, so Lamport suggests the notion of a “distinguished proposer”. The set of nodes can elect a distinguished proposer using basic Paxos (and random delays) itself, make that the only node that proposes, and thus avoid the constant stalemates.
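The random-delay idea can be sketched as a retry loop. This is a hypothetical helper, not from the paper; try_round here stands in for one full prepare-then-accept attempt that reports whether it got through:

```python
import random
import time

def propose_with_backoff(try_round, max_delay=0.05):
    """Retry the full prepare+accept round, sleeping a random amount
    between attempts so duelling proposers eventually break symmetry."""
    while not try_round():
        time.sleep(random.uniform(0, max_delay))
    return True
```

With a distinguished proposer, only the elected node runs this loop, so the constant collisions disappear.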
Scaling the Basic Paxos out to produce a replicated log
Going back to the earlier example of the bank’s replicated log: all server nodes need to agree on the state of the log, where each index in the log represents a transaction. Some transactions can and will occur in parallel, so servers can accept multiple requests and commit them to different indices, as long as there is quorum for each. The key to ensuring the consistency of the bank accounts is that the state transitions are applied sequentially, one after another.
So focusing on the replicated log:
- Each node can run an instance of basic paxos corresponding to every index in the log.
- Each node can select multiple indices simultaneously and reach consensus on the log-value for that index.
- A leader election can be useful to reduce contention on proposed values.
- A leader can batch prepare calls for multiple log indices into a single prepare call and thus eliminate most of the prepare calls.
- A system of heartbeat messages might be needed to discover that a leader is not around any more and then someone else needs to become the leader and proposer. The system can’t progress while the leader is down.
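To make the “applied sequentially” point concrete, here is a small sketch, with made-up names, of feeding chosen log entries to the bank state machine strictly in index order, even when consensus completes out of order:

```python
# Hypothetical helper: basic Paxos may finish per index in any order,
# but the state machine only consumes a contiguous prefix of the log.
def apply_in_order(chosen, state, applied_up_to):
    """chosen: {index: (account, delta)}. Applies entries after
    applied_up_to until it hits a gap; returns the new high-water mark."""
    while applied_up_to + 1 in chosen:
        applied_up_to += 1
        account, delta = chosen[applied_up_to]
        state[account] = state.get(account, 0) + delta
    return applied_up_to

state = {}
chosen = {1: ("alice", +100), 3: ("alice", -30)}  # index 2 not chosen yet
up_to = apply_in_order(chosen, state, 0)          # stops at the gap: only index 1 applies
chosen[2] = ("bob", +50)                          # consensus for index 2 completes later
up_to = apply_in_order(chosen, state, up_to)      # now indices 2 and 3 apply
```

The gap at index 2 is exactly why a consensus instance per index is needed: index 3 can be chosen early, but its withdrawal must not touch the accounts until everything before it has been applied.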
I found John Ousterhout’s explanation of basic Paxos really useful. The figures in these notes are inspired by his presentation. He goes into further detail on Multi-Paxos and how configuration changes, i.e. the addition of servers, can affect Paxos, which I will not cover for now. In general, a distributed replicated log seems like a useful concept in many applications. It might be worth reading more about practical implementations of this and the issues they ran into.