Paxos consensus for beginners

Ziliang
May 14, 2020 · 8 min read

Since Leslie Lamport first described it in The Part-Time Parliament (written in 1989 and published in 1998), Paxos has been at the core of distributed consensus algorithms, and it is notoriously difficult to understand. This article aims to overcome that difficulty with simple explanations and examples.

Why do we need Paxos?

It all starts with wanting to make a single process fault-tolerant. Just as with preventing data loss, redundancy is the most direct way: replicate the process with identical copies (typically distributed over a cluster), so that when some replicas fail, the others can still provide the service.

An obvious issue is that we need all the members in the group to see the same sequence of client requests, so that no matter which process a client asks, the data returned is always consistent. Could we just let one process be the “leader”, i.e., responsible for receiving all the client requests and broadcasting them to the others? Well, this is possible, but consider the case where the leader crashes, a new leader is elected, and then the “old” leader resumes. There could be multiple leaders trying to convince the followers. And this is why we need Paxos — to reach a consensus on what the next client request in the sequence is.

Overview of Paxos

In brief, Paxos is an algorithm for choosing a single value among multiple ones. Here “choosing” means that all the members will see the same chosen value and that the chosen value was indeed requested by a client. The key idea is that a majority represents the whole: if more than half of the processes choose a value, that value is the consensus.

Paxos is so important that almost all later consensus algorithms are based on it, like Raft, ZAB (ZooKeeper Atomic Broadcast), Cheap Paxos, and Fast Paxos, and much industrial software is built on it, like Apache ZooKeeper (the foundation of Apache Hadoop & Kafka) and PhxPaxos (used by WeChat and its billion users).

Let’s see the assumptions under which Paxos works:

  • Messages passing through the network can be delayed, lost, reordered, or duplicated, but not corrupted;
  • Fail-stop model: processes may fail by stopping and may restart, but do not exhibit Byzantine faults; i.e., processes never behave maliciously.

These assumptions are practical, which is one of the reasons why Paxos is so popular despite its difficulty.

My construction of Paxos

Let’s focus on a group of processes and introduce some terms. For each client request (value), there are some processes called proposers that handle the request and tell some other group members about it (the act of informing them of this value is called propose). A unique number, called the proposal number, is attached to each value for distinction. There are also some processes that want to know which value is chosen, called learners. Finally, there are processes that vote to choose the consensus value, called acceptors (accept = vote for a value and record its proposal number). Note that a process can have multiple roles; i.e., it can be a proposer, an acceptor, and a learner at the same time. Though the numbers of acceptors and learners equal the group size in our consideration of process resilience (we want all processes to participate in the voting and learn the value), there are no hard constraints on the counts of these 3 roles; e.g., there can be 2 proposers, 5 acceptors, and 3 learners.

It’s obvious that there must be more than one acceptor; otherwise, voting cannot proceed when the only acceptor is down. So how many acceptors do we need? It depends on our definition of consensus. If we want “majority = consensus”, the number of acceptors should be odd and larger than one, so that a majority can always be reached and the group functions whenever a majority of processes are up (with an even number, a vote can split into half yes and half no).
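The majority arithmetic above can be made concrete with a tiny helper (the function name is mine, not part of Paxos):

```python
def quorum_size(n: int) -> int:
    """Smallest number of acceptors that forms a strict majority of n."""
    return n // 2 + 1

# With 5 acceptors, any 3 form a majority, so up to 2 crashes are tolerated.
# A 6th acceptor raises the quorum to 4 but still tolerates only 2 crashes,
# which is one reason odd group sizes are preferred.
print(quorum_size(5))  # 3
print(quorum_size(6))  # 4
```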

Now suppose the proposers receive the requests and propose them. How should the acceptors vote for a proposed value? They must vote for the first value they are informed of, because it may well be the only value ever proposed. But what happens if they then stick to that first value?

Example 1. No consensus when directly proposing, accepting, and sticking.

Here, each of the 3 processes S1, S2, S3 proposes a value (red/blue/green) by sending a message “accept?(value)” (also called a proposal), and each of their acceptors accepts it. Since they never change their minds, a consensus can never be reached. This means either: 1) acceptors must be allowed to accept multiple values (i.e., to change their vote later), or 2) there must be a negotiation phase before proposers propose their values, so that the proposers end up proposing the same value.

Let’s consider option 1): accepting multiple values. A new example dashes our hopes:

Example 2. Multiple values are chosen when acceptors can accept multiple values.

Here, S1 proposes the value “red” first and it is accepted by S1 and S2. Since there are only 3 processes in total, {S1, S2} forms a majority. But before the acceptors can report this success to the learners, S3 proposes the value “green” and also gets it through. Now there are two values accepted by a majority, and thus two values are chosen! (Remember, our goal is to pick only one value.)

So we have to consider the alternative 2): before proposing a value, each proposer must “negotiate” with the other proposers so that they end up proposing the same value, while acceptors stick to the first value they are informed of. Let’s call this phase the “prepare phase”; its primary goal is to find out whether some value has already been accepted. To do this, before proposing, each proposer sends a “prepare message” to the acceptors and waits until a majority of replies come back (a minority of replies may not reflect the consensus). If from this majority a proposer X learns that an acceptor Y has accepted a value V, then X knows that Y will never accept a value other than V, so X needs to change its proposed value; otherwise, something like Example 1 may happen again.
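The “wait for a majority of replies” step can be sketched as follows (a toy, synchronous sketch; the function name and reply shape are my own assumptions, and real networks deliver replies asynchronously):

```python
def gather_majority(replies, n_acceptors):
    """Collect prepare replies as they arrive; stop once a strict
    majority of the n_acceptors has answered."""
    majority = n_acceptors // 2 + 1
    collected = []
    for reply in replies:          # replies may arrive in any order
        collected.append(reply)
        if len(collected) >= majority:
            return collected       # enough information to decide
    return None                    # no majority reached: retry later

# A reply is the value an acceptor has accepted, or None if it has none.
print(gather_majority(iter([None, "red"]), 3))  # [None, 'red']
```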

So what value should X adopt? Note that from these replies, X may find different values accepted by different acceptors:

Example 3. A proposer receives different accepted values from a majority

Paxos simply decides that the accepted value with the largest proposal number is adopted (so proposal numbers must be totally ordered). But this brings a new problem, as the following Example 4 shows:
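This selection rule can be sketched directly (names and reply shapes are illustrative, not a fixed wire format):

```python
def choose_value(replies, my_value):
    """Pick the value to propose from a majority of prepare replies.
    Each reply is (accepted_proposal_number, accepted_value), or None
    if that acceptor has accepted nothing yet."""
    accepted = [r for r in replies if r is not None]
    if not accepted:
        return my_value                        # free to propose our own value
    # Adopt the accepted value carrying the largest proposal number.
    _, value = max(accepted, key=lambda r: r[0])
    return value

# An Example-3-style situation: acceptors report different accepted values.
print(choose_value([(2, "green"), (1, "blue"), None], "yellow"))  # green
```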

Example 4a. The history before that in Example 4b.
Example 4b. Different proposers see different adopted values and no consensus can be achieved.

Here, we still use colors to indicate values, but the numbers inside the parentheses indicate proposal numbers (they can be any totally ordered objects, not just natural numbers). Before any of “green”, “blue”, or “red” is chosen, S2 and S4 start to propose. Though both receive a majority of replies, S2 finds that it should adopt the value “green” since 2 > 1, while S4 finds that it should adopt the value “red” since 1 > 0. Then S2 and S4 get (green, 2) and (red, 1) accepted, respectively, and no consensus can be achieved.

So how do we deal with cases like Example 4? The problem is that the “negotiation” is inadequate: though a proposer can learn whether some value has been accepted, it cannot know whether other proposers are proposing different values at the same time. To deal with this, acceptors must record the prepare information from proposers and include it when replying to prepare messages.

To be more detailed, we would need the following extra steps:

  • Attach a proposal number to each prepare message;
  • Each acceptor maintains a variable called max_proposal_number that stores the largest proposal number seen so far; when the acceptor later receives a prepare or accept? message with a smaller proposal number, it replies with a rejection.

We say an acceptor A promises to the proposer P whose proposal number equals A’s max_proposal_number. Note that thanks to the extra prepare phase, acceptors now know how many values are being proposed and can therefore reject proposals (remember that at first we said an acceptor must accept the first value it is informed of, because it cannot know whether that is the only one).

With these modifications, the situation in Example 4a would not happen:

Example 5. Value “red” and “blue” will be rejected because the acceptors have promised to S1 with proposal number 2.

Finally, when a proposer hears from a majority of acceptors that its value is accepted, it knows its value is chosen and notifies all the learners (this step is optional; some say the acceptors should do this job instead, see Rutgers University’s lecture note and MIT’s lecture note).

Congratulations! Now we have the complete Paxos algorithm:

This graph is based on https://www.youtube.com/watch?v=JEpsBg0AO6o&t=2794s

Note that:

  • When a proposer fails, it retries until success (though typically it will learn that a value has already been accepted and adopt it). Because acceptors reply with max_proposal_number, a proposer knows its proposal was rejected and picks a larger proposal number for the next try;
  • Acceptors need to persist max_proposal_number, accepted_proposal, and accepted_value to disk so they can resume after crashes; learners need to persist the chosen value. Proposers don’t necessarily need to do this, because clients can resend requests.

Livelock

As you may sense from Example 5, chances are that multiple proposers fight for the acceptors’ promises and no progress is made, as in the following:

Example 6. Livelock due to contention.

A simple solution is to introduce a random delay before each retry. A more common approach is to elect a leader, which will be covered in my future stories about Paxos.
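The random-delay idea can be sketched as a retry loop (the function and its callback are my own illustrative names; a real round would run the full prepare/accept protocol):

```python
import random
import time

def propose_with_backoff(run_one_round, max_delay=0.05):
    """Retry full prepare/accept rounds, sleeping a random amount
    between attempts so competing proposers fall out of lockstep."""
    while True:
        if run_one_round():        # one full prepare + accept round
            return
        time.sleep(random.uniform(0, max_delay))  # random delay before retry

# Toy round that fails twice before succeeding on the third attempt:
attempts = []
propose_with_backoff(lambda: attempts.append(1) or len(attempts) >= 3,
                     max_delay=0.001)
print(len(attempts))  # 3
```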

Conclusion

We finally have an algorithm that chooses a single value among multiple ones and ensures the following two properties:

  1. Safety
    - Only a single value could be chosen
    - Only the chosen value would be learned
  2. Liveness
    - Some proposed value is eventually chosen
    - Learners would eventually learn the chosen value

These properties hold despite the fact that message passing can be unreliable and that roles (proposers/acceptors/learners) may stop and restart.

What’s Next

Thank you for reading. In the next article, I will show:

  • How Paxos handles crashes of proposers/acceptors/learners;
  • How could we use Paxos to reach consensus on a sequence of values (Multi-Paxos);
  • How to make a simple implementation of Multi-Paxos;
  • And more…

References

[1] Diego Ongaro, “Paxos lecture”, Stanford University, 2013.
[2] Leslie Lamport, “The Part-Time Parliament”, ACM Trans. Comput. Syst. 16 (1998): 133–169.
[3] Leslie Lamport, “Paxos Made Simple”, 2001.
[4] MIT 6.824, Spring 2015, Lab 3.
[5] M. van Steen and A. S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017, Chapter 8: Fault Tolerance.
[6] Rutgers University, CS 417 Paxos lecture note.

Distributed Knowledge
