Ang Li and Robin Qiu
Egalitarian Paxos (EPaxos) is a distributed consensus algorithm based on Paxos. It was invented by Iulian Moraru, David G. Andersen, and Michael Kaminsky from CMU and Intel Labs, and was first introduced in their SOSP’13 paper “There Is More Consensus in Egalitarian Parliaments”.
EPaxos achieves three goals:
(1) optimal commit latency in the wide area;
(2) uniform load balancing across all replicas, thus higher throughput;
(3) graceful performance degradation when some replicas become slow or crash.
EPaxos achieves these goals by ordering commands dynamically and in a decentralized fashion: when voting for a command in a command slot, each participant attaches ordering constraints to that command. Therefore, each replica can use these constraints to independently reach the same ordering.
II. Overview of Algorithm
EPaxos aims to break an algorithmic limitation shared by other Paxos-based consensus algorithms: the requirement that a single leader (or master) node coordinate operations. The paper claims that leader-based consensus algorithms have several disadvantages: 1) the leader node has to process many more messages than the other nodes; 2) in a geo-replicated environment, accessing a remote leader incurs high latency for clients; 3) such algorithms do not handle load spikes and network delays well; 4) the single-leader design harms availability, because leader election after a leader failure is costly.
EPaxos breaks this limitation by decentralizing the coordination process. A client can send a command to any replica, and that replica "leads" the commit process for the received command. EPaxos supports two ways of committing: the fast path for non-interfering commands, and the slow path for interfering ones. As the name implies, the fast path requires fewer network round trips and is hence faster.
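The per-instance state that the protocol maintains can be sketched as follows. This is a simplified illustration in Go; the type and field names are ours, not taken from the paper's pseudocode or our actual implementation:

```go
package main

import "fmt"

// State of a command instance as it moves through the protocol.
type State int

const (
	PreAccepted State = iota // voted on, fast path still possible
	Accepted                 // went through the slow (Paxos-Accept) path
	Committed
)

// Instance holds a command plus the ordering constraints that replicas
// attach while voting: a sequence number and a dependency list.
type Instance struct {
	State   State
	Command string
	Seq     int   // used to order commands at execution time
	Deps    []int // instances this command must be ordered after
}

// depsEqual reports whether two dependency lists are identical; if every
// Pre-Accept reply returned the same attributes, the command leader can
// commit on the fast path, otherwise it falls back to the slow path.
func depsEqual(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	leaderView := Instance{State: PreAccepted, Command: "put x=1", Seq: 1, Deps: []int{}}
	reply := []int{} // a reply that agrees with the leader's attributes
	fmt.Println(depsEqual(leaderView.Deps, reply))
}
```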
The evaluation in the paper focuses on comparing EPaxos with other Paxos-based algorithms such as Multi-Paxos, Generalized Paxos, and Mencius. However, we decided not to reproduce these results, since we do not have comparable implementations of those algorithms.
Instead, we decided to implement basic EPaxos in Go and compare its performance with the Go implementation of Raft from Princeton COS418. Our EPaxos implementation is based on the same underlying library as our Raft implementation. We also built a key-value store service upon EPaxos (called kvepaxos), which provides the same API as the Raft-based key-value store (called kvraft) from COS418. Therefore, we believe it is more meaningful to compare these two implementations.
We emulated a geo-replicated testing environment on a single machine (MacBook Air running macOS 10.12.4, 1.7GHz Intel Core i7, 8GB 1600MHz DDR3). Each server/client runs in a goroutine. As shown in the following figure, we have N = 5 replicas (A through E) located in two different areas. The round-trip time (RTT) between two replicas within the same area (e.g. A and B) is set to 20 ms; the RTT between two replicas in different areas (e.g. C and D) is set to 150 ms. There are also two clients co-located with each replica (10 in total), with no network RTT between a server and its co-located clients.
Both EPaxos and Raft are configured to tolerate F = 2 failures. The basic EPaxos algorithm described in the paper (the version that we implemented) uses a simplified failure-recovery procedure, and as a consequence its fast-path quorum size is 2F = 4. (The fully optimized EPaxos reduces this quorum to F + ⌊(F+1)/2⌋ = 3.) The slow-path quorum size is F + 1 = 3, which is also the quorum size of Raft.
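For N = 2F + 1 replicas, the quorum-size arithmetic above can be written out as a few helpers (a sketch; the function names are ours):

```go
package main

import "fmt"

// fastQuorumBasic is the fast-path quorum of the basic EPaxos variant
// (simplified recovery), as described in the paper: 2F.
func fastQuorumBasic(f int) int { return 2 * f }

// fastQuorumOpt is the fast-path quorum of fully optimized EPaxos:
// F + floor((F+1)/2). Go's integer division performs the floor.
func fastQuorumOpt(f int) int { return f + (f+1)/2 }

// slowQuorum is a simple majority, F + 1, which is also Raft's quorum.
func slowQuorum(f int) int { return f + 1 }

func main() {
	f := 2 // tolerate up to two failures, as in our experiments
	fmt.Println(fastQuorumBasic(f), fastQuorumOpt(f), slowQuorum(f)) // 4 3 3
}
```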
Our implementation of the basic EPaxos algorithm and the key-value store on top of it are about 1300 lines of Go code in total. The configuration and testing code are about 500 lines of Go code. You can find our source code at https://email@example.com/robinyqiu/cos518.git
We compared the commit latency of EPaxos and Raft. We measured end-to-end latency, that is, the time elapsed from when a client starts sending a Get/Put/Append RPC to a server until that client gets a successful reply. The median commit latencies of Raft and EPaxos are shown in the following figures.
In Raft, after a leader is elected, all client requests go through the leader. The commit latency of a request consists of two major components: (a) the RTT between the sending client and the leader, plus (b) the RTT between the leader and its two nearest replicas, needed to reach consensus. RTT(a) depends on the location of the client, while RTT(b) is the same for all clients. From the figure we can therefore tell that the leader is server C, because its co-located clients have the lowest latency (their RTT(a) is 0 ms). The two replicas closest to the leader are A and B, so RTT(b) is the intra-area RTT of Area 1 (20 ms). Clients of D and E see higher commit latency than those of A and B, because they are farther from the leader C (RTT(a) is 150 ms for D and E versus 20 ms for A and B).
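The two-component latency model above can be made concrete with a small calculation. This is a sketch with hypothetical helper names; it models only the emulated RTTs and ignores processing time:

```go
package main

import "fmt"

// area maps each replica to its region: A, B, C in Area 1; D, E in Area 2.
var area = map[string]int{"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}

// rttMs returns the emulated round-trip time in milliseconds.
func rttMs(x, y string) int {
	switch {
	case x == y:
		return 0
	case area[x] == area[y]:
		return 20
	default:
		return 150
	}
}

// raftLatencyMs models the commit latency seen by a client co-located
// with replica `client` when `leader` is the Raft leader:
// RTT(a) to reach the leader, plus RTT(b) for the leader to reach its
// two nearest peers (for leader C these are A and B, within Area 1).
func raftLatencyMs(client, leader string) int {
	rttA := rttMs(client, leader)
	rttB := 20 // leader C reaches A and B at intra-area RTT
	return rttA + rttB
}

func main() {
	for _, c := range []string{"A", "C", "D"} {
		fmt.Println(c, raftLatencyMs(c, "C"))
	}
}
```

With leader C this reproduces the pattern described above: C's clients see about 20 ms, A's and B's about 40 ms, and D's and E's about 170 ms.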
In contrast, there is no stable leader in EPaxos. RTT(a) of EPaxos is therefore the RTT between the sending client and an arbitrary replica, and RTT(b) is between the server handling the request and three other replicas (the fast-path quorum is 2F = 4 for basic EPaxos). In general, the commit latency of basic EPaxos is higher than Raft's in our experiment, because a server must communicate with at least one remote replica to reach consensus (an EPaxos quorum of size 4 must cover both Area 1 and Area 2). We believe that if the fully optimized version of EPaxos were used for this evaluation, the commit latency would be much lower. Since there is no leader in EPaxos and a client can send requests to any replica, the commit latency does not depend strongly on the location of the sending client.
In the paper, the authors showed that EPaxos has higher throughput than other Paxos-based algorithms. This is because EPaxos uses the thrifty-operation optimization (Section 6.2 in the paper) and thus processes fewer messages per command. Our basic implementation of EPaxos focuses mainly on correctness and understandability, so it does not include the thrifty optimization. In addition, we emulated the testing environment on a single machine: emulated clients and servers share the machine's CPU and memory, so we cannot measure real throughput. Therefore, we only evaluated the relative throughput of Raft and EPaxos. Our result shows that the throughput of Raft is 1.9 times that of EPaxos. We believe the fully optimized EPaxos would also achieve higher throughput.
3. Performance Degradation under Failure
When testing the performance of Raft and EPaxos under failure, we first let the programs run for 30 seconds and then shut down two servers at random. The programs keep running with the two failed servers for another 30 seconds; we then restart the failed servers and let everything run for another 60 seconds.
In the Raft throughput figure, we can clearly see that the throughput drops after two servers crash and goes back up after they are restarted. The drop occurs because, in our experiment, one or two servers in Area 1 are down, so the leader in Area 1 must contact at least one remote server in Area 2 to reach consensus. After the two servers come back up, the three servers in Area 1 can form a quorum again, so throughput recovers. We can also observe a slow downward trend in throughput. We think this is because our program creates more and more goroutines as it runs, slowly exhausting system resources.
The throughput of EPaxos remains largely stable throughout our evaluation. At the 30-second mark it drops suddenly, because shutting down the servers causes some clients to resend their requests. Throughput then quickly recovers and remains as stable as before. This is because, failure or no failure, reaching consensus always requires replicas in both Area 1 and Area 2 to participate.
- Deadlock Case
During implementation, we found that EPaxos may end up in a deadlock in some cases. It is possible, however, that the paper omitted parts of the algorithm that the authors consider common knowledge in the field (for example, "As in classic Paxos, every message contains a ballot number. For simplicity, we represent it explicitly in our pseudocode only when describing the Explicit Prepare phase in Figure 3." — Section 4.3.1 in the paper). A simple example of the deadlock is shown below:
In the graph above, blocks of the same color represent the same instance. Each block contains four fields: the first is the state of the instance ("PA" stands for the pre-accepted state, "A" for accepted, "C" for committed); the second is the command; the third is the sequence number; and the last is the dependency list.
As shown in the graph above, servers A and D start two different commands at about the same time. The Pre-Accept requests reach servers B and C in different orders, so server B believes command C2 depends on C1, while server C believes C1 depends on C2. In this case, servers A and D cannot commit their commands through the fast path, so each takes the union of the dependency lists from all replies, chooses the maximum sequence ID, and updates the attributes of its command instance accordingly. Servers A and D then send Accept requests to servers B and C, which accept both the updated C1 and the updated C2. The result is a dependency loop, equivalent to a deadlock.
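The slow-path attribute update described above (union of dependency lists, maximum sequence ID from the replies) can be sketched as follows; the function name is ours:

```go
package main

import (
	"fmt"
	"sort"
)

// mergeReplies performs the slow-path update: the command leader unions
// the dependency lists from all Pre-Accept replies and takes the
// maximum sequence ID it has seen among them.
func mergeReplies(seqs []int, deps [][]string) (int, []string) {
	maxSeq := 0
	set := map[string]bool{}
	for i, s := range seqs {
		if s > maxSeq {
			maxSeq = s
		}
		for _, d := range deps[i] {
			set[d] = true
		}
	}
	var merged []string
	for d := range set {
		merged = append(merged, d)
	}
	sort.Strings(merged) // deterministic order for display
	return maxSeq, merged
}

func main() {
	// Server A merges its replies for C1: B reported no deps,
	// C reported that C1 depends on C2.
	seq1, deps1 := mergeReplies([]int{1, 2}, [][]string{{}, {"C2"}})
	fmt.Println(seq1, deps1)
	// Server D does the symmetric merge for C2 and ends up with the
	// same sequence ID and a dependency on C1: a cycle the sequence
	// IDs can no longer break.
	seq2, deps2 := mergeReplies([]int{1, 2}, [][]string{{}, {"C1"}})
	fmt.Println(seq2, deps2)
}
```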
We think the paper omits part of the role of the sequence ID associated with each command. In the paper, the sequence ID is only used when executing committed commands. In our implementation of EPaxos, a server checks the sequence ID of a newly received Accept request and rejects the request if the sequence ID is less than or equal to the largest sequence ID the server has ever seen.
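The rejection rule can be sketched as a small accept handler; this is an illustration of the rule, not an excerpt from our actual code:

```go
package main

import "fmt"

// Replica tracks the largest sequence ID it has ever seen.
type Replica struct {
	maxSeq int
}

// HandleAccept applies the rule described above: an Accept request is
// rejected unless its sequence ID is strictly greater than the largest
// sequence ID this replica has seen so far.
func (r *Replica) HandleAccept(seq int) bool {
	if seq <= r.maxSeq {
		return false // reject: an equal or smaller seq could re-introduce a cycle
	}
	r.maxSeq = seq
	return true
}

func main() {
	r := &Replica{}
	fmt.Println(r.HandleAccept(2)) // first Accept with seq 2 succeeds
	fmt.Println(r.HandleAccept(2)) // a second Accept with equal seq is rejected
}
```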
This solution seems intuitive. However, it means we are actually running two classic Paxos protocols independently: the ballot number is used to reach consensus on the command, sequence ID, and dependency list of an instance, while the sequence ID is used to serialize all commands on the same object. The ballot number described in the paper cannot, by itself, solve the problem.
2. Algorithm Complexity
One major drawback of the algorithm is its complexity. From our understanding, EPaxos is more like a two-dimensional Paxos on top of an eventual-consistency-like layer.
First, each instance contains a dependency list. The motivation comes from the possibility that the order of two uncommitted commands has not yet been decided when a new command is received by some server in the consensus group. This is much like eventual consistency, where several values are allowed to coexist with different version IDs. The difference is that EPaxos resolves the conflict before acknowledging the commit to the client, while a true eventual-consistency model exposes the conflict to the user.
Second, EPaxos runs two orthogonal Paxos protocols for different purposes. In EPaxos, each server can become the leader of a command, and other commands may depend on a command before it is committed. Therefore, it is possible that a server does not know about some of the commands that a newer command depends on. In this case, EPaxos allows the server to take over the instance containing the unknown command and run the consensus protocol for it. The ballot number mechanism is used here to distinguish different proposals for one instance.
As mentioned in the previous section, the sequence ID of a command is used to serialize all commands. This is also a form of Paxos, in which each server, upon accepting a new command, promises not to accept any command with a sequence ID less than or equal to the largest it has seen.
We only implemented the basic EPaxos, and it is already much more complicated than Raft. We believe the fully optimized EPaxos will be even more complicated. In practice, complicated algorithms are hard to design, optimize, and maintain, making them less likely to be adopted in production.