EPaxos Reimplementation and Evaluation

Kaya Alpturer, Eric Xue

Princeton Systems Course
12 min read · May 19, 2024

--

Introduction

We implement Egalitarian Paxos (EPaxos), a distributed consensus algorithm based on Paxos, and compare its performance in transcontinental networks to its performance in networks that only span a single continent. Due to increased regulation on the storage and transportation of digital data, especially across borders, companies may choose to optimize their networks for domestic services rather than international ones in the future. Consequently, it is important to understand the performance of distributed algorithms in networks that may be as wide as a single nation or a single continent but not as wide as several. We point out that at least a priori, a distributed algorithm optimized for the latter setting may not obtain the best performance tradeoffs in the former setting. For instance, an international network might need to sacrifice throughput to achieve the same latency witnessed by a domestic network simply because the latter is more geographically concentrated.

EPaxos

Iulian Moraru, David G. Andersen, and Michael Kaminsky proposed EPaxos in their SOSP’13 paper “There Is More Consensus in Egalitarian Parliaments” in order to address the desire for a consensus algorithm with low wide-area latency and high throughput. At the time of publication, other Paxos-based protocols such as Multi-Paxos, Mencius, and Generalized Paxos were widely adopted for the purpose of consensus. A shared feature of these protocols is that to achieve high throughput, they employ a stable leader through which all client requests and protocol messages must pass. As the authors point out, the use of a stable leader comes with several drawbacks. Because a disproportionate amount of the load is placed on the stable leader, protocols that employ a stable leader are inherently harder to scale, being limited by the computational power and network connectivity of the leader. Moreover, because all communication must pass through the leader, these protocols may witness high latency when the leader is geographically far from the rest of the network or from the clients that are sending requests.

EPaxos addresses these concerns by eliminating the stable leader. Instead, under EPaxos, a client may submit a request to any replica (e.g., the one with the lowest latency), and the replica that receives the request becomes the command leader for that request. The command leader should not be confused with the stable leader, which is a designated replica in other Paxos-based protocols that receives every request. Moreover, in EPaxos, replicas reach consensus on a set of ordering constraints, or dependencies (conflicting requests), for each request, as opposed to the order in which requests should be processed itself. These constraints guarantee that any feasible ordering of the requests places all non-concurrent requests in the same order. Thus, replicas can use these constraints to independently derive the same ordering of (non-concurrent) requests.

Under EPaxos, there are two paths via which a request can be committed. If the command leader and F + ⌊(F+1)/2⌋ − 1 other replicas (where F is the number of tolerable failures) agree on the dependencies of a request after one round of communication, then the request (and its dependencies) can be immediately committed. This is called the fast path. If not, then another round of communication between the command leader and F other replicas (together, a majority) is needed to ensure that every replica commits the request with the same dependencies. This is called the slow path. Note that the slow path can be seen as running the classic Paxos algorithm to agree on the dependencies of the request. We also point out that when there are 3 (F = 1) or 5 (F = 2) replicas, a quorum of F + ⌊(F+1)/2⌋ replicas is just a majority, so the fast path quorum is no larger than the slow path quorum.
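
To make the quorum arithmetic concrete, here is a small sketch (in Rust, like the rest of our implementation, but not code from our codebase) that computes the two quorum sizes as a function of F:

```rust
// Quorum sizes for EPaxos with n = 2F + 1 replicas (sketch only).
fn fast_quorum(f: usize) -> usize {
    // Optimized fast path: F + floor((F + 1) / 2) replicas, including the command leader.
    f + (f + 1) / 2
}

fn slow_quorum(f: usize) -> usize {
    // Slow path: a simple majority, i.e., the command leader plus F other replicas.
    f + 1
}

fn main() {
    for f in 1..=3 {
        println!(
            "n = {}: fast path quorum = {}, slow path quorum = {}",
            2 * f + 1,
            fast_quorum(f),
            slow_quorum(f)
        );
    }
}
```

For F = 1 and F = 2 the two sizes coincide, which is exactly the observation made above.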

Because any replica can become the command leader for a request, the load of consensus is spread out across all replicas, which should increase each replica’s throughput. Moreover, when the frequency of conflicts is low, most requests will be committed via the fast path after only one round of communication. This, combined with the fact that the fast path quorum is of optimal size, means EPaxos should experience lower latency compared to other Paxos-based algorithms.

Protocol

We describe the EPaxos protocol in greater detail. We first give the unoptimized version of EPaxos for which a fast path quorum of size 2F (out of 2F + 1) is necessary. Later, we describe a modification that reduces the size of the fast path quorum to F + ⌊(F+1)/2⌋.

For each request, EPaxos proceeds in three phases, with the second phase taken only if consensus was not reached in the first phase.

In Phase 1, the command leader sends to all replicas PreAccept messages containing

  1. The received request
  2. Any dependencies that the command leader is aware of
  3. A sequence number that is used later for “tie-breaking” purposes when executing committed requests

The sequence number is taken to be greater than the largest sequence number among the dependencies that the command leader is aware of.
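
As a rough illustration, a PreAccept payload and the command leader's sequence number rule might look as follows; the field names and types here are simplified placeholders rather than the exact ones used in our implementation (which we describe later):

```rust
use std::collections::HashSet;

// Simplified sketch of a PreAccept payload; our actual message type is
// described in the Implementation section and differs in its details.
struct PreAccept {
    request: String,             // the client request
    deps: HashSet<(usize, u64)>, // dependencies, identified by (command leader id, instance number)
    seq: u64,                    // sequence number used for tie-breaking at execution time
}

// The command leader picks a sequence number greater than the largest sequence
// number among the dependencies it is currently aware of.
fn choose_seq(known_dep_seqs: &[u64]) -> u64 {
    known_dep_seqs.iter().copied().max().unwrap_or(0) + 1
}
```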

Upon receiving a PreAccept message from the command leader, a replica replies with a PreAcceptOK message containing

  1. The request
  2. Any dependencies that the replica is aware of (including those communicated by the command leader)
  3. A possibly updated sequence number

The sequence number is updated only if the sequence number in the received PreAccept message is not strictly greater than the largest sequence number among the conflicting requests that the replica is aware of.
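
A hedged sketch of this replica-side update rule (again with simplified types; local_conflicts stands in for whatever bookkeeping a replica uses to track the conflicting instances it is aware of and their sequence numbers):

```rust
use std::collections::{HashMap, HashSet};

type Instance = (usize, u64); // (command leader id, instance number)

// Compute the (dependencies, sequence number) that a replica returns in its
// PreAcceptOK reply, given the attributes proposed by the command leader.
fn handle_preaccept(
    local_conflicts: &HashMap<Instance, u64>, // conflicting instances this replica knows of, with their sequence numbers
    leader_deps: &HashSet<Instance>,
    leader_seq: u64,
) -> (HashSet<Instance>, u64) {
    // Union the leader's dependencies with the conflicts this replica knows of.
    let mut deps = leader_deps.clone();
    deps.extend(local_conflicts.keys().copied());

    // Keep the leader's sequence number only if it is strictly greater than
    // every conflicting sequence number we know of; otherwise bump it.
    let max_local = local_conflicts.values().copied().max().unwrap_or(0);
    let seq = if leader_seq > max_local { leader_seq } else { max_local + 1 };

    (deps, seq)
}
```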

Upon receiving PreAcceptOK messages for the request from the other replicas, the command leader checks whether 2F − 1 of them contain the same sequence number and dependencies (together with the command leader itself, these replicas form the fast path quorum of size 2F). If so, then the protocol enters the Commit Phase, and the request has been committed via the fast path. Otherwise, the protocol enters Phase 2.
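
In code, the end-of-Phase-1 decision amounts to counting matching replies. The following is a minimal sketch, simplified in that it requires all collected replies to agree and assumes their attributes have been gathered into a vector:

```rust
use std::collections::HashSet;

type Instance = (usize, u64); // (command leader id, instance number)

enum Path {
    Fast,
    Slow,
}

// Decide between the fast and slow path from the (sequence number, dependencies)
// attributes carried by the PreAcceptOK replies of the other replicas. In the
// unoptimized protocol the command leader needs 2F - 1 matching replies, which
// together with the leader itself form the fast path quorum of 2F replicas.
fn choose_path(replies: &[(u64, HashSet<Instance>)], f: usize) -> Path {
    let needed = 2 * f - 1;
    let all_agree = replies.windows(2).all(|pair| pair[0] == pair[1]);
    if replies.len() >= needed && all_agree {
        Path::Fast
    } else {
        Path::Slow
    }
}
```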

In the Commit Phase, the command leader commits the request with the agreed upon sequence number and dependencies and tells the other replicas to do so as well via Commit messages containing the request and the agreed upon sequence number and dependencies. Upon receiving a Commit message, each replica will commit the request with the communicated sequence number and dependencies.

In Phase 2, the command leader sends to all replicas Accept messages containing

  1. The request
  2. The highest sequence number the command leader knows of (including those communicated by the other replicas)
  3. Every dependence the command leader knows of (including those communicated by the other replicas)

Upon receiving an Accept message, each replica will reply with an AcceptOK message containing the communicated information. Upon receiving AcceptOK messages from a majority of the replicas (including itself), the command leader will enter the Commit Phase, and the request has been committed via the slow path.
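
The slow-path commit condition is simply a majority count that includes the command leader itself; a minimal sketch:

```rust
// With 2F + 1 replicas, the command leader may commit on the slow path once it
// and F other replicas (a majority) have acknowledged the Accept.
fn slow_path_committed(accept_oks_including_self: usize, f: usize) -> bool {
    accept_oks_including_self >= f + 1
}
```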

We highlight that when there are 3 (F = 1) replicas, every request is committed via the fast path. Thus, we expect the latency of EPaxos with 3 replicas to be quite low.

To reduce the size of the fast path quorum to F + ⌊(F+1)/2⌋, at the end of Phase 1, in addition to checking whether or not the sequence numbers and dependencies communicated by the other replicas coincide with each other, the command leader must also check whether or not each dependence it knows of (including those communicated by the other replicas) has been committed by some replica in the fast path quorum. If these two conditions are met, then the protocol enters the Commit Phase. Otherwise, the protocol enters Phase 2.

The additional fast path condition ensures that the sequence number of every dependence is finalized and will not change in the future. Note that the additional information required to check whether or not the condition is satisfied can be communicated to the command leader by each replica with at most one bit per communicated dependence.
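
In a sketch, the extra check reduces to inspecting those bits; here we assume the command leader has already OR-ed together the committed bits reported for each dependence:

```rust
use std::collections::HashMap;

type Instance = (usize, u64); // (command leader id, instance number)

// Optimized fast path condition: every dependence the command leader knows of
// must have been reported as committed by at least one replica in the fast
// path quorum.
fn all_deps_committed(deps_with_committed_bit: &HashMap<Instance, bool>) -> bool {
    deps_with_committed_bit.values().all(|&committed| committed)
}
```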

Note that the additional fast path condition in the optimized protocol is unnecessary when there are only 3 (F = 1) replicas since in this case, the size of the fast path quorum is the same in the optimized and unoptimized versions of EPaxos, and the unoptimized version of EPaxos already ensures safety.

To execute a committed request, each replica locally does the following:

  1. Build a dependency graph in which there is a directed edge from each request to each of its dependencies
  2. Iterate over the strongly connected components of the dependency graph in inverse topological order, and for each strongly connected component, execute its unexecuted requests in increasing order of their sequence numbers (see the sketch below)
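
Since we use petgraph for this step, the core of the execution loop can be sketched as follows. This is a simplified illustration rather than our exact code: apply stands in for applying a request to the key-value store, and the bookkeeping for already-executed requests is elided. petgraph's tarjan_scc returns the strongly connected components in reverse topological order, which, with edges pointing from each request to its dependencies, means dependencies come first.

```rust
use petgraph::algo::tarjan_scc;
use petgraph::graph::DiGraph;

// Nodes carry (sequence number, request); edges point from each request to its
// dependencies.
fn execute_in_order(graph: &DiGraph<(u64, String), ()>, apply: &mut dyn FnMut(&str)) {
    // The strongly connected components come back in reverse topological
    // order, i.e., dependencies before the requests that depend on them.
    for mut scc in tarjan_scc(graph) {
        // Within a component, execute requests in increasing sequence number.
        scc.sort_by_key(|&n| graph[n].0);
        for n in scc {
            apply(graph[n].1.as_str());
        }
    }
}
```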

Implementation

We implement EPaxos in Rust v1.78.0 as part of a replicated key-value store. Clients can request to read the value stored at a particular key and to write a value to a particular key. We use smol to support concurrent operations and Rust’s mutual-exclusion primitive Mutex to protect shared state. We implement messages as a custom data type consisting of

  1. A message type (e.g., PreAccept, PreAcceptOK, Commit)
  2. A client request
  3. A sequence number
  4. A dictionary that maps each dependence to a bit indicating whether it has been committed or not
  5. A tuple consisting of the command leader’s ID and the instance number.

We use serde to serialize and deserialize this custom data type into and from strings, which are then sent between clients and replicas via TCP. We use petgraph to support computation on graphs (for the execution algorithm). The source code for our implementation and experiments can be found here: https://github.com/kalpturer/cos518-2024-project.
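
To give a sense of its shape, here is an illustrative sketch of such a message type; the field names and exact types are placeholders rather than the definitions in our repository, and the sketch assumes serde with the derive feature enabled:

```rust
use std::collections::HashMap;

use serde::{Deserialize, Serialize};

// Illustrative sketch only; the actual definitions in our repository differ in
// their details.
#[derive(Serialize, Deserialize, Clone, Debug)]
enum MessageType {
    PreAccept,
    PreAcceptOK,
    Accept,
    AcceptOK,
    Commit,
}

#[derive(Serialize, Deserialize, Clone, Debug)]
struct Message {
    kind: MessageType,
    request: (String, String),         // e.g., (key, value) for a write
    seq: u64,                          // sequence number
    deps: HashMap<(usize, u64), bool>, // dependence -> committed bit
    instance: (usize, u64),            // (command leader id, instance number)
}
```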

cos518-2024-project/src/network/client.rs implements a client that sends read and write requests.

cos518-2024-project/src/network/replica.rs implements a replica that runs EPaxos to maintain the replicated key-value store.

cos518-2024-project/src/main.rs configures and initializes each client and replica according to the command line arguments supplied by the user.

Evaluation

As in the original SOSP’13 paper, we evaluate the wide-area latency and throughput of EPaxos in a replicated key-value store application with three and five replicas. In our application, every client request is a write. We deployed our replicas on Amazon EC2 t2.micro instances.

In our domestic experiments with three replicas, the replicas span 3 Amazon EC2 data centers in California, Ohio, and Virginia. When there are five replicas, our replicas additionally span 2 more Amazon EC2 data centers in Oregon and Canada.

In our international experiments with three replicas, the replicas span 3 Amazon EC2 data centers in Virginia, Tokyo, and London. When there are five replicas, our replicas additionally span 2 more Amazon EC2 data centers in Sydney and Brazil.

In each of our experiments, each replica has a single co-located client that sends write requests at a specified conflict rate and measures the execution latency of each request. We evaluate at 0%, 2%, and 100% conflict rates. Between write requests, each client sleeps for a specified amount of time. We tune this sleep time so that the network is always saturated with client requests (but not so saturated that the replicas cannot keep up). In other words, the latency we report is for a max-throughput workload. For our three-replica experiments, the sleep time was 25 ms for conflict rates 0% and 2%, and 100 ms for the 100% conflict rate. For our five-replica experiments, the sleep time was 40 ms for conflict rates 0% and 2%, and 150–200 ms for the 100% conflict rate. For each of our experiments, we report the median latency (ms) at each client and the average throughput (number of executed requests per second) per replica at the various conflict rates. We also plot the latency distribution at each client and the average throughput at the specified conflict rates.
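
As an illustration of how such a workload can be generated, here is a simplified sketch of a client loop; the key-naming scheme and the deterministic conflict pattern are assumptions for illustration, not our exact client logic:

```rust
use std::thread;
use std::time::Duration;

// Sketch of a client loop. With a given conflict percentage, a request targets
// a key shared by all clients (guaranteeing a conflict) and otherwise targets
// a key unique to this client; the client sleeps between requests to control
// the offered load.
fn client_loop(client_id: usize, conflict_pct: u64, sleep_ms: u64, num_requests: u64) {
    for i in 0..num_requests {
        let key = if i % 100 < conflict_pct {
            "shared-key".to_string()
        } else {
            format!("client-{}-key-{}", client_id, i)
        };
        // A real client would send a write request for `key` to its co-located
        // replica here and record the time until the request is executed.
        println!("client {} writes to {}", client_id, key);
        thread::sleep(Duration::from_millis(sleep_ms));
    }
}

fn main() {
    // Example: 2% conflict rate, 25 ms between requests, 10 requests.
    client_loop(0, 2, 25, 10);
}
```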

Median latency (ms) at each client in our domestic experiments with three replicas at 0%, 2%, and 100% conflict rates
Median latency (ms) at each client in our international experiments with three replicas at 0%, 2%, and 100% conflict rates
Median latency (ms) at each client in our domestic experiments with five replicas at 0%, 2%, and 100% conflict rates
Median latency (ms) at each client in our international experiments with five replicas at 0%, 2%, and 100% conflict rates
Average throughput per replica (reqs/sec) in our domestic and international experiments at 0%, 2%, and 100% conflict rates

We make several qualitative observations. We see that for each number of replicas and each conflict rate, latency is higher when the replicas span several continents than when they span only a single nation or continent, while throughput remains comparable. The lower latency in our domestic experiments is due to the fact that the replicas are more geographically concentrated, so the nearest replicas in a majority quorum are significantly closer to the command leader. The comparable throughput suggests that latency was the bottleneck of our system, i.e., requests could only be executed at the rate at which messages could be sent.

We see that clients that are more centrally located with respect to the rest of the network have lower latency, ostensibly because latency is limited by the farthest replica in a majority quorum.

We see that at each number of replicas, throughput is comparable at the 0% and 2% conflict rates. However, throughput drops dramatically at the 100% conflict rate for both three and five replicas. If we only looked at the five-replica experiments, we might be tempted to explain this drop by noting that with five replicas and many conflicting requests, EPaxos most likely takes the slow path for many of them, i.e., many requests require an additional round of communication. This likely explains the decrease in throughput at the 100% conflict rate in the international setting when we move from three to five replicas (since an additional round trip in the international setting dramatically increases latency), but it cannot explain the drastic drop in throughput at the 100% conflict rate itself: the three-replica experiments exhibit the same trend, yet with three replicas no request ever requires an additional round trip, regardless of the number of conflicts. Instead, we believe the drop in throughput when we move from a 2% to a 100% conflict rate is due to the execution algorithm: as the number of conflicts increases, so does the time it takes to build the dependency graph, sort its strongly connected components in inverse topological order, and execute the requests. In other words, at some conflict rate between 2% and 100%, the execution algorithm replaces latency as the bottleneck.

The fact that latency is the bottleneck in both of our domestic experiments offers an explanation for why throughput increases when we move from three to five replicas (the exception being the 100% conflict rate, which we already addressed in the previous paragraph). Notice that as the number of replicas increases, the median latency actually decreases. This is likely because the nearest replicas in a majority quorum get geographically closer as we add more replicas within a confined area.

To explain the increase in throughput when we move from three to five replicas in our international experiments, we appeal to the latency distributions. Notice that when there are five replicas, the latency of the top three replicas is more concentrated, so even though the median latencies are higher when there are five replicas, it is possible that more requests are being serviced at more reasonable latencies when there are five replicas than when there are three.

Latency (ms) distribution for our domestic experiments with three replicas at 0%, 2%, and 100% conflict rates
Latency (ms) distribution for our international experiments with three replicas at 0%, 2%, and 100% conflict rates
Latency (ms) distribution for our domestic experiments with five replicas at 0%, 2%, and 100% conflict rates
Latency (ms) distribution for our international experiments with five replicas at 0%, 2%, and 100% conflict rates

Conclusion

We implement EPaxos in a replicated key-value store application using Rust. We evaluate the latency and throughput of our implementation in two wide-area settings: domestic and international. We find that latency is the main bottleneck of our implementation, suggesting that consensus algorithms optimized for very wide-area applications may perform well in more geographically concentrated applications.
