Reproducing EPaxos with Docker

Wenziluofu
Princeton Systems Course
6 min read · May 14, 2019

Jinzheng Tu and Yu Zeng

Why EPaxos?

The Paxos Parliament had to function even though legislators continually wandered in and out of the parliamentary Chamber. — Lamport

Consistency protocols have long been an important research topic, even before the era of big data. The most widely used protocols are Paxos and its variant, Multi-Paxos, but both have performance drawbacks. Paxos suffers from high latency and low throughput because every command takes two round trips. Multi-Paxos solves these issues by having a single leader serve all commands, but as a natural consequence availability becomes a problem.

EPaxos aims to solve the latency, throughput, and availability problems all at once, and it works quite differently from the two protocols above. In both Paxos and Multi-Paxos, all requests are placed in a single sequence whose order is agreed upon by all replicas. In EPaxos there is one such sequence per server, so the number of sequences equals the number of servers, as shown in the following figure.

That is because every server is the leader for the requests it receives. This turns out to benefit every aspect of performance. For latency and throughput, unlike Paxos, the servers do not contend to become the leader, which saves a large amount of time. Also, as proved in the paper, when the system is in good condition the leader can choose a fast quorum, which takes only one round trip. Moreover, multiple servers can be leaders at the same time, which balances the workload across servers and improves availability compared with Multi-Paxos.

But nothing comes for free. Although the two-dimensional log improves performance, it makes the commands harder to execute, and this is a natural consequence of the design. Commands are placed in multiple arrays, and their positions in those arrays no longer tell us which commands come before others. Therefore, every command must be stored together with information about its dependencies, and during execution a command cannot run until all of its dependencies have been executed.
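As a rough illustration (with hypothetical names rather than our actual code, and ignoring the cycle breaking that real EPaxos execution performs via strongly connected components), a committed command carries its dependency list and is only applied once every dependency has been applied:

```go
package main

import "fmt"

// Instance identifies one slot in one replica's command array: the
// replica that led the command and the command's position there.
type Instance struct {
	Replica int
	Slot    int
}

// Command is a committed command plus the instances it depends on;
// it must not execute before all of them have executed.
type Command struct {
	Key, Value string
	Deps       []Instance
}

// executed records which instances have already been applied.
var executed = map[Instance]bool{}

// tryExecute applies cmd at instance inst if every dependency has
// already been applied; otherwise it reports that we have to wait.
func tryExecute(inst Instance, cmd Command, store map[string]string) bool {
	for _, d := range cmd.Deps {
		if !executed[d] {
			return false // a dependency is still pending
		}
	}
	store[cmd.Key] = cmd.Value
	executed[inst] = true
	return true
}

func main() {
	store := map[string]string{}
	a := Instance{Replica: 0, Slot: 1}
	b := Instance{Replica: 1, Slot: 1}

	cmdA := Command{Key: "x", Value: "1"}
	cmdB := Command{Key: "x", Value: "2", Deps: []Instance{a}} // b depends on a

	fmt.Println(tryExecute(b, cmdB, store)) // false: a has not executed yet
	fmt.Println(tryExecute(a, cmdA, store)) // true
	fmt.Println(tryExecute(b, cmdB, store)) // true, now that a is done
}
```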

How did we implement it?

We followed the classical software engineering procedure, slightly simplified, to implement EPaxos: requirements analysis, system design, and system deployment.

Requirement analysis

We start by identifying use cases, depicted graphically below: the EPaxos subsystem shall provide PUT and GET semantics while also offering a probe operation (for potential debugging purposes).

UML use case diagram

What’s inside a PUT request?

A PUT request is first issued by a client to an arbitrary server (denoted the leader). The leader then sends a PreAcceptMsg to the other servers (denoted auditors). Each auditor responds to this message with a PreAcceptOKMsg. When the leader has received enough such replies, it decides whether the request qualifies for the fast path (where the Accept phase can be skipped) or must take the slow path. On the slow path, the leader sends an AcceptMsg to the auditors and waits for their responses. Once the leader has received enough AcceptOKMsgs (or the Accept phase is skipped), it writes the command to disk (makes it persistent) and broadcasts a CommitMsg to the auditors, which, upon receipt, also persist the command on their disks. The whole process can be described (mostly) formally using the Petri net below.

Petri Net model
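Here is a minimal sketch of the leader's fast-path check, with names of our own choosing rather than the actual types in our code: the Accept phase can be skipped only when a fast quorum of PreAcceptOK replies agrees with the leader's own dependency list.

```go
package main

import "fmt"

// PreAcceptOK is an auditor's reply to PreAcceptMsg, carrying the
// dependency list that auditor computed for the command.
type PreAcceptOK struct {
	From int
	Deps []int
}

// sameDeps reports whether two dependency lists are identical
// (an order-insensitive comparison is omitted for brevity).
func sameDeps(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// canTakeFastPath returns true when enough replies agree with the
// leader's own dependencies, so the leader may skip the Accept phase
// and broadcast CommitMsg right away; otherwise it must run Accept.
func canTakeFastPath(leaderDeps []int, replies []PreAcceptOK, fastQuorum int) bool {
	agree := 1 // count the leader itself
	for _, r := range replies {
		if sameDeps(leaderDeps, r.Deps) {
			agree++
		}
	}
	return agree >= fastQuorum
}

func main() {
	leaderDeps := []int{3, 7}
	replies := []PreAcceptOK{
		{From: 1, Deps: []int{3, 7}},
		{From: 2, Deps: []int{3, 7}},
		{From: 3, Deps: []int{3, 7, 9}}, // saw an extra dependency: disagrees
	}
	// The fast-quorum size depends on the EPaxos variant and cluster
	// size; 4 is used here purely as an example value.
	fmt.Println(canTakeFastPath(leaderDeps, replies, 4))
}
```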

What’s inside an EPaxos server?

A server must serve as both a leader and an auditor. It also needs to send and receive consistency messages from other servers, and it should persist all committed commands by writing them to disk. We separate these responsibilities into different components inside an EPaxos server, with their dependencies shown below in a UML component diagram.

UML component diagram
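To make the component split concrete, here is a hypothetical skeleton of how a single server process might wire these pieces together; the names are illustrative, not taken from our actual code.

```go
// Package epaxos sketches the internal wiring of one server.
package epaxos

// Persister writes committed commands to disk so they survive crashes.
type Persister interface {
	Persist(cmd []byte) error
}

// Transport sends and receives consistency messages between servers.
type Transport interface {
	Send(to int, msg []byte) error
	Recv() (from int, msg []byte, err error)
}

// Command is a placeholder for the committed command type; see the
// dependency-tracking sketch earlier in the post.
type Command struct {
	Key, Value string
}

// Server bundles the roles one EPaxos node plays: it leads the
// commands its own clients submit, audits commands led by its peers,
// exchanges messages over the transport, and persists commits.
type Server struct {
	id        int
	peers     []int
	transport Transport
	persister Persister
	// One command array per replica, indexed by leader id, then slot.
	cmds [][]Command
}
```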

So, how is the code organized?

The whole program (auxiliary scripts excluded) is written in well-formatted Go. We extracted the code shared by the server and the client into a separate package, common. The Go packages are organized as follows.

UML package diagram
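For example, a request type defined once in common can be imported by both the client and the server binaries; the sketch below is illustrative rather than our actual definitions.

```go
// Package common holds the types shared by the client and the server,
// such as the wire format of a PUT request (fields are illustrative).
package common

// PutArgs is what a client sends to whichever server it chooses as
// the leader for this command.
type PutArgs struct {
	Key   string
	Value string
}

// PutReply is returned to the client once the command has committed.
type PutReply struct {
	OK bool
}
```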

And how do we deploy the whole system?

Due to cost and time constraints, we tested the system on a single machine. We built a single Docker image containing the compiled server binary (the artifact) and launched multiple containers from it on the testing machine.

The compiled client binary is placed directly on the machine. Pumba is used to emulate network chaos.

UML deployment diagram

How is the performance?

We focus mainly on latency and throughput. Our experiments use five servers, divided into two groups, with servers in the same group located near each other. The round-trip network latency between servers in the same group is 0.5 ms, and 150 ms across groups. The 0.5 ms intra-group latency is simply a technical artifact: we use Pumba to inject latency into the network, and Pumba only allows us to add one latency value per server. The layout of the five servers is shown below.

Layout of the five servers.

The first result is the commit delay observed by five clients. Each client communicates with its corresponding server and measures the commit latency of its requests. The results are shown in the graph below.

Delay for five servers

We can see that servers 0 and 1 have similar latencies, much higher than the other servers'. That is because they are in the minority group, so they must communicate with at least one server in the other group, which adds a large latency. The gap between the latencies is about 150 ms, which confirms our intuition. The latency of the other servers comes mainly from writing to disk.

The second experiment is about throughput.

Throughput overtime

The graph shows the throughput measured at server 4. At time 50 s we turned off server 2, which is in the same group as server 4. The throughput does not drop noticeably, which shows that our system is robust against server failures. The variance of the throughput becomes larger, because server 4 now has to communicate with servers in the other group. At time 76 s, server 2 comes back and the throughput recovers.

In the next experiment we turned off three servers at once: servers 1, 2, and 3. Because a majority of the servers are down, the system stops making progress and the throughput drops to 0. When server 1 comes back at time 80 s, the system starts working again, as expected.

Our system also keeps working under various kinds of network disturbance, such as packet loss, network delays of up to 1 s, randomly killed servers, and 0.1% packet duplication.
