XFT: Reimplementing XPaxos in Go

Suriya Kodeswaran
Princeton Systems Course
8 min read · May 14, 2019

Suriya Kodeswaran and James Heppenstall

“XFT: Practical Fault Tolerance beyond Crashes” [2016] introduces a new system model, Cross Fault-Tolerance (XFT), and a corresponding consensus protocol, XPaxos. XPaxos builds upon its namesake, Paxos [1998], which offers a solution to the State Machine Replication (SMR) problem. In Paxos, replicas exchange a series of RPCs before consensus is reached. Paxos is a Crash Fault-Tolerant (CFT) consensus protocol in that it provides availability guarantees if and only if a majority of the replicas in a cluster have not crashed. Paxos, however, has a major limitation (aside from being notoriously difficult to comprehend): it does not tolerate Byzantine, or arbitrary, faults.

Byzantine faults are a relatively common occurrence in the real world, arising from various sources including malicious adversaries, faulty hardware, and software bugs. Generally, they can be thought of as non-crash faults: communication with a replica may remain intact, but messages originating from that replica are corrupted in some way. Consensus protocols that are robust to Byzantine faults are said to be Byzantine Fault-Tolerant (BFT) and are universally more complex than their CFT counterparts because they require additional RPCs, cryptographic primitives, and redundancy to tolerate arbitrary failures. As a consequence, for a fixed cluster size, BFT protocols lose availability after fewer faults than CFT protocols. We lean on the seminal work “Practical Byzantine Fault Tolerance” (PBFT) [1999] when discussing this category of protocols.

In the face of only trivial crash faults, XPaxos aims to run as efficiently as Paxos and maintain the same availability guarantees. When non-crash faults are introduced, XPaxos strives to outperform PBFT. The key idea here is that CFT is too optimistic while BFT is too pessimistic. In order to leverage the benefits of both models, XFT introduces some important restrictions and assumptions:

  1. Clients and replicas can suffer from Byzantine faults
  2. All replicas share reliable bi-directional communication channels
  3. The network is eventually synchronous

These are strong assumptions to make, especially the last, making XFT unsuitable for most use cases. That said, there are niche settings where XFT is useful, such as SMR within data centers, where network operators have fine-grained control over the network. Before proceeding, however, let’s dive into the last assumption. Eventual synchrony assumes that clients and replicas can reliably communicate within some parameterized time frame Δ. In the case of XPaxos, we assume that there always exists a majority of replicas (t+1 of the n=2t+1 replicas) that can engage in synchronous communication. We call this majority the synchronous group.

Let’s now compare the availability guarantees of our different SMR protocols. Here, n represents the number of replicas in the system, and the values in the cells represent how many faults of each kind can be tolerated before availability dissolves. As you can see, XPaxos succeeds in being more available than Paxos and PBFT. That said, this holds only if the assumptions and restrictions listed above are satisfied.

Figure 1: Maximum number of faults before availability guarantees dissolve for the SMR protocols under evaluation
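In case the figure does not render, the comparison (per the original XFT paper, for a cluster of n replicas) can be summarized as follows:

  Paxos (CFT): 0 non-crash faults, ⌊(n−1)/2⌋ crash faults
  PBFT (BFT): ⌊(n−1)/3⌋ faults in total, whether crash or non-crash
  XPaxos (XFT): any mix of crash and non-crash faults totaling at most ⌊(n−1)/2⌋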

Next, let’s dive into the specifics of XPaxos. One important thing to note is that PBFT and Paxos both follow message patterns similar to XPaxos’, allowing us to theorize effectively about how the protocols compare, particularly with regard to memory usage and runtime. We consider the SMR and view change protocols of XPaxos.

SMR Protocol

XPaxos’ SMR protocol relies on four RPCs: Replicate, Prepare, Commit, and Reply. At a high level, the system cycles through a sequence of views, each of which uniquely defines a leader and a synchronous group. A client sends a request via a Replicate RPC to the leader of the current view. The leader logs said request and then broadcasts Prepare RPCs to all nodes in its synchronous group. These nodes also log the request and broadcast Commit RPCs to all nodes in the synchronous group. Upon receiving Commit RPCs from all nodes within the synchronous group, a node sends a Reply RPC to the client. The client confirms that its request is committed upon receiving Reply RPCs from a majority of nodes within the cluster.
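To make the message flow concrete, here is a minimal sketch of the leader’s side in Go, shaped like a net/rpc handler (our network package is built on net/rpc). The type and field names are illustrative, not necessarily those in our xpaxos package, and the transport is elided.

```go
package xpaxos

import "sync"

type ReplicateArgs struct {
	ClientID  int
	Timestamp int    // client-local sequence number
	Request   []byte // opaque command to replicate
}

type ReplicateReply struct{ OK bool }

type PrepareArgs struct {
	View    int // current view number
	SeqNum  int // log slot assigned by the leader
	Request []byte
	// The real protocol also carries digests/signatures here.
}

type XPaxos struct {
	mu               sync.Mutex
	view             int
	log              [][]byte
	synchronousGroup []int // replica IDs in the current synchronous group
}

// Replicate is the leader's handler for a client request: log it, then
// broadcast Prepare RPCs to every node in the synchronous group. Each of
// those nodes logs the request and broadcasts Commit RPCs in turn.
func (xp *XPaxos) Replicate(args ReplicateArgs, reply *ReplicateReply) error {
	xp.mu.Lock()
	seq := len(xp.log)
	xp.log = append(xp.log, args.Request)
	view := xp.view
	group := xp.synchronousGroup
	xp.mu.Unlock()

	prepare := PrepareArgs{View: view, SeqNum: seq, Request: args.Request}
	for _, id := range group {
		go xp.sendPrepare(id, prepare)
	}
	reply.OK = true
	return nil
}

func (xp *XPaxos) sendPrepare(id int, args PrepareArgs) {
	// Transport elided: in our implementation this goes through the dummy
	// network package built on net/rpc.
}
```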

View Change Protocol

XPaxos’ view change protocol again relies on four RPCs: Suspect, View Change, VC Final, and New View. We refer readers to the original paper for a detailed description of the view change protocol. That said, it is worth noting that the view change protocol is initiated when a replica within the synchronous group becomes suspicious of another replica. This may occur if an RPC times out (remember we are assuming an eventually synchronous network model) or if cryptographic verification of any RPC fails. In such an event, the view change protocol proceeds to update the view number and consolidate a new synchronous group of strictly non-faulty replicas.
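At its core, the timeout half of this suspicion logic is a race between the Commit phase and the clock. The sketch below is ours, with hypothetical names; the verification-failure path would feed the same channel immediately rather than after Δ.

```go
package xpaxos

import "time"

// viewChangeTrigger is an illustrative sketch of the suspicion logic that
// kicks off Suspect -> View Change -> VC Final -> New View. A replica
// suspects the view when an RPC times out (the network exceeded Δ) or when
// a message fails cryptographic verification.
type viewChangeTrigger struct {
	delta   time.Duration // the tuned synchrony bound Δ
	suspect chan int      // view numbers to suspect
}

// awaitCommit waits for the Commit phase of a slot to finish. done is
// closed by the RPC handlers once all Commit RPCs for the slot arrive.
func (t *viewChangeTrigger) awaitCommit(view int, done <-chan struct{}) {
	select {
	case <-done:
		// Commit completed within Δ; nothing to do.
	case <-time.After(t.delta):
		t.suspect <- view // broadcast Suspect and move to the next view
	}
}
```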

Implementation

We reimplemented the SMR and view change protocols of XPaxos, as described in the paper, in Golang. Specifically, we designed an xpaxos package that contains all necessary client and replica code. Due to the eventually synchronous network model in XFT and the need to tune a time frame Δ, we also designed a network package (built on Golang’s net/rpc package) that defines a “dummy” network of clients and replicas. This gives us fine-grained control over all aspects of the network, in particular Δ, for testing and benchmarking. You can find our code on GitHub.
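The core idea of the network package can be sketched in a few lines. The names and structure below are a simplification, not the actual package API: every message is delayed by a bounded random amount, and the test harness can “crash” a replica by silently dropping its messages.

```go
package network

import (
	"math/rand"
	"time"
)

type Message struct {
	From, To int
	Payload  []byte
}

// Network simulates a cluster's message fabric so that tests can control
// (and violate) the synchrony bound Δ.
type Network struct {
	maxDelay time.Duration        // upper bound on simulated one-way delay; must be > 0
	crashed  map[int]bool         // replicas the test harness has "crashed"
	inbox    map[int]chan Message // per-replica message queues
}

// Send delivers msg to its destination after a random delay in [0, maxDelay),
// unless the destination has been crashed by the test harness.
func (n *Network) Send(msg Message) {
	if n.crashed[msg.To] {
		return // drop messages to crashed replicas
	}
	go func() {
		time.Sleep(time.Duration(rand.Int63n(int64(n.maxDelay))))
		n.inbox[msg.To] <- msg
	}()
}
```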

We verified the correctness of both the CFT and BFT properties of XPaxos using Golang’s testing package. In other words, we designed a number of scenarios to stress test the SMR and view change protocols. Our stress tests show that a time frame between Δ=50ms and Δ=100ms is optimal. For more information, check out our readme and src/xpaxos/test_test.go in the Git repository (feel free to fork it and run these tests yourself!). We then ran evaluation benchmarks on different XPaxos clusters running on a local computer. Although it would have been nice to collect results on AWS or Azure instances, the need for fine-grained control over the network meant that this was out of the question. Our key evaluation metrics include throughput vs. latency and fault recovery times.
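For flavor, here is a runnable test in the spirit of our stress tests (the real ones live in src/xpaxos/test_test.go), written in the same package as the hypothetical Network sketch above. It checks that the harness actually enforces the configured Δ.

```go
package network

import (
	"testing"
	"time"
)

// TestNetworkRespectsDelta verifies that the dummy network never delays a
// message beyond the configured bound, i.e. that Δ is enforced by the harness.
func TestNetworkRespectsDelta(t *testing.T) {
	delta := 50 * time.Millisecond
	n := &Network{
		maxDelay: delta,
		crashed:  map[int]bool{},
		inbox:    map[int]chan Message{1: make(chan Message, 1)},
	}

	start := time.Now()
	n.Send(Message{From: 0, To: 1, Payload: []byte("ping")})
	<-n.inbox[1]

	// Allow a little slack for goroutine scheduling.
	if elapsed := time.Since(start); elapsed > delta+10*time.Millisecond {
		t.Fatalf("message delayed %v, want at most ~%v", elapsed, delta)
	}
}
```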

Test Setup: 2.7 GHz Intel Core i7 processor, 16GB 2133 MHz LPDDR3, Δ=50ms, t=1,2,5, maximum of 4 concurrent Go processes

For evaluation purposes, we modified an open-source implementation of Paxos and reimplemented a heavily abridged version of PBFT. These allow us to compare the performance of XPaxos against Paxos and PBFT under the SMR protocol (i.e., when no faults occur). We therefore designed additional paxos and pbft packages and again used Golang’s testing package to run stress tests and evaluation benchmarks.
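Our benchmarks follow the standard shape that Golang’s testing package prescribes. The sketch below again builds on the hypothetical Network type from earlier; each iteration pushes one 1kB request, mirroring our evaluation workload, and requests per second then follows from b.N and the elapsed time that `go test -bench` reports.

```go
package network

import (
	"testing"
	"time"
)

// BenchmarkOneKBMessage measures how fast 1kB payloads move through the
// dummy network; our real benchmarks drive full XPaxos/Paxos/PBFT clusters
// the same way.
func BenchmarkOneKBMessage(b *testing.B) {
	n := &Network{
		maxDelay: time.Millisecond,
		crashed:  map[int]bool{},
		inbox:    map[int]chan Message{1: make(chan Message, 1)},
	}
	payload := make([]byte, 1024) // 1kB requests, as in our evaluation

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		n.Send(Message{From: 0, To: 1, Payload: payload})
		<-n.inbox[1]
	}
}
```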

Evaluation

We begin by comparing the performance of XPaxos against Paxos and PBFT. We consider clusters of each protocol that tolerate up to t=1, 2, and 5 crash faults. Recall that Paxos and XPaxos require n=2t+1 replicas to provide such guarantees, while PBFT requires n=3t+1. We measure throughput as the number of 1kB requests committed per second and latency as the average time (in ms) each protocol takes to commit a single request. Every protocol eventually reaches a saturation point, visible as the vertical asymptote at the right of each data line, where throughput is maxed out. As expected, both Paxos and XPaxos far outperform PBFT in all three cases, as they require significantly fewer RPCs and fewer responses from replicas before the algorithms proceed to the next round. We hoped XPaxos would perform as well as Paxos, as shown in the original paper. That said, our implementation of XPaxos performs quite a bit worse than Paxos in all three cases. We attribute this to the lack of optimizations in our implementation, which the authors of the original paper claim significantly improve performance. We will detail these in the next section.

Figure 2: Throughput vs. Latency (No Faults)

Next, we measure the resilience of XPaxos to different levels of faults. The following graphs all tell the same story: regardless of cluster size, XPaxos identifies crash faults and quickly initiates a view change. During the view change protocol, the system suffers a brief period where it is down and cannot accept client requests, but it quickly reconciles the view and returns to effectively the same throughput as before the crash occurred.

Figure 3: Time vs. Throughput (Single Crash Fault)

We finally measure the fault recovery times of different XPaxos clusters under three scenarios: 1 crash fault, 2 crash faults, and 1 Byzantine fault. The latter involves randomly flipping bits in messages before passing them to the RPC handler. In the case of 1 crash fault, we were surprised to find that the cluster resilient to t=5 faults recovered more quickly than the comparatively smaller clusters resilient to t=1 and t=2 faults, respectively. This, however, is likely due to our implementation of the view change protocol, in which each view uniquely defines a synchronous group as a random permutation of replicas produced by a pseudo-random number generator seeded with the view number. As such, it is possible that the larger cluster was able to find a valid synchronous group faster than the two smaller ones. As expected, 2 crash faults take about twice as long to recover from as a single fault. Finally, 1 Byzantine fault had, on average, a faster recovery time than 1 crash fault. This is likely because crash faults are only detected after a timeout period, whereas non-crash faults (in our tests at least) can be detected immediately by the cryptographic verification step in each RPC handler.
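Concretely, deriving the synchronous group from the view number looks roughly like the following. The function name is illustrative, and the group size of t+1 (a majority when n=2t+1) is our reading of the protocol; the key property is that every replica seeds the PRNG identically and therefore derives the same group without any communication.

```go
package xpaxos

import "math/rand"

// synchronousGroup returns the replica IDs in the synchronous group for a
// given view: the first t+1 entries of a permutation generated from a PRNG
// seeded with the view number. n is the cluster size.
func synchronousGroup(view int64, n, t int) []int {
	rng := rand.New(rand.NewSource(view)) // identical seed on every replica
	perm := rng.Perm(n)                   // deterministic permutation of replica IDs
	return perm[:t+1]                     // first t+1 entries form the group
}
```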

Figure 4: Fault Recovery Times

Further Work

The authors of XPaxos mention many quality-of-life optimizations that can be added to compete with Paxos’ runtime. Among these are:

  1. Fault Detection Protocol: Introduces a new RPC to determine if too many replicas have failed, implying the system is no longer available.
  2. Checkpointing and Garbage Collection: Integral to any consensus protocol, this will reduce the memory each replica needs to maintain.
  3. Lazy Replication: Why replicate today what you could have replicated yesterday?
  4. Batching/Pipelining: Mentioned as a significant improvement by the authors of the original paper, this allows multiple messages to be signed at once, reducing the number of expensive cryptographic signatures (see the sketch below).
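To give a sense of why batching helps, here is a minimal sketch of the idea (it is not part of our current implementation): hash the whole batch once and produce a single signature over the digest, amortizing one expensive signing operation across many requests.

```go
package xpaxos

import (
	"crypto/ed25519"
	"crypto/sha256"
)

// signBatch hashes the concatenated requests and produces one signature for
// the entire batch. A real implementation would length-prefix each request
// before hashing to avoid ambiguity between different batch boundaries.
func signBatch(priv ed25519.PrivateKey, batch [][]byte) []byte {
	h := sha256.New()
	for _, req := range batch {
		h.Write(req)
	}
	return ed25519.Sign(priv, h.Sum(nil))
}
```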

Conclusion

In summary, we reimplemented the SMR and view change protocols of XPaxos. While the protocols themselves do not appear particularly complicated, we quickly discovered that it is difficult to take a protocol written on paper and replicate it in a working system. In particular, we ran into a litany of issues common to any distributed system, such as deadlocks, livelocks, and concurrent memory accesses. Having overcome all of this, we believe that we verified the correctness of XPaxos and showed that, in certain network environments, it deserves a fair chance as the consensus protocol of choice. Finally, we found it refreshing to think about the SMR problem outside the confines of the traditional CFT/BFT camps.
