Implementing Practical Byzantine Fault Tolerance
A project for COS 518 at Princeton University, Spring 2018
Overview
For our graduate class on Computer Systems (COS 518) at Princeton University, we (Felix Madutsa, Andrew Spencer, and Avthar Sewrathan) were tasked with implementing a real-world system for our final class project. We chose to re-implement the Practical Byzantine Fault Tolerance (PBFT) consensus algorithm from the 1999 paper of the same name by Miguel Castro and Turing Award winner Barbara Liskov. We implemented PBFT in Golang as a library and tested it on a simple key-value store running on AWS EC2 servers. The code can be found at this public repo.
Motive
The motivation for this project is twofold. Our team initially wanted to do a project related to the systems behind blockchains and cryptocurrencies, and in particular to replicate the sharding and scalability results found by Luu, Narayanan et al. in the paper A Secure Sharding Protocol For Open Blockchains. It turned out that that work was being developed into Zilliqa, a new cryptocurrency. However, the authors mentioned in their paper that they had trouble finding an open source PBFT library to use, and we wanted to attempt a solution to that problem. This project therefore re-implements the PBFT algorithm first introduced by Castro and Liskov in their 1999 paper, Practical Byzantine Fault Tolerance.
What is Practical BFT?
At a high level, a practical Byzantine fault tolerant consensus algorithm is a consensus algorithm that can withstand Byzantine faults. The notion of a Byzantine fault dates back to the Byzantine Generals Problem and generally refers to the same actor giving two different commands to different parties. In the context of computer systems, it refers to a server appearing as both functioning and failed to different observers. To read more about Byzantine fault tolerance, see this helpful article.
As mentioned above, the seminal paper on practical Byzantine fault tolerant consensus is by Castro and Liskov. In the paper they describe how their algorithm replicates state across many machines while tolerating Byzantine faults. Their algorithm is practical because it functions under the assumption of asynchronous network communication, where messages may be dropped, duplicated, delayed, or delivered out of order. Another aspect of practical BFT is the set of optimizations Castro and Liskov make to ensure good performance. Their PBFT algorithm can withstand F node failures, assuming there are 3F+1 nodes in the network and that nodes fail independently. Lastly, it uses cryptographic signatures to verify the identity of nodes, so that nodes do not accept messages from malicious actors.
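To make the fault-tolerance bound concrete, here is a small sketch (our own illustration, not code from our library) of the quorum arithmetic that PBFT relies on:

```go
package main

import "fmt"

// quorums returns the cluster and quorum sizes implied by the PBFT bound
// N = 3F + 1. This is illustrative arithmetic, not part of our library's API.
func quorums(f int) (n, prepareCommitQuorum, replyQuorum int) {
	n = 3*f + 1                   // total replicas needed to tolerate f Byzantine faults
	prepareCommitQuorum = 2*f + 1 // matching prepare/commit messages required
	replyQuorum = f + 1           // matching client replies required
	return
}

func main() {
	for f := 1; f <= 3; f++ {
		n, q, r := quorums(f)
		fmt.Printf("f=%d: N=%d, prepare/commit quorum=%d, client reply quorum=%d\n", f, n, q, r)
	}
}
```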
PBFT System Design
The design of our system differs slightly from that presented in the paper on which this work is based. Specifically, as shown in Figure 1, unlike in the original design, where the consensus layer is responsible for executing commands and keeping the entire application state, our design introduces a separate layer that keeps state and executes commands. The consensus layer is responsible only for agreeing on the ordering of commands. This allows the server layer to be swapped out easily for different applications.
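As a rough illustration of this decoupling, the consensus layer can be thought of as exposing a small interface to the application layer. The Go sketch below is hypothetical and simplified; our library's actual API differs in its details.

```go
package pbftsketch

// Application is what the server layer implements: it executes commands
// once the consensus layer has agreed on their order.
// (Hypothetical interface for illustration only.)
type Application interface {
	// Execute applies a committed command to application state and
	// returns the reply that should be sent back to the client.
	Execute(command []byte) (reply []byte)
}

// CommittedCommand is a command whose position in the total order has
// been agreed on by the replicas.
type CommittedCommand struct {
	Seq     uint64 // sequence number assigned by the primary
	Command []byte
}

// Consensus is what the PBFT layer exposes: it orders commands but never
// executes them itself.
type Consensus interface {
	// Submit hands a client command to the consensus layer.
	Submit(command []byte)
	// Committed delivers commands in their agreed order.
	Committed() <-chan CommittedCommand
}
```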
All communication between servers and clients uses messages signed with the private key of the sender and verified by the receiver using the sender's public key. The consensus protocol has three phases, pre-prepare, prepare, and commit, as shown in Figure 1. The client sends a signed request to the server it believes to be the primary. If that server is not the primary, it forwards the request to the primary, which assigns the request a unique sequence number. The primary creates a pre-prepare message and sends it to the replicas to start the protocol that atomically multicasts the request to all replicas. Upon receiving a valid pre-prepare message, each replica creates a prepare message that it sends to all the other replicas, including the primary. After receiving 2F + 1 valid prepare messages, including its own, a replica marks the request as prepared and sends a commit message to all other replicas. After receiving 2F + 1 valid commit messages, including its own, each replica marks the request as committed, executes the request on its server, and sends the reply back to the client. The client waits for F + 1 agreeing replies before accepting the result. Since commands are executed atomically and buffered in the order in which they are received by the primary, the algorithm behaves like a single centralized server, and requests are executed one at a time.
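The sketch below shows, in simplified form, how a replica might count prepare and commit messages before marking a request prepared and then committed. The types, names, and logic are illustrative only; they omit the signatures, request digests, validity checks, and view changes that the real protocol requires.

```go
package pbftsketch

// Simplified phase messages exchanged between replicas. Real messages are
// signed and carry a digest of the request rather than the request itself.
type PrePrepare struct {
	View, Seq uint64
	Request   []byte
}

type Prepare struct {
	View, Seq uint64
	ReplicaID int
}

type Commit struct {
	View, Seq uint64
	ReplicaID int
}

// requestState tracks one request's progress through the three phases.
type requestState struct {
	prepares  map[int]bool // replica IDs we have valid Prepare messages from
	commits   map[int]bool // replica IDs we have valid Commit messages from
	prepared  bool
	committed bool
}

func newRequestState() *requestState {
	return &requestState{
		prepares: make(map[int]bool),
		commits:  make(map[int]bool),
	}
}

// onPrepare is called for each valid Prepare (including the replica's own).
// With F faults tolerated, 2F+1 matching prepares mark the request prepared,
// at which point the replica multicasts its Commit message.
func (s *requestState) onPrepare(from, f int) (justPrepared bool) {
	s.prepares[from] = true
	if !s.prepared && len(s.prepares) >= 2*f+1 {
		s.prepared = true
		return true
	}
	return false
}

// onCommit is called for each valid Commit (including the replica's own).
// 2F+1 matching commits mark the request committed; it can then be handed to
// the application layer for execution in sequence-number order.
func (s *requestState) onCommit(from, f int) (justCommitted bool) {
	s.commits[from] = true
	if !s.committed && len(s.commits) >= 2*f+1 {
		s.committed = true
		return true
	}
	return false
}
```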
Leader election: To maintain liveness, the system moves through a series of views. For a system with N total servers, the primary at view v is the server whose ID equals v mod N. View changes are used to choose a new leader when replicas suspect that the primary has failed. For a detailed description of how this is done, please refer to the original paper.
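For illustration, the view-to-primary mapping amounts to a one-line computation (the function name below is ours, not the library's API):

```go
package main

import "fmt"

// primaryForView returns the ID of the primary for view v in a cluster of
// n servers, following the rule primary = v mod N.
func primaryForView(v uint64, n int) int {
	return int(v % uint64(n))
}

func main() {
	const n = 4 // e.g. 4 servers tolerating 1 Byzantine fault
	for v := uint64(0); v < 6; v++ {
		fmt.Printf("view %d -> primary %d\n", v, primaryForView(v, n))
	}
}
```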
Garbage collection: In addition to the state holding the actual application data, the system keeps a log of the commands that have been executed. This log needs to be trimmed once its commands are guaranteed to have been executed, and the system uses checkpointing to remove unnecessary log entries. Checkpointing is implemented in the same way as presented in the original paper, so please refer to it for a detailed description.
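As a rough sketch of the log trimming that checkpointing enables, the hypothetical function below drops entries covered by a stable checkpoint; it deliberately ignores the signed checkpoint messages that replicas exchange to make a checkpoint stable.

```go
package pbftsketch

// truncateLog drops log entries whose sequence number is at or below the
// latest stable checkpoint. Simplified illustration: in the real protocol a
// checkpoint only becomes stable after 2F+1 replicas have produced matching,
// signed checkpoint messages for it.
func truncateLog(log map[uint64][]byte, stableCheckpoint uint64) {
	for seq := range log {
		if seq <= stableCheckpoint {
			delete(log, seq)
		}
	}
}
```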
Our implementation
We implemented the PBFT library in around 1500 lines of Golang code and used external libraries for cryptography and hashing.
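For context, the sketch below shows one way message signing and verification can be done in Go, using the standard library's crypto/ed25519 package. It illustrates the general approach rather than the exact libraries or key scheme our implementation uses.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

func main() {
	// Each node holds a private key; its peers hold the corresponding public key.
	pub, priv, err := ed25519.GenerateKey(nil) // nil means crypto/rand is used
	if err != nil {
		panic(err)
	}

	// A protocol message (illustrative contents only).
	msg := []byte("PRE-PREPARE view=0 seq=42 digest=...")

	// The sender signs the message with its private key.
	sig := ed25519.Sign(priv, msg)

	// The receiver verifies the signature with the sender's public key
	// before accepting the message.
	fmt.Println("signature valid:", ed25519.Verify(pub, msg, sig))
}
```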
The table above summarizes how our implementation differs from the one outlined by Castro and Liskov.
While both implementations use the three-phase consensus protocol, ours differs in several ways. First, as outlined above, the paper has the BFT layer execute commands, whereas we decouple the BFT server more strongly from the server above it: commands are executed only by the server on top, and the BFT layer is used purely for consensus. Second, the paper uses two transport layers for the prepare message, UDP and UDP with multicast, whereas we use only one, TCP, due to the Golang RPC library and for simplicity's sake. The paper buffers requests within a sequence-number range, whereas we do not implement buffering, again for simplicity. We added persistent storage so that the logs and other state stored on the servers can be recovered after a crash. Lastly, we test our system using a simple key-value store, compared to the more complex Byzantine File System the paper uses for evaluation, since we operated under tighter time constraints.
We also fix the following parameters: checkpointing is done every 100 commands, and for hashing we map requests to a UInt64 value, since the paper does not specify a hashing algorithm. As mentioned before, we did not implement any of the optimizations detailed in the paper, save for clearing the log.
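For illustration only, a request digest could be reduced to a uint64 with Go's standard hash/fnv package; the exact hash function used in our code may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// digest64 reduces a request to a 64-bit value. FNV-1a from the standard
// library is an illustrative choice; all that is needed is a deterministic
// mapping from requests to uint64 digests.
func digest64(request []byte) uint64 {
	h := fnv.New64a()
	h.Write(request)
	return h.Sum64()
}

func main() {
	fmt.Printf("%#x\n", digest64([]byte("SET key=foo value=bar")))
}
```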
Results and Evaluation
In order to evaluate our system, we set up 5 AWS EC2 instances: 4 servers and 1 client. To test whether the system worked and to evaluate it, we compared the latency of our system against consensus using the Raft algorithm, as well as against no replication. We evaluated the system over 3 trials, running commands and collecting data over 5 minutes. Our results are listed in the table below:
Note that we cannot directly compare our results to the paper's, since all of their evaluation was done on a Byzantine file system rather than the simple key-value store we used, our implementation uses TCP while theirs used UDP, and we did not implement any of the optimizations mentioned in the paper. However, the results show a similar trend: PBFT overhead decreases as the total work a command requires on the servers increases, because the consensus overhead becomes a smaller fraction of the total time it takes to execute the command.
Conclusion
We gained a new appreciation for implementing complex systems through doing this project. Initially, our goals were perhaps overly ambitious, and we underestimated the difficulty of getting a PBFT algorithm to work correctly, but such is the case for most one-semester university research projects. In the future, we plan to test and evaluate the code more thoroughly and to open source our PBFT library, perhaps with interfaces in different languages. Lastly, our sincere gratitude goes out to Professor Michael J. Freedman and our TA Andrew Or for a fantastic class, as well as to our fellow classmates, from whom we have learned much this semester.
References
Castro, M. and Liskov, B. Practical Byzantine Fault Tolerance. OSDI, 1999.
About the Authors
Felix Madutsa is a computer science major at Princeton from Zimbabwe. He loves systems programming, philosophy and startups. [Twitter: @felimadu]
Andrew Spencer is a computer science major at Princeton. He loves futuristic technology and programming.
Avthar Sewrathan is a computer science major from South Africa. He loves startups, health and technology. [Twitter: @avthars Website: www.avthar.com]