Your Journey To Consensus (Part 3) — Byzantine Fault Tolerance and PBFT
This is the third part of our series Your Journey To Consensus. In this article, we cover the Byzantine Fault Tolerance problem and a practical solution to it: Practical Byzantine Fault Tolerance (PBFT).
Previous articles:
What is Byzantine Fault Tolerance
Byzantine Fault Tolerance (BFT) appeared in the early 1980s with the paper The Byzantine Generals Problem by Lamport, Shostak, and Pease. It is in this paper that a formal definition of the problem appeared, along with some solutions.
The BFT problem is stated in the following way:
We imagine that several divisions of the Byzantine army are camped outside an enemy city, each division commanded by its own general. The generals can communicate with one another only by messenger. After observing the enemy, they must decide upon a common plan of action. However, some of the generals may be traitors, trying to prevent the loyal generals from reaching agreement. The generals must have an algorithm to guarantee that
- All loyal generals decide upon the same plan of action.
- A small number of traitors cannot cause the loyal generals to adopt a bad plan.
The following image shows the problem in the case of one commander and two lieutenants:
The paper concludes that the generals cannot reach agreement unless the number of generals is strictly greater than three times the number of traitors. So, if we denote the number of traitors by f and the number of generals by n, we get the following famous statement:
The problem cannot be solved unless n ≥ 3f + 1
For more details, check the original paper. It is well written and gives you insight into how to come up with a solution to such a problem.
In this article, we will focus on a variation, actually the first practical formulation of the Byzantine Fault Tolerance problem, called Practical Byzantine Fault Tolerance. If you have read the previous articles, you will notice similarities in the mechanisms, since this algorithm also follows a leader-based approach, in a certain way, to solve the BFT problem.
Practical Byzantine Fault Tolerance
The Practical Byzantine Fault Tolerance (PBFT) algorithm can be used to implement any deterministic replicated service with a state and some operations. In this service, clients issue requests to invoke operations and block waiting for a reply. These operations are not limited to reads and writes; they can be arbitrary deterministic computations.
System properties:
- The replicated service is implemented by n replicas; the algorithm tolerates at most f faulty replicas as long as n ≥ 3f + 1 (a small sizing sketch follows this list).
- Clients and replicas are considered non-faulty as long as they follow the algorithm.
- The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker.
- Replicas must be deterministic and must start from the same state.
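To make these numbers concrete, here is a minimal Python sketch (not from the paper; the function name is my own) that, given the number of faults f to tolerate, computes the replica count and the quorum sizes that will appear later in the protocol:

```python
# Minimal sketch (not from the PBFT paper): sizing a deployment for a
# given number of tolerated Byzantine faults f.

def pbft_sizes(f: int) -> dict:
    """Return the replica count and quorum sizes needed to tolerate f faults."""
    return {
        "faulty_replicas_tolerated": f,
        "min_replicas": 3 * f + 1,            # n >= 3f + 1
        "prepare_commit_quorum": 2 * f + 1,   # matching messages needed to commit
        "client_reply_quorum": f + 1,         # matching replies the client waits for
    }

if __name__ == "__main__":
    for f in range(1, 4):
        print(pbft_sizes(f))
```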
How the algorithm works:
PBFT is a form of state machine replication where the service is modeled as a state machine that is replicated across different nodes in a distributed system. Each state machine replica maintains the service state and implements the service operations.
One of the main mechanisms in this system is the view change. A view change is carried out when it appears that the primary has failed, so that another replica can take over its role as primary.
So the algorithm works as follows:
- A client sends a request to invoke a service operation to the primary.
- The primary multicasts the request to the backups.
- Replicas execute the request and send a reply to the client.
- The client waits for f+1 replies from different replicas with the same result; since at most f replicas are assumed to be faulty, at least one of these matching replies comes from a non-faulty replica (a sketch of the message types involved follows this list).
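The steps above already mention the main message types of the protocol. As a rough illustration only (the field names are simplified shorthand; the real messages also carry signatures and request digests as defined in the paper), they could be modeled like this:

```python
# Illustrative shapes of the PBFT messages (simplified; signatures omitted).
from dataclasses import dataclass

@dataclass
class Request:
    operation: str   # the operation the client wants to invoke
    timestamp: int   # used for exactly-once semantics
    client_id: str

@dataclass
class PrePrepare:
    view: int        # current view number
    seq: int         # sequence number assigned by the primary
    digest: str      # digest of the client request

@dataclass
class Prepare:
    view: int
    seq: int
    digest: str
    replica_id: int

@dataclass
class Commit:
    view: int
    seq: int
    digest: str
    replica_id: int

@dataclass
class Reply:
    view: int
    timestamp: int
    client_id: str
    replica_id: int
    result: str
```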
So for the client, the algorithm works as follows:
- Client C sends a request message to the primary.
- The request carries a timestamp t, which is used to ensure exactly-once semantics for the execution of client requests.
- Each reply contains the current view number, which lets the client track the view and know which replica is the current primary.
- The primary then multicasts the request to all backups.
- If the client does not receive replies soon enough, it broadcasts the request to all replicas (see the client-side sketch after this list):
- If the request was already executed but the client did not receive the response, the replicas simply re-send the reply.
- If not, the replicas relay the request to the current primary. If the primary does not multicast it to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.
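Putting the client-side rules together, here is a hedged sketch of the retry logic. The transport helpers (send_to_primary, collect_replies, broadcast) are invented placeholders, not part of any real library:

```python
# Sketch of the client-side logic (transport and signature checks omitted;
# helper methods on `client` are hypothetical).
from collections import Counter

def wait_for_result(replies, f):
    """Return a result once f+1 replies from different replicas agree on it."""
    counts = Counter()
    seen_replicas = set()
    for reply in replies:                      # replies arrive over the network
        if reply.replica_id in seen_replicas:
            continue                           # count each replica only once
        seen_replicas.add(reply.replica_id)
        counts[reply.result] += 1
        if counts[reply.result] >= f + 1:      # at least one honest replica agrees
            return reply.result
    return None                                # not enough matching replies yet

def send_request(client, request, f):
    client.send_to_primary(request)            # normal path: the primary orders it
    result = wait_for_result(client.collect_replies(timeout=2.0), f)
    if result is None:
        # Timed out: broadcast to all replicas. Replicas either re-send a
        # cached reply or relay the request to the current primary.
        client.broadcast(request)
        result = wait_for_result(client.collect_replies(timeout=2.0), f)
    return result
```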
Normal Case Operation:
During normal execution of the algorithm, i.e., when a primary is already chosen, the primary starts a three-phase protocol to atomically multicast each client request m to the replicas.
These three phases are as follows:
Pre-Prepare:
- The primary assigns a sequence number, n, to the request.
- Then it multicasts a pre-prepare message containing the request to all the backups and then appends the message to its log.
- A backup accepts this pre-prepare message if:
- The signatures in the request and in the pre-prepare message are correct.
- The backup is in the same view as the one in the pre-prepare message.
- It has not already accepted a pre-prepare message for the same view and sequence number with a different request digest.
- The sequence number in the pre-prepare message is between a low water mark and a high water mark, which prevents a faulty primary from exhausting the sequence-number space (the acceptance check is sketched after this list).
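These conditions translate almost directly into code. This is a sketch under the same simplifications as before (verify, log, and the water-mark fields are invented placeholders for the replica's state):

```python
# Sketch of a backup's pre-prepare acceptance check (signatures and the
# log structure are simplified stand-ins).

def accept_pre_prepare(backup, pre_prepare, request) -> bool:
    # 1. Signatures on the request and the pre-prepare must be valid.
    if not (backup.verify(request) and backup.verify(pre_prepare)):
        return False
    # 2. The backup must be in the same view as the pre-prepare.
    if pre_prepare.view != backup.current_view:
        return False
    # 3. It must not have accepted another pre-prepare for the same view and
    #    sequence number with a different request digest.
    previous = backup.log.get((pre_prepare.view, pre_prepare.seq))
    if previous is not None and previous.digest != pre_prepare.digest:
        return False
    # 4. The sequence number must lie between the water marks.
    return backup.low_watermark < pre_prepare.seq <= backup.high_watermark
```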
Prepare:
- If a backup accepts the pre-prepare message, it enters the prepare phase by multicasting a prepare message to all other replicas and adding both messages to its log. Otherwise, it does nothing.
- A replica, including the primary, accepts prepare messages and adds them to its log provided their signatures are correct, the view number matches its current view, and the sequence number is between the water marks.
- We define the predicate prepared to be true if and only if the replica has inserted in its log the request, the matching pre-prepare message, and 2f prepare messages from different backups that match that pre-prepare.
These first two phases guarantee that non-faulty replicas agree on a total order for the requests within a view.
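The prepared predicate can be read straight off the log contents. A rough sketch, assuming the log simply keeps the requests (keyed by digest), pre-prepares, and prepares it has accepted (container names are mine, not the paper's):

```python
# Sketch of the prepared predicate: true once the replica has logged the
# request, a matching pre-prepare, and 2f matching prepares from different
# backups.

def prepared(log, view, seq, digest, f) -> bool:
    has_request = digest in log.requests          # log.requests: digest -> request
    has_pre_prepare = any(
        p.view == view and p.seq == seq and p.digest == digest
        for p in log.pre_prepares
    )
    matching_backups = {
        p.replica_id
        for p in log.prepares
        if p.view == view and p.seq == seq and p.digest == digest
    }
    return has_request and has_pre_prepare and len(matching_backups) >= 2 * f
```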
Commit:
- A replica multicasts a commit message to the other replicas when prepared becomes true.
- Replicas accept commit messages and insert them in their log provided they are properly signed, the view number matches the replica's current view, and the sequence number is between the water marks.
- We say a request is committed if and only if prepared is true for every replica in some set of f+1 non-faulty replicas.
- Committed-local is true for a replica if and only if prepared is true for that replica and it has accepted 2f+1 matching commit messages (possibly including its own) from different replicas.
- A replica executes the requested operation once committed-local is true and all requests with lower sequence numbers have been executed (see the sketch after this list).
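Continuing the previous sketch (and reusing its prepared function), committed-local and the execution rule could look roughly like this; the helper names are again invented for illustration:

```python
# Sketch of the commit-phase rule: a replica executes a request once it is
# prepared locally and 2f+1 matching commits (its own included) are logged.

def committed_local(log, view, seq, digest, f) -> bool:
    matching_replicas = {
        c.replica_id
        for c in log.commits
        if c.view == view and c.seq == seq and c.digest == digest
    }
    # `prepared` is the predicate sketched in the Prepare phase above.
    return prepared(log, view, seq, digest, f) and len(matching_replicas) >= 2 * f + 1

def maybe_execute(replica, view, seq, digest, f):
    # Execute in sequence-number order so every replica applies the same order.
    if committed_local(replica.log, view, seq, digest, f) \
            and replica.last_executed == seq - 1:
        result = replica.state_machine.apply(replica.log.requests[digest])
        replica.last_executed = seq
        replica.send_reply(result)   # the reply goes back to the client
```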
View Changes Case:
The view change mechanism allows the protocol to make progress when the primary fails, i.e., when it crashes or behaves maliciously. It is triggered by timeouts that prevent backups from waiting indefinitely for requests to execute.
A backup starts a timer when it receives a request it has not executed yet. When the timer expires, the backup starts a view change to move the system to view v+1: it stops accepting normal messages and multicasts a view-change message to all replicas, along with information about the latest stable checkpoint, i.e., the latest state that was agreed upon on the network.
After receiving 2f view-change messages for view v+1, the replica that will become the next primary multicasts a new-view message to the network. This message carries a new set of pre-prepare messages, built as follows:
- The new primary determines the lowest and highest sequence numbers (the water marks) from the view-change messages it received;
- It creates a new pre-prepare message for the new view for each sequence number between these water marks and multicasts them to the whole network, so that replicas that missed some requests can catch up.
Replicas that receive and accept the new-view message then multicast prepare messages for each of these pre-prepares to all other replicas.
And thus a new primary takes over and carries on with the protocol as in the normal case.
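To tie the view-change flow together, here is a rough sketch of the trigger and the new-view construction. The message contents are heavily simplified (the real view-change message carries checkpoint proofs and prepared certificates), and all helper methods on the replica object are invented for illustration:

```python
# Rough sketch of the view-change flow (checkpoint proofs and prepared
# certificates are reduced to plain fields for readability).

def on_request_timer_expired(replica):
    # Stop normal processing and ask to move to view v+1.
    new_view = replica.current_view + 1
    replica.accepting_normal_messages = False
    replica.multicast({
        "type": "view-change",
        "new_view": new_view,
        "last_stable_checkpoint": replica.last_stable_checkpoint,
        "prepared_entries": replica.prepared_entries(),  # what this replica prepared
    })

def on_enough_view_changes(new_primary, view_change_msgs, new_view):
    # The designated primary of view v+1 acts once it has 2f view-change messages.
    if len(view_change_msgs) < 2 * new_primary.f:
        return
    min_s = max(m["last_stable_checkpoint"] for m in view_change_msgs)
    max_s = max(
        (e.seq for m in view_change_msgs for e in m["prepared_entries"]),
        default=min_s,
    )
    # Re-issue a pre-prepare in the new view for every sequence number in range,
    # so replicas that missed requests can catch up.
    pre_prepares = [
        new_primary.build_pre_prepare(new_view, seq)
        for seq in range(min_s + 1, max_s + 1)
    ]
    new_primary.multicast({
        "type": "new-view",
        "view": new_view,
        "pre_prepares": pre_prepares,
    })
```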
Conclusion
With this, we finish this article about BFT and PBFT. In the next one, we will discover a more interesting and more efficient way of solving the BFT problem: HotStuff, the consensus protocol that Facebook plans to use in its new cryptocurrency project, Libra.
Link to the following article:
References
- The Byzantine Generals Problem: https://dl.acm.org/citation.cfm?id=357176
- Practical Byzantine Fault Tolerance: https://www.researchgate.net/publication/2516268_Practical_Byzantine_Fault_Tolerance