An Introduction to PBFT Consensus Algorithm
The PBFT (Practical Byzantine Fault Tolerance) consensus algorithm allows a distributed system to reach consensus even when a small number of nodes behave maliciously (for example, by falsifying information). During message transmission, PBFT uses cryptographic primitives such as digital signatures, signature verification, and hashing to keep messages irrevocable, unforgeable, and indisputable. It also improves on earlier BFT algorithms, reducing their complexity from exponential to polynomial. In a distributed system of 3f+1 nodes (where f is the maximum number of Byzantine nodes), consensus can be reached as long as at least 2f+1 non-Byzantine nodes are functioning normally.
Proving the PBFT Algorithm
A distributed system with 3f+1 nodes can tolerate at most f faulty (Byzantine) nodes. When 2f+1 nodes reach a consensus on a certain message, the whole system reaches a consensus as well.
First, reaching a consensus is, at its core, a matter of majority rule: nodes in the system act on the decision reached by the majority. The system may contain a certain number of faulty nodes, and a faulty node might broadcast to one group of non-faulty (non-Byzantine) nodes that it supports a certain request while telling a different group the exact opposite. By playing both sides, it interferes with the nodes' consensus process. In the worst case, all f faulty nodes jointly interfere with the consensus process. Next, we prove that the system can still reach a consensus under this worst-case scenario.
Suppose that at a certain moment a network partition divides the non-faulty nodes into two groups: f non-faulty nodes side with message A, forming Set(A, f), while another f non-faulty nodes side with message B, forming Set(B, f). The f faulty nodes tell the nodes on A's side that they support message A and tell the ones on B's side that they support message B. From Set(A, f)'s point of view, message A has gathered 2f votes; from Set(B, f)'s point of view, message B has gathered 2f votes. Once the one remaining non-faulty node takes a stance, one side becomes the majority with 2f+1 votes while the other stays at 2f. Say the remaining node sides with A (the argument for B is symmetric). Since there are only f faulty nodes in the system, the non-faulty nodes siding with message A number f+1 while those siding with B number f. By majority rule among the non-faulty nodes, the system reaches a consensus on A. After the network partition heals, Set(B, f) will also learn that a consensus was reached and will execute it.
In other words, in a distributed system with at most f faulty nodes, as long as there are 3f+1 nodes in total, the majority of non-faulty nodes will always be able to reach a consensus, no matter how the f faulty nodes interfere with the process.
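The counting argument above can be sketched numerically. This is an illustrative check (not from the article): with n = 3f+1 nodes, any two quorums of size 2f+1 must share at least f+1 nodes, so two quorums always overlap in at least one non-faulty node.

```python
def min_quorum_overlap(f: int) -> int:
    """Smallest possible intersection of two quorums of size 2f+1 among 3f+1 nodes."""
    n = 3 * f + 1          # total nodes in the system
    quorum = 2 * f + 1     # votes needed for consensus
    # Each quorum excludes at most n - quorum nodes, so the two quorums'
    # intersection has at least quorum - (n - quorum) = 2*quorum - n members.
    return 2 * quorum - n

for f in range(1, 6):
    overlap = min_quorum_overlap(f)
    # overlap == f + 1 > f, so the intersection always contains a non-faulty node
    assert overlap == f + 1
```

Since the overlap (f+1) exceeds the number of faulty nodes (f), two conflicting messages can never both gather 2f+1 votes from nodes acting consistently.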
Process of PBFT Algorithm
Client: client nodes, responsible for sending transaction requests.
Primary: the main node, responsible for packing transactions into blocks and finalizing blocks. Each round of consensus has one and only one Primary node.
Replica: replica nodes, responsible for finalizing blocks. Each round of consensus involves multiple Replica nodes, and they all proceed in the same way.
Both Primary and Replica nodes are consensus nodes.
Process of the algorithm
The PBFT consensus algorithm consists of the following steps: the client sends a request; the system executes the three-phase consensus protocol; the client receives the responses and confirms that consensus has been reached (shown in Fig. 1).
Client sends a request
As shown in Fig. 1, Client D sends a request to the system, and the consensus nodes (R0, R1, R2, R3) receive it. After the Primary node (R0 in this case) broadcasts the pre-prepare message, the system begins the three-phase consensus. The client's request here can be viewed as a set of multiple transactions.
Three-phase consensus protocol
PBFT consensus consists of three phases: Pre-Prepare, Prepare, and Commit. Together, they form the core of the PBFT consensus algorithm:
Pre-Prepare: the Primary node verifies the requests and generates corresponding pre-prepare messages, which it broadcasts to all Replica nodes. On receipt, each Replica node verifies the legitimacy of the pre-prepare message and then broadcasts a corresponding prepare message;
Prepare: gathering prepare messages. Once a node has gathered 2f+1 prepare messages, it announces that it is ready for block submission and starts broadcasting commit messages;
Commit: gathering commit messages. Once a node has gathered 2f+1 commit messages, it processes the request cached locally and makes the corresponding changes to the system state.
The Primary node (R0 in Fig. 1) sends pre-prepare messages <<PRE-PREPARE, v, n, d>, m> to the Replica nodes (R1, R2, R3 in Fig. 1). Here v is the view number, n is the sequence number, d is the message digest, and m is the original message data.
Replica nodes receive pre-prepare messages and verify the following:
a. the signature on m is legitimate and d matches m, i.e., d = hash(m);
b. the node is currently in view v;
c. the node has not accepted a different pre-prepare message for the same view v and sequence number n; that is, there is no other m' with digest d' = hash(m') already accepted at (v, n);
d. h <= n <= H, where h and H are the low and high watermarks for n.
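Checks a-d above can be expressed as a single predicate. This is a sketch with hypothetical helper names (`valid_pre_prepare`, the `msg` dictionary layout, and the injected `verify_signature` callback are illustrative assumptions, not a real API):

```python
import hashlib

def valid_pre_prepare(msg, current_view, accepted, h, H, verify_signature):
    """msg fields: v (view), n (sequence), d (digest), m (payload bytes)."""
    # a. the signature on m is valid and d matches m: d == hash(m)
    if not verify_signature(msg):
        return False
    if msg["d"] != hashlib.sha256(msg["m"]).hexdigest():
        return False
    # b. the replica is currently in view v
    if msg["v"] != current_view:
        return False
    # c. no conflicting pre-prepare with a different digest accepted at (v, n)
    if accepted.get((msg["v"], msg["n"]), msg["d"]) != msg["d"]:
        return False
    # d. the sequence number lies inside the watermark window
    return h <= msg["n"] <= H
```

A replica would call this on every incoming pre-prepare and only broadcast its prepare message when the predicate returns true.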
Once verification succeeds, each Replica node sends out a corresponding prepare message <PREPARE, v, n, d, i>, where i identifies the Replica node.
Consensus node i receives 2f prepare messages from other consensus nodes (2f+1 in total, including its own) and verifies that the v, n, and d in these messages are consistent with the ones it sent out. Once verification succeeds, node i sets prepared(m, v, n) to true; this predicate indicates that the node considers the Prepare phase for message m at (v, n) complete. Finally, node i broadcasts a commit message <COMMIT, v, n, d, i> and enters the Commit phase.
Consensus node i receives 2f commit messages from other consensus nodes (2f+1 in total, including its own) and verifies that their v, n, and d are consistent. Once verification succeeds, the node sets committed-local(m, v, n) to true; this predicate indicates that the node has confirmed message m was voted for by at least 2f+1 nodes, and hence by at least f+1 non-faulty nodes.
Response to requests from Client
When Client D receives f+1 identical commit messages, it confirms that a consensus has been reached on its request. As proved earlier, f+1 identical commit messages must include at least one from a non-faulty node, and a non-faulty node only sends a commit message after at least 2f+1 nodes have voted for the request. The client can therefore be sure that consensus has been reached.
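The client-side rule can be sketched as follows; this is an illustrative helper (the `confirmed_result` name and the `(result, sender)` reply format are assumptions for the example), which accepts a result once f+1 distinct nodes report the same value:

```python
from collections import defaultdict

def confirmed_result(replies, f):
    """replies: iterable of (result, sender_id) pairs received by the client."""
    voters = defaultdict(set)
    for result, sender in replies:
        voters[result].add(sender)        # count each node at most once
        if len(voters[result]) >= f + 1:  # at least one non-faulty node agrees
            return result
    return None                           # not enough matching replies yet
```

With f = 1, two matching replies from distinct nodes suffice, while duplicate replies from the same node are ignored, so a single faulty node cannot forge confirmation on its own.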
The view-change protocol is crucial to the PBFT consensus algorithm. When no consensus is reached on a request within a bounded time, the old data stays consistent and the system keeps its current state, but availability suffers; to restore it, a new scheme is needed. PBFT applies the view-change protocol to make the system available again: a new Primary node is elected to reach a consensus and respond to the client within a bounded time, thereby fulfilling the availability requirement.
The view-change protocol is triggered when a Replica node concludes that the current Primary node cannot reach a consensus on the client's request within a bounded time. This may happen because the Primary node is temporarily unavailable, because it is a faulty node, or because the network of the distributed system is unstable, among other reasons. The view-change protocol must account for multiple possibilities to tolerate Byzantine nodes; for example, Byzantine nodes might trigger the protocol, and the Primary node in the new view might itself be Byzantine. We will not elaborate on these cases here.
It should be noted that a view change does not roll back committed messages. In other words, a consensus reached in the old view on sequence number n remains valid in the new view.
In the PBFT three-phase protocol, the Primary node initiates the consensus procedure, while Primary and Replica nodes both participate in message verification and voting. If the consensus nodes in the distributed system cannot reach a consensus, the view-change protocol is executed and a newly elected Primary node initiates a new consensus procedure. If consensus is reached on a message, all nodes execute it and the system state changes. Because the state cannot change without a consensus, the system maintains consistency; in other words, the current state is irreversible. Meanwhile, the view-change protocol guarantees the safety and liveness of the entire distributed system, so that the system can quickly recover from unavailability and continue to provide services.
PBFT Pros and Cons
The PBFT consensus algorithm tolerates a bounded number of Byzantine nodes while providing safety and liveness for an asynchronous distributed system. Improving on earlier BFT (Byzantine Fault Tolerance) algorithms, PBFT lowers the communication complexity from exponential to polynomial, making BFT applicable to real systems.
The PBFT consensus mechanism guarantees that the distributed system offers strong consistency, which makes it a good fit for private and consortium chains.
PBFT involves heavy communication and has limited scalability. When the number of consensus nodes in the distributed system grows large, performance degrades dramatically. Under frequent network partitions, consensus efficiency is compromised, resulting in long delays.
The PBFT consensus algorithm adopts the pattern in which the state changes only after consensus is reached, which ensures real-time strong consistency in the distributed system. This is instructive for systems built on the DPoS consensus algorithm: currently, blocks on the TRON network take at least 18 slots to reach a consensus, which poses a difficulty for decentralized applications that demand strong state consistency.
In the third article, we will introduce a significant improvement to the TRON network's consensus mechanism, achieved by combining the DPoS consensus algorithm with the PBFT three-phase protocol. This enables TRON to ensure system consistency while offering highly available services.