In the past month, we have been working on our view change protocol. This is a core part in any blockchain from which we can tell whether a protocol is permissioned or permissionless and how decentralized it is.
The code is here: harmony/consensus
We launched our Day ONE Mainnet with a total of 600 nodes on 4 shards on June 28th. It has been running smoothly for the past 830,000 blocks and counting.
In our previous tests, we observed view change (i.e. leader change) happened in some shards due to bad network condition. We also manually triggered view change by killing the leader node as well as other kinds of attacks. The view change happened after the attack and the network keeps going as expected —heck yeah!
In the following, we will first explain the basic concepts of byzantine fault tolerance, then second how we improve it to handle large number of nodes in practice, and finally the overall code structure and some implementation details.
What is Byzantine Fault Tolerance?
A distributed system consists of multiple nodes where each node is an independent server. They communicate with each other by sending messages over the network and perform some tasks according to the protocol they conform to.
There are many types of faults, but they can basically be classified into two major categories. The first kind of faults are node crash, network down, packet loss, etc., where the node had no malicious intention. These are non-byzantine faults.
The second type is where a node can be malicious. It can act arbitrarily and not follow the rules of the protocol. For example, a validator can delay or refuse to relay the messages in the network, the leader can propose invalid block, or a node can send different messages to different peers. At worst, malicious nodes may collaborate with each other. These are called Byzantine faults.
With these two kinds of faults in mind, there are two properties we want the system to maintain: consistency and liveness.
In blockchain terminology, consistency means honest nodes must commit the same block for any given block number/height; liveness means the chain height must keep growing without being stuck.
In a permissioned network, where we only have the first type of faults (non-byzantine), this is easier to achieve. For example, we can pick one powerful node as leader and all the other nodes will just listen to what the leader broadcasts and trust any block the leader proposes. Even in this case, we need to watch out for the first kind of faults, especially when it happens to the leader.
For a fully decentralized network, we cannot trust any single node and assume the second kind of faults can happen in any node. There is only one fundamental assumption, which is a malicious node cannot forge the signatures other nodes signed. This is guaranteed by cryptography theory, where the difficulty of forging a signature is so high that no computer today can break in any practical time. Things might change when the quantum computer is ready. But at that time, we will use quantum-resistant cryptographic algorithm instead.
A Byzantine Fault Tolerant protocol is a protocol that can guarantee the consistency and liveness of the distributed system even when there are malicious nodes in the system. All such protocols have the basic assumption that the number of malicious node is less than some threshold. This is easy to understand, if there are over 50% malicious nodes, then the network is fully controlled by the malicious nodes.
In the case of prove of work (PoW) in Bitcoin, the requirement is less than 50% of nodes (in the computation power sense) are malicious. However, the selfish mining lowers the basic assumption to 25%. i.e. the system of PoW will be safe only if less than 25% of nodes (in the sense of computation power) are malicious.
There is deep research on Byzantine Fault Tolerance protocol in traditional distributed systems. It is proven that malicious nodes should be less than 33% of a network in the classical paper of Lamport. Later on, the famous practical byzantine fault tolerance paper PBFT makes such system practical.
There are still two remaining issues. First, such a system is permissioned, which not allow arbitrary nodes to join and leave. Second, it is not scalable to more than hundreds of nodes. The first issue is due to Sybil Attack, where a malicious user can easily create many fake identities and take over the majority of the network. It is first solved in Satoshi Nakamoto’s Bitcoin whitepaper, where the economic effect is taken into consideration. After proof of work (PoW), there are many new designs such as proof of stake (PoS), proof of authority (PoA) etc. Instead of counting the number of nodes, we counting the amount of voting power. In PoS, the voting power of a node is proportional to its staking amount. The second issue is solved by aggregated signatures using BLS signature scheme which is explained in the FBFT section.
Practical Byzantine Fault Tolerance
For protocols like Raft and Paxos, they are used to deal with the first kind of system faults. Practical Byzantine Fault Tolerance (PBFT) is one of the first Byzantine fault tolerance protocols used in the real world to deal with the both first and second kinds of faults.
We will always assume there are N nodes with at most f malicious nodes, where N=3f+1. There are two modes in the PBFT, the normal consensus (normal in short) mode and the view change mode. The normal mode looks like this (in blockchain, the client request and reply can be ignored):
In one view (one view is a similar concept of one round), there are 3 steps/phases: pre-prepare (announce), prepare and commit.
- In pre-prepare(announce) phase, the leader will broadcast announce message (e.g. the proposal block) to other nodes (called validators). When a validator receives announce message, it enters prepare phase.
- In prepare phase, after a validator receives announce message, it will broadcast prepare message (e.g. signature on blockhash) to every node. When a validator (including the leader) receives enough (i.e. ≥2f+1 ) prepare messages, it will enter commit phase.
- In commit phase, a validator (including the leader) will send commit message (e.g. signature on |blockNum|blockHash|) When a validator receives enough (≥2f+1) commit messages. It can safely commit the block. This ends one round of normal consensus process.
Notice that there are some differences between general PBFT and blockchain PBFT. The major difference is that the blockchain is “synchronized” between two blocks, i.e. we cannot proceed to commit block h+1 before commit h. In the traditional PBFT, we can commit client request h+1 before request h. The PBFT will guarantee consistency across all the nodes.
In this sense, the blockchain makes consensus process simpler. To be precise, there is a process in PBFT called checkpoint process. A checkpoint is a certificate that all the information with sequence number (in blockchain, it is the block number) less or equal than checkpoint’s sequence number are finalized. In blockchain, each committed block is finalized and can be viewed as a checkpoint.
When a validator cannot commit a new block before consensus timeout (ΔT≥T0), the validators will start view change (v→v+1), the new leader is uniquely determined in a predetermined way. If the view change cannot finish before timeout (ΔT≥T1), the validator will propose another view change (v+1→v+2, with view change timeout increases to 2*T1).
There are 2 steps/phases in view change mode:
- A validator starts view change by sending view change message containing ≥2f+1prepare messages to new leader. If it doesn’t receive enough prepare messages, it just send view change message without any prepare message to the new leader.
- The new leader collects enough (≥2f+1) view change messages and broadcast new view message containing view change messages it receives. Then the new leader switches to normal mode. A validator switches to normal mode when it receives new view message from the new leader, at the same time, it stops the view change timer and start consensus timer. If the validator doesn’t receive new view message before view change timeout, it will increase viewID by one and start another view change.
View change guarantees liveness of the network. During the view change process, we need to make sure the block committed is consistency across view change as well. Simply speaking, the receiving of 2f+1 prepare messages only ensure the consistency in the same view. The receiving of 2f+1 commit messages ensure consistency across different views. When a node receives 2f+1 commit message, it can safely commit the block into the blockchain. The PBFT protocol ensures the same block will be committed by any honest nodes even in the case of view change.
CONSISTENCY AND LIVENESS
The key concept in PBFT is the quorum. A quorum is any subset with at least 2f+1 nodes. Since there are a total of 3f+1nodes, any two quorums will intersect at least f+1nodes. Based on the assumption that there are at most f malicious nodes, there will be at least one honest node in the intersection of two quorums. This is the reason why we need a quorum to take any action.
Consistency in one view: Suppose a node received 2f+1 prepare message, these 2f+1 nodes form a quorum. Notice any two quorums will have at least one honest node in common, it means any two such quorums cannot contain different block hashes in their prepare messages, otherwise the honest node in common admits two different blocks of the same height which contradicts the fact it’s honest.
Consistency across different views: Suppose a node received 2f+1 commit message, these 2f+1 nodes form a quorum, denote it as Q1. When an honest node starts view change, it will send its prepared message (contains 2f+1 prepare messages) to new leader. The new leader needs to collect 2f+1 view change messages (denote as quorum Q2) in order to send new view message. Again Q1 and Q2 contains at least one honest node. This node contains 2f+1 prepare messages because it received enough prepare messages before it sent out its commit message. This ensures the same block will be committed by honest nodes across different views.
Liveness: Each node has a timer for normal consensus process (with T0 timeout) and a timer for view change process (with k*T1 timeout, where k is how many view changes happened before a validator can switch back to normal mode). When the timer timeout, the node will start view change by increase view by one. In the case of consecutive leaders fail to send correct new view messages, the timeout period of view change timer will be increased to avoid frequently view changes and to make sure eventually enough honest nodes will have same viewID with honest new leader.
Fast Byzantine Fault Tolerance
As an improvement on PBFT, Harmony’s consensus protocol is linearly scalable in terms of communication complexity, and thus we call it Fast Byzantine Fault Tolerance (FBFT). In FBFT, instead of asking all validators to broadcast their votes, the leader runs a multi-signature signing process to collect the validators’ votes in a O(1) -sized multi-signature and then broadcast it. So instead of receiving O(N) signatures, each validator receives only one multi-signature, thus reducing the communication complexity from O(N²) to O(N) . With some modifications in view change message, the view change complexity can also be reduced to O(N).
BLS SIGNATURE SCHEME
Here we give a very brief and mathematical introduction to Boneh–Lynn–Shacham (BLS) signature scheme which is main distinguisher between FBFT and PBFT. The BLS signature scheme is based on elliptic curve pairing. Let E(Fp) to be the elliptic curve over finite field Fp where p is a large prime number. We pick a basic reference point g on this curve. The private BLS key is a random number α sampled from Fp and the public key is α⋅g which is a point on E. Given a message m, the signature is calculated as σ=α⋅H(m) which is a point on E, where H is a hash function to E. A bilinear map on two elliptic curves E1 and E2 is a pairing if
Now we can see how the k signatures are aggregated and verified by aggregated public key.
Notice the aggregated signature looks like normal signature which is a point on elliptic curve, the aggregated public key looks like normal public key which is also a point on elliptic curve. This reduces the 2f+1 signatures into just 1 aggregated signature which is critical to reduce network traffic in consensus protocol.
In traditional PBFT, the total message size a node sends or receives in each round of consensus O(N²). This is because in prepare and commit phases, every node need collect 2f+1=O(N) signatures and broadcast them to every node (i.e. O(N) nodes) in the network. By using BLS signature scheme, we aggregate the 2f+1 signatures into one signature, this way the message size in prepare and commit phases are O(1), which reduces the total size from O(N²) to O(N) in one round. To benefit from BLS scheme, every validator will send prepare and commit message to leader only and the leader is responsible to collect enough >=2f+1 signatures and aggregated them into one aggregated signature, after that the leader send the prepared/committed message in prepare/commit phase respectively. From the leader’s perspective, the three phases are synchronized, but from validators’ point of view, they can still receive messages out of order, e.g. a validator can receive prepared message before announce message, however in this case, its prepare signature will not be included in the prepared message.
There are three phases in the normal mode:
- In announce phase, the leader will broadcast announce message (e.g. the proposal block) to validators. When a validator receives announce message, it enters prepare phase
- In prepare phase, the validator sends prepare message (e.g. signature on blockhash) to leader. When leader receives enough (i.e. ≥2f+1) prepare messages, it aggregates signatures of prepare messages received from validators and sends out prepared message contains aggregated prepare signatures. Then the leader enters commit phase. A validator enters commit phase when it receives prepared message from the leader.
- In commit phase, the validator sends commit message (e.g. signature on |blockNum|blockHash|) to leader. When the leader receives enough (i.e. ≥2f+1) commit messages, it aggregates signatures of commit messages received from validators and sends out committed message contains aggregated commit signatures. Then the leader finishes one view/round. A validator finishes one view/round after it receives committed message. When the leader or validator finishes one round, it will restart the consensus timer.
In step 3 commit phase, the validator sends commit message with signature on blockNumber and blockHash. This is convenient for the node to quickly determine whether it is out of sync without tricked by the malicious leader. How the consensus process interacts with state syncing is explained in the state syncing mode section.
There are two causes for validators to start view change process. One cause is when a validator detects the leader proposed two different announce messages in one view, it will immediately start view change. The other cause is a validator doesn’t make any progress after timeout. There are two kinds of timeouts: timeout in normal consensus mode and timeout in view change mode.
In our blockchain, we have the concept of epoch. Each epoch contains X Blocks (e.g. X=1000). In the beginning of each epoch, the committee members are determined by who has staked for this epoch in the beaconchain. The order of the committee members is uniquely determined by the VDF randomness of this epoch. During one epoch, the committee will always stay the same. Suppose the order list is [v0,…,vn]. Then in the beginning of epoch, the leader is v0. If view change happens, the next leader is v1, and so on. Here we assume each validator has equal voting power.
VIEW CHANGE MODE
The view change process is as follows:
- When the consensus timer timeouts, a node starts view change by sending view change message including viewID and prepared message (containing ≥2f+1 aggregated signatures) to new leader. If it doesn’t receive prepared message, it just sends view change message including signature on viewID but without prepared message.
- When the new leader receives enough (≥2f+1) view change messages, it aggregates signatures of viewID and just pick one prepared message from view change messages. It broadcasts new view message including aggregated signatures as well as the picked prepared message. Then the new leader switches to normal consensus mode. A validator switches to normal consensus node when it receives new view message from the new leader, at the same time, it stops the view change timer and start the consensus timer. If the validator doesn’t receive new view message before view change timeout, it will increase viewID by one and start another view change.
The second step requires each validator to send signature on viewID, the purpose is to reduce the size of new view message from O(N) to O(1) when the previous leader is malicious. To be precise, the previous leader can send different aggregated signatures to different validators in prepared phase. As long as the aggregated signature is valid, the validator will accept it and propose it when view change happens. In this scenario, the size of new view message is O(N), because the new leader has to prove the receiving of enough valid view change messages. With everyone signed on viewID, it’s easy for new leader to aggregate the signatures to reduce new view message size to O(1) again. Only this way, we can scale up the number of nodes in the network in the case of view change.
STATE SYNCING MODE
We allow a node to join and leave freely in blockchain. When a new node joins consensus, it has to do state syncing before it can validate consensus messages. Also, there are situations when a node gets stuck in view change mode. e.g. When a validator has slow network connection, it may not be able to make any progress before timeout. In this case, it will start view change. However, there is no way it can escape from view change mode because all the other nodes are moving forward and the view change will fail. In this case, this node needs to do state syncing to catch up.
The basic process is simple. When a node detects it’s out of sync by comparing its current block height with the latest block height in committed message, it will switch to state syncing mode and start doing state syncing. After it finishes state syncing, it switches to normal mode.
In order to join the consensus after state syncing finished, a node needs to know who is the current leader and what is the current viewID. One solution is to blindly accept whatever the leader and viewID from the consensus message. This approach gives the malicious leader the chance to send a large viewID along the consensus message to make every validator starts doing consensus. A better way is only accept the leader and viewID information when the node receives the committed message from the network. In this case, the malicious leader cannot trick the new node. But it slows down the process of new node joining consensus because it cannot verify announce and prepared messages until leader and viewID updated. The approach we choose is to add leader and viewID information into the block header. When a node finished state syncing, it can read the information from the latest block header. If during the state syncing, the view change happened, the information from the latest block is outdated. In this case, a new node can still update the leader and viewID information when it receives committed message.
The following drawing is the state transition graph of a validator. The state transition graph of a leader is simpler and omitted here.
There are 5 modes: 3 normal modes (A: Announce, P: Prepare, C: Commit), view change mode (VC) and state syncing mode (S).
The transitions between modes are triggered by different conditions such as receives a specific type of message or meet some conditions like timeout.
Condition list: a.m (announce message), p.m (prepared message), c.m (committed message), t.c (try catchup success), t.o (timeout), n.v (new view message), i.s (in sync), o.s (out of sync).