TOP Network Technical Spotlight | Sharding (Part 2) | Security Through Randomness
In the previous article, we introduced the three different forms of sharding. These were state sharding, compute sharding, and network sharding. Now we will lay out the major challenges involved with sharding a blockchain, after which we will describe how TOP Network begins to tackle each of these difficulties.
Blockchain sharding is extremely tricky, and many potential complications can arise. Of course, if it were easy, most blockchain projects would have implemented some form of sharding by now. The main challenges include reduced security, data availability, and the overhead of cross-shard transactions and synchronization. In this article, we will tackle the first of these issues.
Security and Fault Tolerance in Sharded Blockchains
Most sharded blockchains use some form of BFT PoS consensus mechanism. Recall from a previous article that BFT consensus mechanisms require at least 2/3 of nodes to be honest in order to reach consensus; if more than 1/3 of the nodes are faulty, consensus will not be reached. A faulty node could be an offline node or one with malicious intent. Another way of stating this is that BFT algorithms are roughly 33% fault tolerant: they can tolerate up to 1/3 of nodes being faulty before the mechanism breaks.
Consider a blockchain system with 900 validators/block producers, which could represent PoS or DPoS nodes. Using a BFT consensus algorithm, the blockchain remains safe provided no more than 300 nodes (1/3 of the total) are faulty. If a malicious entity were to create over 300 nodes, or bribe currently active ones, it could initiate a 33% attack, bringing the blockchain to a halt or possibly worse. Depending on how difficult it is to create a node, a fault-tolerance threshold of 300 nodes could be considered reasonably high.
Now let’s imagine what would happen if this same blockchain system were sharded in terms of computation — into 3 shards for example. The network of consensus nodes would be divided into thirds, with each shard assigned 300 validators. Each group of nodes in each shard would process transactions in parallel using the same BFT algorithm.
How many malicious nodes would be needed to compromise a single shard? Again, if more than 1/3 of the nodes in any BFT consensus group act dishonestly, the consensus algorithm fails. In our scenario, each shard has 300 consensus nodes, so anything over 100 faulty nodes (more than 1/3 of 300) would break the security guarantee.
Notice that before sharding, a malicious entity would have needed to take over a minimum of 301 nodes to exceed the fault tolerance threshold. With three shards, the entity would only need to take control of 101 nodes to compromise a single shard, roughly 1/9 of the total number of nodes in the system. So sharding reduced the fault tolerance threshold to about 11%, and that was with only three shards. What if we used 15 shards? Then the fault tolerance threshold drops to only 2.2%! With 900 nodes split into 15 shards of 60 nodes each, a malicious entity would only need to take control of 21 nodes to compromise one shard. This threat model is dubbed the Single-Shard Takeover, and it is one of the most important issues that a sharded blockchain must account for.
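The arithmetic above can be sketched as a quick back-of-the-envelope calculation, under a simple model that assumes nodes are split evenly across shards and a shard breaks once more than 1/3 of its members are faulty:

```python
# Per-shard takeover threshold under a simple even-split model:
# a shard of size s breaks once more than s/3 of its nodes are faulty.
def takeover_threshold(total_nodes: int, num_shards: int):
    shard_size = total_nodes // num_shards
    nodes_needed = shard_size // 3 + 1        # smallest count exceeding 1/3 of shard
    fraction_of_network = nodes_needed / total_nodes
    return shard_size, nodes_needed, fraction_of_network

for shards in (1, 3, 15):
    size, needed, frac = takeover_threshold(900, shards)
    print(f"{shards:>2} shard(s): {size} nodes each, "
          f"{needed} faulty nodes break one shard ({frac:.1%} of network)")
```

With one shard an attacker needs 301 of 900 nodes (over 33% of the network); with three shards, 101 nodes; with 15 shards, just 21 nodes, a little over 2% of the network.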
A fault tolerance of 2.2% with 15 shards sounds dismal, but it is not nearly as bad as it appears. This calculation is a worst-case scenario that assumes every one of a malicious entity's nodes ends up in the same shard. But who decides which nodes go into which shards? In practice, almost every sharding implementation uses some sort of random selection process to place nodes into shards.
While in our example with 15 shards and 900 nodes a malicious entity would technically only need to create 21 nodes to compromise a single shard, the chance that all 21 of its nodes are randomly placed into the same shard is extremely low: around 1 in 693 quadrillion. And this assumes only 21 malicious, colluding nodes out of 900. Let's use a more pessimistic number, such as 225 malicious nodes (25% of 900). Then the probability that a particular shard out of the 15 contains at least 21 malicious nodes is about 5.4%, which is substantially higher. Requiring a higher minimum number of nodes per shard would reduce this probability.
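Under the simple model where a 60-node shard is drawn uniformly at random from the full node pool, the number of colluders landing in a given shard follows a hypergeometric distribution. A stdlib-only sketch of that calculation (the exact modeling assumptions behind the figures quoted above are not spelled out, so treat this as an approximation rather than an exact reproduction):

```python
from math import comb

def prob_at_least(total: int, malicious: int, shard_size: int, threshold: int) -> float:
    """P(a shard of `shard_size`, drawn uniformly from `total` nodes of which
    `malicious` collude, contains at least `threshold` colluders).
    Hypergeometric tail: sum the pmf from `threshold` upward."""
    return sum(
        comb(malicious, k) * comb(total - malicious, shard_size - k)
        for k in range(threshold, min(malicious, shard_size) + 1)
    ) / comb(total, shard_size)

# 225 of 900 nodes are malicious; a 60-node shard breaks at 21 colluders.
p = prob_at_least(900, 225, 60, 21)
print(f"P(given shard compromised) ~ {p:.1%}")
```

Raising the per-shard minimum pushes the break-even threshold further into the distribution's tail, which is why larger shards shrink this probability.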
From this discussion, the trade-off is clear: more shards yield higher throughput but lower security. By introducing additional methods to increase security, a higher number of shards can safely be used, which in turn increases scalability.
TOP Network’s Approach
The way TOP Network approaches the Single-Shard Takeover attack vector is three-fold: layered randomness, secondary audits, and periodic shuffling.
Where Does Randomness Come From?
Before we explain the layers of randomness, it is important to know how the randomness is generated, because the source of randomness is crucial for security. To be safe, the randomness must be unpredictable, verifiable, and bias-resistant. If it can be predicted or manipulated, a malicious entity could direct its nodes into a single shard, enabling the Single-Shard Takeover previously discussed.
TOP Network employs a Verifiable Random Function (VRF) to generate randomness. A VRF enables each node to create a random seed based on some specific input data, and then allows others to verify if the random seed was truly generated by the corresponding node based on the correct input data as per the protocol rules. These random seeds provide the means for random sorting.
The properties of VRFs are important to stop malicious nodes from finding ways to bias the random sorting process. The key methods of VRFs are as follows:
- Generate (private key, input data) → (random output, proof)
Using its own private key and some protocol-determined input data, a “prover” generates a random output and a proof.
- Verify (public key, input data, random output, proof) → True/False
Verifying peer nodes can use the proof presented by the prover, along with the prover’s public key, to check that the random output was actually generated by the prover from the correct input data as laid out by the protocol rules. This ensures that a node cannot simply try arbitrary inputs until a desired random output is obtained. Since the random output is unique to the prover and input, the network can be sure it was generated by that particular node.
From a high level, VRF-based sorting is like a lottery. Each node checks whether it “won” by inputting its account address and some other public data into the VRF. If the output is a “winning” random seed, the node is sorted into a shard. To make sure that a node actually holds a winning ticket, other nodes double-check using the Verify feature of the VRF.
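To make the Generate/Verify interface concrete, here is a toy RSA-based sketch in the spirit of an RSA-FDH VRF (RFC 9381 describes real constructions). Everything here is illustrative only: the hard-coded key is far too small to be secure, and this is not TOP Network's actual VRF construction.

```python
import hashlib

# Toy RSA key pair, assumed for illustration (NOT secure: tiny primes).
P_PRIME, Q_PRIME = 1000003, 1000033
N = P_PRIME * Q_PRIME
E = 65537                                  # public exponent
D = pow(E, -1, (P_PRIME - 1) * (Q_PRIME - 1))  # private exponent

def _hash_to_int(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N

def vrf_generate(private_key: int, alpha: bytes):
    """Generate(sk, input) -> (random output, proof)."""
    h = _hash_to_int(alpha)
    proof = pow(h, private_key, N)         # deterministic RSA "signature"
    output = hashlib.sha256(proof.to_bytes(8, "big")).digest()
    return output, proof

def vrf_verify(public_key: int, alpha: bytes, output: bytes, proof: int) -> bool:
    """Verify(pk, input, output, proof) -> True/False."""
    if pow(proof, public_key, N) != _hash_to_int(alpha):
        return False                       # proof does not match the input
    return hashlib.sha256(proof.to_bytes(8, "big")).digest() == output
```

Because the “signature” is deterministic, the prover cannot grind for a favorable output, and anyone holding the public key can check the claimed random seed, which is exactly the lottery-ticket check described above.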
Security Through Layered Randomness
Now that we have a general sense of how VRFs generate randomness in a verifiable way, let’s see how they are applied in TOP Network. In total, there are 7 different layers of randomness in the TOP Network protocol.
The first layer of randomness is in how DPoS* consensus nodes are sorted into shards. Consensus nodes become eligible for selection by meeting the comprehensive stake requirement. The pool of eligible nodes is then sorted into shards through a VRF mechanism. Here’s a simplified version of how that works:
- Each node inserts its private key along with the agreed-upon input data into the VRF to generate a unique random seed along with a proof.
- Nodes will verify that their peers did not cheat in acquiring their random seed by using the Verify method of the VRF.
- With the random seeds, an improved Follow-The-Satoshi (FTS) algorithm is used to select a minimum of 127 nodes for each shard. The chances of selection are thus weighted according to each node’s stake.
In this way, nodes have no idea which shard they will be placed in, and the results are bias-resistant and verifiable.
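A stake-weighted draw of this kind can be sketched as follows. This is a simplified Follow-The-Satoshi-style selection assuming a shared random seed derived from the VRF round; the node names, stake values, and parameters are invented for illustration, and TOP's improved FTS algorithm will differ in detail.

```python
import random
from bisect import bisect_right

def fts_select(stakes: dict, seed: int, count: int) -> list:
    """Pick `count` distinct nodes, each draw weighted by stake:
    treat every staked token as a lottery ticket and repeatedly
    'follow' a randomly chosen ticket back to its owner."""
    rng = random.Random(seed)              # seed would come from the VRF outputs
    names = list(stakes)
    cumulative, running = [], 0
    for name in names:
        running += stakes[name]
        cumulative.append(running)         # cumulative stake boundaries
    selected = []
    while len(selected) < count:
        ticket = rng.randrange(running)    # pick a random "satoshi"
        owner = names[bisect_right(cumulative, ticket)]
        if owner not in selected:          # re-draw if already chosen
            selected.append(owner)
    return selected

stakes = {"node_a": 500, "node_b": 300, "node_c": 150, "node_d": 50}
print(fts_select(stakes, seed=42, count=2))
```

Because every token is an equally likely “ticket,” a node staking 500 is ten times as likely to be drawn on any pass as one staking 50, yet no node can predict the draw before the seed is fixed.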
Within each shard, a consensus committee consisting of at least 29 nodes is randomly selected using another VRF mechanism to execute pBFT consensus on transactions. A new consensus committee is randomly selected after each round. In addition, the leader for each committee is selected randomly. These are the second and third layers of randomness.
The fourth layer of randomness comes from the sorting process of Advanced Nodes in the Routing Network. The Routing Network is a layer above the consensus network which handles transaction routing and state synchronization, while also performing secondary audits on all transactions. The Routing Network is split into “clusters,” which each manage a certain set of shards. A VRF mechanism is used to sort the Advanced Nodes into clusters in a similar fashion as described with consensus node partitioning into shards.
When a transaction is validated by a consensus committee within a shard, it must be audited by three randomly selected Advanced Nodes in the governing cluster, introducing a fifth layer of randomness. If a shard is somehow compromised, the secondary audit provides an additional security mechanism so that no malicious transactions will go through.
The malicious shard would need to know which Advanced Nodes will audit the transaction, which is essentially impossible to know beforehand thanks to VRF-based random selection. The attacker’s only remaining option is to take over the entire cluster of Advanced Nodes in addition to the shard, which again is exceedingly hard due to the random sorting of both Advanced Nodes and consensus nodes. Furthermore, if a consensus committee does attempt to send out a faulty transaction, the auditors can identify it as such and help reveal the offending malicious node(s).
The sixth layer of randomness comes from the path a transaction takes to reach a consensus committee. All transactions must go through the Edge Network, which provides the client access points. When an Edge Node receives a transaction, it passes it to a random Advanced Node within a cluster. The Advanced Node that receives the transaction will then pass it to a random consensus node within a shard. Both random relay processes make use of a VRF mechanism.
Although the multiple layers of randomness discussed thus far provide very high security, there is still a potential vulnerability that could be exploited by a very motivated bribing attacker. In the bribing attacker threat model, a well-funded entity attempts to slowly take over a shard by bribing nodes after they have been randomly sorted. In TOP’s architecture, the Advanced Nodes must also be bribed to bypass the secondary audit.
To make this sort of attack much harder, TOP Network employs a node shuffling mechanism. Periodically, a few nodes from each shard will be randomly relocated to new shards, adding a seventh layer of randomness to the equation.
Over time, shards will contain completely different nodes than they had previously. Because this happens continuously, bribing nodes in a single shard and its overseeing cluster is fruitless unless it can be done in a very short period, which is unrealistic. A bribing attacker typically needs time to find ways to communicate with and bribe nodes, which is made even harder since the consensus network is hidden behind the Edge and Routing Networks. By the time a significant number of nodes in a shard have been swayed, many will have been reshuffled to new shards, with others taking their place. This makes it very difficult to control the majority of nodes in a single shard and its governing cluster, which is the prerequisite for malicious activity.
While shuffling greatly increases security, it introduces a new set of problems. Each time a node is relocated, it must sync with the state of its new shard. In most proposed sharding implementations this is very difficult, as the state can be quite large even for a single shard. For TOP Network, this is made much easier by the dual-lattice structure, which allows for real-time pruning. Additionally, since only a few nodes are shuffled at a time, consensus is not interrupted while syncing takes place. The dual-lattice structure will be explained in more detail in another article.
To recap, TOP Network solves the Single-Shard Takeover issue through extensive randomness, secondary audits, and node shuffling. TOP Network does not take security lightly, as demonstrated by the implementation of 7 separate forms of randomness. In the next article, we will discuss the cross-shard transaction and synchronization problem, and learn how TOP Network overcomes this challenge.