Battle Report: 7 days, 2 major issues, 2 fixes, improved resilience

Lucian Mincu
Published in Elrond Network
Jun 12, 2020
  • Battle of Nodes: Onchained produced two useful results after only 4 days
  • Anti-flood mechanism prevented consensus in high TPS — fix
  • Redundant validator setup caused double-signing edge-case & fork — fix

🌎 1609 validators & 1838 nodes ⚡13,000 max TPS

🆕 6 releases in 7 days 🛠 14 improvements & 16 bug fixes

We created the Battle of Nodes incentivized testnet events to gain outside perspectives on how our blockchain operates in real-world conditions, to understand the usability of our tools from the perspective of our community of users, validators, and developers, and, most importantly, to test its security.

The previous two Battle of Nodes events have brought us several insights in the form of issue submissions and vulnerability reports which we have used to further the implementation of our protocol to meet what we envision to be the security & performance standards of an internet-scale decentralized blockchain.

Seven days into Battle of Nodes: Onchained, we have exceeded the 1500 active validators needed in the first version of the mainnet for two shards and the Metachain, and nodes can still join and leave seamlessly. We have already gained a better understanding of running a full-scale network, thanks to our highly diverse community, which uncovered ways our protocol can be improved to cope with very specific situations.

Learning 1: High TPS is not flood

Our initial analysis of the number of consensus messages during a round yielded a value of 1 — for both the block proposers and validators signing it. For the proposer, the only message in a round was the proposed block. The validators would then sign and send a maximum of 1 message in each round.

However, the Metachain has specific security settings: its consensus group size is equal to its number of participants. The number of consensus messages exchanged under high TPS across a very geographically dispersed network showed us that our parameters needed tweaking to better cope with real-world environments.

Battle of Nodes chain monitoring

There is an edge case in synchronous systems in which the current time view of peers might differ by a small amount, so 2 messages might be received from the same peer id in one round. There is also a risk that a message is delayed so much that it reaches the peers in the following round. For this reason, the maximum number of consensus messages received from the same peer id in a round was set to 2.

The latest updates changed the maximum number of messages the same peer id can send in a round. If the proposed block becomes too large, the proposer will send 2 messages: one containing the header and one containing the block body. A new optimization was also introduced: after the leader gets all the required signatures, it broadcasts a special, small-sized message on the consensus topic containing the aggregated signature, the public key bitmap, and the sealing signature, so that the current, executed block is committed before the signed and sealed header reaches all the nodes in the shard.

These changes brought the maximum number of messages in a round to 3. Since an old message (from the previous round) can sometimes be received in the current round, the value was raised to 4.

In addition, there is a small chance that the current leader will also be the next leader. If the pool then contains 0 transactions, the block will be assembled very fast, on the order of milliseconds. This “early” broadcast message might hit the edge case again, raising the maximum number of consensus messages per peer per round to 5.
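The reasoning in this section can be tallied as a small worked sum. The sketch below is illustrative only: the labels are shorthand for the message kinds described above, not identifiers from the node software, and the function name is hypothetical.

```python
# Worst-case count of consensus messages that one peer can legitimately
# deliver within a single round, following the reasoning in the text.
# The labels are illustrative shorthand, not names from the node software.
MESSAGE_BUDGET = [
    ("block header from the proposer", 1),
    ("block body, sent separately when the block is large", 1),
    ("final message with aggregated signature and bitmap", 1),
    ("delayed message from the previous round", 1),
    ("early proposal when the same leader leads the next round", 1),
]


def max_messages_per_round(budget):
    """Sum the per-round allowance over all message kinds."""
    return sum(count for _, count in budget)


if __name__ == "__main__":
    running = 0
    for label, count in MESSAGE_BUDGET:
        running += count
        print(f"{running}: {label}")
```

Each entry corresponds to one of the cases discussed above, so the running total reproduces the progression of the limit up to 5.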

TL;DR: our anti-flood mechanism was configured too tightly and flagged valid consensus messages as spam under high TPS in a geographically dispersed setting. As a result, we raised peerMaxMessagesPerSec to 5 so that the consensus topic is not blocked by the topic flood preventer component.
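To make the effect of the limit concrete, here is a minimal sketch of a per-peer message counter. It is not the topic flood preventer component itself: the class `PeerRoundLimiter`, its methods, and the helper `count_accepted` are hypothetical names, and the sketch simply counts messages per peer between resets, whatever the real interval (round or second) happens to be.

```python
from collections import defaultdict


class PeerRoundLimiter:
    """Counts messages per peer between resets (hypothetical sketch)."""

    def __init__(self, max_messages_per_peer):
        self.max_messages_per_peer = max_messages_per_peer
        self.counts = defaultdict(int)

    def accept(self, peer_id):
        """Return True if the message is within the peer's allowance."""
        if self.counts[peer_id] >= self.max_messages_per_peer:
            return False  # over the limit: treated as flooding and dropped
        self.counts[peer_id] += 1
        return True

    def reset(self):
        """Clear all counters, for example when a new interval starts."""
        self.counts.clear()


def count_accepted(limit, peer_id, n_messages):
    """How many of n_messages from one peer pass a fresh limiter."""
    limiter = PeerRoundLimiter(limit)
    return sum(limiter.accept(peer_id) for _ in range(n_messages))
```

With a limit of 2, a leader that legitimately sends 5 messages in the worst case would have 3 of them dropped; with a limit of 5, all of them pass.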

Learning 2: Redundant setups can do more harm than good

If a malicious proposer creates more than one valid block per round, by generating several aggregate signatures for the same accepted proposed block and broadcasting them, not all of the network would receive the same block in the same round. This situation, also called double signing, could trigger forks with bad consequences for the network.

We actually saw this happen at round 6845 on the Metachain, where the block proposer with BLS key 6d264975f8ac…1f1dda3c581 was running the same key on two different machines. As a result, all shards stopped notarizing the Metachain and were therefore unable to sync with the rest of the network. This caused the node reshuffling function to miss delivering the newly required nodes at epoch change, so not enough validators arrived at the Metachain to keep the chain running.

Detailed analysis of the specific double-signing scenario encountered

As a solution to this particular situation, we have added a protection mechanism to the ForkDetector component. Now, if there are two blocks with the same round and the same nonce, the fork is no longer triggered, provided that in the meantime a proposed block with a higher nonce has been received. This behavior will be slashable on mainnet.
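The rule described above can be sketched as a small predicate. This is one reading of the prose under stated assumptions, not the actual ForkDetector logic, which is certainly more involved: the `Header` type, its fields, and the function `should_signal_fork` are hypothetical, and headers are compared only by round, nonce, and hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    """Minimal stand-in for a block header (hypothetical fields)."""
    round_number: int
    nonce: int
    header_hash: str


def should_signal_fork(received):
    """Return True if two different headers share round and nonce and
    no header with a higher nonce has been received in the meantime."""
    highest_nonce = max((h.nonce for h in received), default=-1)
    first_hash_seen = {}
    for h in received:
        key = (h.round_number, h.nonce)
        first_hash = first_hash_seen.setdefault(key, h.header_hash)
        conflicting = first_hash != h.header_hash
        if conflicting and highest_nonce <= h.nonce:
            return True
    return False
```

For example, two conflicting headers at the same round and nonce signal a fork on their own, but not once a header with a higher nonce is also present:

```python
a = Header(6845, 100, "aa")
b = Header(6845, 100, "bb")
c = Header(6846, 101, "cc")
print(should_signal_fork([a, b]))     # conflicting pair, nothing newer
print(should_signal_fork([a, b, c]))  # a higher nonce has arrived
```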

TL;DR: A node operator deployed two nodes with the same key and triggered a specific double-signing scenario that caused a fork. We have addressed this by improving our ForkDetector component.

Necessary lessons from real world testing

We are convinced there are many other possible situations that can hinder the proper functioning of a blockchain. As we are deploying the first truly sharded proof-of-stake architecture, we are the first ever to face certain specific problems, and thus the first to solve them. We are thankful to our community for their valuable contributions, and will remain vigilant for any other issues that might be discovered.

Please join our Validators chat to find out more about these particular issues and to try Elrond out for yourself. Join Battle of Nodes: Onchained to be part of the journey, help run and improve our testnet, and earn your rewards.

See you on the battlefield.
