Developer Update: Lition Blockchain — July 2020

Hello Lition Community!

Since our last update, the Lition Development team has been keeping very busy, with a great deal of effort going into the code, as well as testing, improving, and stabilising the Testnet in preparation for an upcoming release. Over the last month, much of the focus has shifted towards optimizations, best strategies to roll out our blockchain, and bootstrapping our upcoming ecosystem of nodes.

All the effort has paid off. We’re approaching the end of the optimization process and we expect the launch of our first release to be just around the corner. This will be followed by opening the doors to external nodes.

Now, let’s dive into some details.

Q2 Development Recap

In the last dev update (link) we shared with you our progress in identifying and fixing issues rooted in the consensus mechanism that would cause the network to stall. Since then, our developers have gained a much deeper understanding of these issues and worked relentlessly towards finding a solution efficient and versatile enough for our intended use-cases.

During Q2 2020, the development effort was focused on pinpointing the different situations that were causing the IBFT consensus mechanism to stall. To achieve this, we created several local testnets and tried different network setups of validators, to understand exactly when and why they were getting out of sync and what was affecting the liveness property of our consensus protocol. We analyzed different scenarios in which validators were running out of ETH or terminating unexpectedly.

Although we initially wanted to deliver a fix as fast as possible, it became clear that the crux of the problem is deeply theoretical and fixes for it are non-trivial. For various reasons we will describe later, we decided that a better approach was to implement a mechanism that avoids the network losing its functionality whenever the blockchain stops progressing. In parallel, we continued to work on implementing and testing the definitive solution that fixes the inherent issues within the IBFT consensus model, which we plan to release once we validate that the testnet maintains stability for an extended period of time while we stress test every aspect of it.

Sidechain Restarts

As covered in the previous update, blockchain restarts were a result of validators no longer reaching consensus on the next block that would be added to the blockchain. After careful investigation in Q2, we verified our initial assumptions that this has something to do with the theoretical design of the IBFT consensus protocol.

We were indeed aware that such a problem was reported by other communities using the IBFT protocol as their basis for consensus. Most notably it was reported by several members in the Quorum community (e.g. in this github issue), but also by the Pegasys team developing Pantheon and the Clearmatics team. The issue is most rigorously described in this paper, by Roberto Saltini from Pegasys. We will do our best to summarize it below as well.

Considering the rather edge-case scenario under which this issue arises, we initially expected (or hoped) it would manifest a lot less frequently than it did in practice. However, after some tests and attempts at fixing it, we came to understand that a combination of factors specific to our use case actually increases the probability of this otherwise rare scenario occurring.

We attempted multiple times to remedy the situation by introducing fixes, which turned out to be very effective in lowering the frequency of the stalling. However, while the root theoretical issue was appearing less often, it was not as rare as had been claimed. Over time, our custom code became more functional and optimized for our needs. The expectation that the Quorum base code would also soon be updated with regard to IBFT led us to believe all variables were accounted for.

What is IBFT?

To explain the issue within the IBFT protocol, one must first understand how it works. We will do our best to describe it briefly here, but if you are interested in this topic please refer to the theoretical description presented in this EIP on Github or to Quorum’s implementation (in case you are a developer).

The IBFT consensus mechanism is a set of rules that each validator follows in order to agree on what the next block to be appended to the blockchain is. The intention of this protocol is to ensure that the network progresses as long as a super-majority of at least 66% of the validators function correctly and truthfully. The IBFT protocol consists of two main components: a consensus mechanism within a group of nodes, and a list of nodes (aka. validators) that can be modified by voting to add or remove members.

Lition adds some complexity to this system as validators need to be permissioned to the network depending on the state of the Lition SC deployed on Ethereum, so validators need to track not only what nodes are online or not, but also who vested in the Lition SC.

The resulting list of validators is used to determine who can participate in the consensus and how many votes are needed to reach a 66% super-majority. Assuming this list represents the real state of the network, nodes take turns in a round-robin fashion to propose blocks.
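As a rough illustration of the round-robin rotation, the proposer can be derived from the validator list and the current position in the chain. The function name and the indexing rule (block height plus round number) are assumptions made for this sketch; real implementations may instead rotate starting from the previous proposer.

```python
def proposer_for(validators: list, height: int, round_number: int) -> str:
    """Pick the proposer by rotating through the sorted validator list.

    Hypothetical rule: every node sorts the list the same way, so all
    nodes derive the same proposer for a given height and round.
    """
    ordered = sorted(validators)
    return ordered[(height + round_number) % len(ordered)]


validators = ["0xC3", "0xA1", "0xB2"]  # made-up addresses
print(proposer_for(validators, height=10, round_number=0))
print(proposer_for(validators, height=10, round_number=1))
```

Because the rule only depends on the list, the height, and the round, a failed round automatically hands the turn to the next validator in the list.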

1. Consensus mechanism

The protocol nodes follow in order to achieve consensus is depicted in the diagram below.

Note: F refers to the number of nodes that can be faulty (i.e. offline, malicious, or buggy). 2F+1 represents the super-majority of nodes (i.e. at least 66% of them).
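To make the note concrete, here is a small sketch of how F and the super-majority threshold could be computed from the size of the validator list. The function names are hypothetical, and the rounding rule (the ceiling of two thirds of the list) is an assumption chosen to match the "at least 66%" wording; it is not taken from Lition's code.

```python
def max_faulty(n: int) -> int:
    """Largest F such that n >= 3F + 1."""
    return (n - 1) // 3


def super_majority(n: int) -> int:
    """Smallest vote count covering at least two thirds of n validators."""
    return (2 * n + 2) // 3  # integer form of ceil(2n / 3)


for n in (4, 7, 9):
    print(n, max_faulty(n), super_majority(n))
```

When the list size is exactly 3F+1 (for example 4 or 7 validators), this threshold equals 2F+1. For a list of 9 validators it gives a threshold of 6.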

The proposer constructs a block from transactions he has received from his peers and broadcasts the block in a pre-prepare message. In the meantime, other validators are awaiting the proposal. Once they receive a proposed block, they verify the integrity of the proposal (i.e. round number is correct, block builds on top of the previous block, transactions are valid, etc.) and broadcast their acceptance of the block as a prepare message. In the preparing state, each validator waits until he receives prepare messages from at least 66% of peers and subsequently transmits a commit message signalling to his peers he has locked in the block and is ready to insert it in his copy of the blockchain. Once a super-majority of nodes reach this state, each validator finally appends the block to the chain and moves to the next round of consensus, i.e. proposes a block if it is his turn to be the proposer or awaits a proposal otherwise.

To ensure consensus progresses when the node that is supposed to be the proposer crashes or acts maliciously, validators can broadcast a round change message when a round times out or when they detect an invalid block. If a node receives a super-majority of such messages, it understands that its peers came to the same conclusion and moves to the next round.

To ensure immediate finality, a validator that has locked in a certain block will not propose or accept any other blocks. The only way to unlock a block is to transition towards the state: Check Insert Result. This is in fact the main problem of IBFT as we are about to explain.
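The message flow described in this section can be sketched as a small state machine. This is a simplified model written for illustration, not the Quorum or Lition implementation: the class, state, and method names are hypothetical, message validation and timeouts are left out, and a validator's own messages are treated like any other sender's.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class State(Enum):
    AWAITING_PROPOSAL = auto()
    PREPARING = auto()   # proposal accepted, collecting prepare messages
    COMMITTING = auto()  # locked on the block, collecting commit messages
    FINAL = auto()       # block appended to the local chain


@dataclass
class Validator:
    threshold: int  # super-majority size for the current validator list
    round_number: int = 0
    state: State = State.AWAITING_PROPOSAL
    proposal: Optional[str] = None
    locked_block: Optional[str] = None
    prepares: set = field(default_factory=set)
    commits: set = field(default_factory=set)
    round_changes: set = field(default_factory=set)

    def on_pre_prepare(self, block: str) -> None:
        if self.state is not State.AWAITING_PROPOSAL:
            return
        # A locked validator only accepts the block it is locked on.
        if self.locked_block not in (None, block):
            return
        self.proposal = block
        self.state = State.PREPARING

    def on_prepare(self, sender: str) -> None:
        self.prepares.add(sender)
        self._advance()

    def on_commit(self, sender: str) -> None:
        self.commits.add(sender)
        self._advance()

    def on_round_change(self, sender: str) -> None:
        self.round_changes.add(sender)
        if len(self.round_changes) >= self.threshold:
            # Move to the next round. The lock is deliberately kept.
            self.round_number += 1
            self.state = State.AWAITING_PROPOSAL
            self.prepares.clear()
            self.commits.clear()
            self.round_changes.clear()

    def _advance(self) -> None:
        if self.state is State.PREPARING and len(self.prepares) >= self.threshold:
            self.locked_block = self.proposal
            self.state = State.COMMITTING
        if self.state is State.COMMITTING and len(self.commits) >= self.threshold:
            self.state = State.FINAL


v = Validator(threshold=3)
v.on_pre_prepare("B")
for peer in ("v1", "v2", "v3"):
    v.on_prepare(peer)
for peer in ("v1", "v2", "v3"):
    v.on_commit(peer)
print(v.state)
```

Note that in this model a round change resets the collected messages but not the lock, which is the behaviour the rest of this post is concerned with.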

2. Validator list

Every block also contains the proposer’s vote on who to add or remove from the validator list. Each node (even non-validators) keeps track of these votes and when a member gathers a super-majority of votes, nodes update the list by adding or removing it. Validators can be voted out or in for reasons such as going offline or online. Additionally, in Lition they can be added or removed as they register or unregister as validators in the Lition SC.

The voting mechanism also recognizes a voting epoch after which all the votes are reset, basically starting from a clean slate.
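A minimal sketch of this vote tracking is shown below. The class and method names are hypothetical, and details such as resetting votes on every block whose height is a multiple of the epoch length, or counting the vote carried by that block, are assumptions made for the example.

```python
from collections import defaultdict


class ValidatorVotes:
    """Tally add/remove votes carried in blocks (illustrative only)."""

    def __init__(self, validators: set, epoch_length: int):
        self.validators = set(validators)
        self.epoch_length = epoch_length
        self.votes = defaultdict(set)  # (candidate, add?) -> set of voters

    def threshold(self) -> int:
        # Assumed rounding: ceiling of two thirds of the current list.
        return (2 * len(self.validators) + 2) // 3

    def on_block(self, height: int, proposer: str, candidate: str, add: bool) -> None:
        if height % self.epoch_length == 0:
            self.votes.clear()  # new voting epoch: start from a clean slate
        if proposer not in self.validators:
            return
        key = (candidate, add)
        self.votes[key].add(proposer)
        if len(self.votes[key]) >= self.threshold():
            if add:
                self.validators.add(candidate)
            else:
                self.validators.discard(candidate)
            del self.votes[key]


tally = ValidatorVotes({"a", "b", "c", "d"}, epoch_length=10)
tally.on_block(8, "a", "d", add=False)
tally.on_block(9, "b", "d", add=False)
tally.on_block(10, "c", "d", add=False)  # epoch boundary: earlier votes are gone
print(sorted(tally.validators))
```

In this run the two votes cast before the epoch boundary are discarded, so "d" stays in the list unless "a" and "b" vote again in the new epoch.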

What is the problem with all this?

Let us describe the problem inherent to IBFT which was analysed in detail in this paper and also this one. Any consensus algorithm must satisfy two properties: liveness (i.e. the blockchain eventually progresses) and safety (i.e. different validators must have identical blocks at a given blockchain height). The paper we refer to above proves that neither of these properties is guaranteed in IBFT, but the good news is that it also proposes a fix. We will not go into detail about how the safety property is broken because we have never experienced this issue, and it arises only when the network is under a well-orchestrated attack and a certain set of nodes experiences high latency. That being said, we are working on the fix proposed in the aforementioned paper, which solves this issue as well.

We are more interested in the liveness property as this is the crux of the problem the Lition side-chains are facing. Essentially, even with just one faulty (crashing) node the network might reach a state where two disjoint sets of validators lock on two different blocks and neither set is large enough to form a super-majority, leaving the network deadlocked. The IBFT protocol states that a locked-in block can only be unlocked when a super-majority of commit messages is received, but such a majority cannot be reached within either of the validator sets.

For an example of how the network can end up in such a state, consider the following simplified scenario. Assume there are 8 honest validators and another one also honest but faulty (i.e. he will crash at a very crucial stage in the protocol and remain offline). Therefore, the super-majority threshold is 6. Consider the following sequence of events, keeping in mind that these nodes operate in an asynchronous network where messages might not reach all nodes in the same order or with the same latency.

  1. The proposer broadcasts a pre-prepare message with block B.
  2. The other 8 validators reply with a prepare message for block B.
  3. A set V of 3 validators receive a majority of prepare messages and lock on block B.
  4. Due to high latency, the remaining 6 validators (let’s call their set W) do not receive the messages fast enough and the round times out, forcing each validator in W to broadcast round-change messages.
  5. Furthermore, as only the 3 validators in V are locked on a block, there will never be enough commit messages to establish a super-majority. Thus, the validators in V also time out and move to the round-change state.
  6. As all validators receive at least 6 round-change messages, all nodes move to the next consensus round.
  7. Let’s assume one of the 6 validators from W proposes a new block B’.
  8. As all validators included in set V are locked on B, they reply with a round-change message as in their view only B is a valid block proposal.
  9. At the same time, the validators in set W reply by broadcasting prepare messages for B’. Furthermore, let us assume that one of the validators (i.e. the faulty one) crashes immediately after broadcasting the prepare messages.
  10. The validators in set W receive all the prepare messages and lock on B’.
  11. As there are now only 5 validators that can emit commit messages, a super majority cannot be reached and all nodes move to the next round.
  12. Any new proposals will fail, as neither set of validators holds a super-majority and each set only proposes or accepts the block it is locked on. At the same time, the faulty validator cannot be removed from the list because in order for votes to occur, blocks must be appended to the chain.

Under such a scenario, all validators are stuck in a loop where they continuously ask for round changes and consensus will never produce a new block.

Essentially, issues similar to the example presented here can appear whenever some minority of validators lock on a block while the remaining validators lock on another block while also losing the super-majority.
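The end state of the nine-validator scenario can be checked with a few lines of Python. The validator names are made up, and the check models only the lock state after step 10, not the full message exchange.

```python
from collections import Counter

THRESHOLD = 6  # super-majority for 9 validators, as in the scenario

# Lock state after step 10: set V is locked on B, set W is locked on B'.
locks = {f"v{i}": "B" for i in range(3)}
locks.update({f"w{i}": "B'" for i in range(6)})
online = set(locks) - {"w5"}  # the faulty validator crashed in step 9


def can_commit(block: str) -> bool:
    """A block needs THRESHOLD online validators that are locked on it."""
    willing = [v for v in online if locks[v] == block]
    return len(willing) >= THRESHOLD


support = Counter(locks[v] for v in online)
print(support["B"], support["B'"])
print(any(can_commit(b) for b in ("B", "B'", "B''")))
```

With 3 validators locked on B and 5 online validators locked on B', no block, including any fresh proposal B'', can gather the 6 commits it needs.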

Lition Mainnet 2.0

As previously mentioned, a number of factors related to the custom implementation of our use case were drastically increasing the likelihood of this scenario occurring. Therefore, we also spent time over the last months working on fixing the behavior of this part of the code to ensure the implementation was robust and stable.

Now we’ll discuss the reasons we identified as the cause for our blockchain failing, and how we fixed them.

Validator Lists Synchronisation Fix

We realized that some circumstances were causing the list of validators at the geth level to get out of sync with the list of validators at the side-chain level. For example, if a validator V1 is removed from the Lition SC, nodes are instructed to vote V1 out and no longer accept messages from them. As proposers cast votes to remove that validator, the voting epoch might expire and the votes are reset. However, the nodes that already voted do not vote again in the new voting epoch, so the votes for removing V1 might not reach super-majority. As V1 remains in the validator list held in the side-chain the super-majority threshold does not change. However, his messages are rejected at the geth level so V1 can be considered offline.

To solve this problem, we simply changed the implementation of the nodes to keep voting on the membership of the validator list until it is in sync with the one from the Lition SC. Only after this synchronisation is achieved does the node start rejecting or accepting messages from the removed or added validator. This means that a small delay is introduced when starting or finishing mining.
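The idea behind this fix can be sketched as follows. Both function names are hypothetical and the validator lists are modelled as plain sets; the actual change lives in the node implementation and is not reproduced here.

```python
def pending_votes(sc_validators: set, local_validators: set) -> list:
    """Votes still needed to bring the local list in line with the Lition SC.

    Recomputed for every proposal, so a vote lost to an epoch reset is
    simply cast again until the two lists match.
    """
    to_add = [(v, True) for v in sorted(sc_validators - local_validators)]
    to_remove = [(v, False) for v in sorted(local_validators - sc_validators)]
    return to_add + to_remove


def accepts_messages_from(sender: str, local_validators: set) -> bool:
    """Filter on the consensus list, which changes only once a vote passes."""
    return sender in local_validators


sc_list = {"V2", "V3", "V4"}            # V1 has unregistered in the SC
local_list = {"V1", "V2", "V3", "V4"}   # the vote has not passed yet
print(pending_votes(sc_list, local_list))
print(accepts_messages_from("V1", local_list))
```

Until the removal vote passes, V1 stays in the local list and its messages are still accepted, so the threshold and the message filter never disagree.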

Overcoming IBFT Limits

Another issue we faced is related to the nature of our use case. As not all nodes are run by our internal servers, we cannot control when external nodes stop mining for whatever reason. Indeed, we must rely on the rest of the nodes to quickly vote them out as validators. However, if many validators leave at once there might not be enough remaining validators to guarantee a super-majority of votes.

For these reasons, we decided to re-create an optimal scenario and run more internal nodes controlled by us such that we can always vote validators out when they go offline.

However, this proved to work against us in some aspects as the internal nodes had much better connectivity between them than with the external nodes. This made the rare scenario described above even more likely to appear. During periods of slow network propagation the external nodes would time out while our internal nodes would not experience any issues and possibly lock on a block.

At this point it became clear that even if the consensus issues in IBFT were fixed, the risk of a stall due to a third of the validators suddenly going offline would still exist. To solve this design problem our strategy consisted of two main measures.

  • On one hand we developed a mechanism for un-stalling a sidechain in the case that more than a third of validators go offline at the same time. The mechanism we worked on, which is now in testing, will allow nodes to clean the validator lists at both the geth level and side-chain level, thus allowing for a super-majority to be reached. To keep a decentralized spirit, this special behavior will be triggered through the SC when nodes vote for this measure to be activated. Nodes that do not vote on this issue within a certain time-frame are considered offline and they will not be counted when computing the super-majority threshold.
  • On the other hand, during the upcoming first stage of our rollout, we will build a solid network of trusted nodes that keeps latency optimal, while we also optimize the consensus process. By building an initial robust ecosystem, the network will become more and more resilient against any kind of external attacks.
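The effect of the un-stalling measure on the threshold can be sketched as below. The function name is hypothetical, and the text does not specify how the votes are collected or how long the time-frame is, so the example only models the final step: dropping the validators that did not vote and recomputing the super-majority (again assuming the ceiling of two thirds).

```python
def active_threshold(validators: set, voted_to_unstall: set) -> tuple:
    """Treat validators that did not vote in time as offline, then
    recompute the super-majority over the ones that remain."""
    active = validators & voted_to_unstall
    return active, (2 * len(active) + 2) // 3


validators = {f"n{i}" for i in range(9)}   # old threshold: 6
responsive = {f"n{i}" for i in range(5)}   # 4 validators went offline at once
active, threshold = active_threshold(validators, responsive)
print(len(active), threshold)
```

With only 5 of 9 validators online the original threshold of 6 is unreachable; after the list is cleaned, the 5 remaining validators need 4 votes and the chain can progress again.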

Other Custom Code Fixes

We also did other refactorings and various fixes that attempted to synchronise the validator lists between the SC, geth, and side-chain. They proved effective in stabilizing the overall network, but did not overcome the main IBFT consensus issue or the challenges posed by the more inclusive nature of our network.

We decided to include a temporary feature to let nodes vote out validators that seemed to be disconnected for some minutes. In order to do this, we modified the sidechain’s internal smart contract (NetworkManagerContract) to include the timestamp at which each node was last seen.

This “keep alive” functionality lets the nodes detect when anyone goes offline.

We then modified the behaviour of every node at the network service level so that it announces that it is online by sending a free transaction. This function also monitors all other nodes and votes out the first one that appears to be offline but is not yet included in the node’s candidates list.

This will be moved to the Geth level in the next iteration.
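A rough model of the keep-alive check is sketched below. The class is a plain Python stand-in for the last-seen timestamps kept in the NetworkManagerContract; its name, its methods, and the 300-second timeout are all hypothetical, since the text only says "some minutes".

```python
KEEP_ALIVE_TIMEOUT = 300  # seconds; assumed value for illustration


class LastSeenRegistry:
    """Stand-in for the per-node last-seen timestamps."""

    def __init__(self):
        self.last_seen = {}

    def ping(self, node: str, now: float) -> None:
        """Record that a node announced itself (the free transaction)."""
        self.last_seen[node] = now

    def first_offline(self, now: float, candidates: set):
        """Return the first stale node that is not yet a removal candidate."""
        for node in sorted(self.last_seen):
            stale = now - self.last_seen[node] > KEEP_ALIVE_TIMEOUT
            if stale and node not in candidates:
                return node
        return None


registry = LastSeenRegistry()
registry.ping("n1", now=0)
registry.ping("n2", now=0)
registry.ping("n2", now=400)  # n2 keeps announcing itself, n1 went silent
print(registry.first_offline(now=500, candidates=set()))
```

Passing the current time in explicitly keeps the example deterministic; a node would use the block timestamp or its own clock.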

IBFT, IBFT 2.0 & Tendermint

Initially our implementation of the consensus mechanism was mostly inspired by Quorum. It seemed that an update to their base code addressing the IBFT issues was imminent. However, as time passed and the consensus issues raised in the cited papers went unaddressed, we decided it would be best to move our custom code towards a different IBFT implementation where some of these issues are already addressed.

In anticipation of this possibility, we had already completed a rigorous analysis of different blockchains that also use IBFT as their consensus mechanism. At the same time, we also decoupled Lition’s custom code in order to do the migration.

We eventually decided it would be best to shift towards another implementation of IBFT, also known as IBFT 2.0. This is currently implemented in Hyperledger Besu, with the help of Roberto Saltini, the author of the paper that identified the issues in the first place and proposed fixes for them. If you wish to understand the updated protocol we recommend consulting his paper.

Essentially, the change removes the locking mechanism and replaces it with a different mechanism for ensuring that once a node is committed to a block, that block will be selected. There are multiple options for implementing such a mechanism, each with their own advantages and drawbacks.

The simplest approach is inspired by Tendermint. In this approach the update consists of two changes. Firstly, the round number of the last locked block is added in the pre-prepare message if there is a locked block. Secondly, a validator that is locked on a certain block can unlock and disregard the block if it receives a super-majority of prepare messages with a round number r higher than the current round number and a pre-prepare message also for round r that contains a locked block.
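The unlock condition can be sketched as a single predicate. This is a literal reading of the two changes as stated in this post, not the Hyperledger Besu or Tendermint code: the type, field, and function names are hypothetical, and message signatures and block validation are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrePrepare:
    round_number: int
    block: str
    locked_round: Optional[int]  # round of the proposer's last locked block, if any


def may_unlock(current_round: int, proposal: PrePrepare,
               prepare_rounds: list, threshold: int) -> bool:
    """True if a locked validator may drop its lock under the rule above.

    prepare_rounds holds the round number of each prepare message received.
    """
    r = proposal.round_number
    if r <= current_round or proposal.locked_round is None:
        return False
    return sum(1 for pr in prepare_rounds if pr == r) >= threshold


proposal = PrePrepare(round_number=3, block="B'", locked_round=1)
print(may_unlock(current_round=2, proposal=proposal,
                 prepare_rounds=[3] * 6, threshold=6))
```

The point of the rule is that evidence from a later round, backed by a super-majority, is allowed to override an older lock, which is exactly what the original protocol could not do.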

Such a change requires considerable testing to verify that no other obscure issues arise. Our goal is to release this major update after that extensive testing is completed.

Wrapping Up

Despite the unique challenges that we’ve faced, we’re still on track and approaching a very exciting stage for the project. We continue our work confidently, knowing that with every line of code we come a bit closer to our mission of bringing much-needed enterprise blockchain solutions to the world.

There is a lot of excitement ahead and we can’t wait to share in it with you. On behalf of the entire team, we sincerely appreciate all of your support.

One Last Thing

All the above has been keeping us really busy, but there’s more.

We’re happy to share that we will be opening up node registration to select members of our community next week!

If you would like to participate in the network and be part of the journey very close to us, keep an eye on our official Telegram or Twitter so you don’t miss the announcement!
