Blockchain Tales: Our Mainnet Stalled; This is How We Rescued It
Early in the morning (CEST) on Monday, October 19, 2020, we detected an issue on Nodle’s main network. All our validators appeared to be offline and block production had stopped. This seemed to be caused by a general issue with our infrastructure hosted on Google Cloud. After we restarted the validators, block production still did not resume, and we encountered consensus errors. This story covers what happened and how we solved it.
First, There is a Consensus
A typical production chain built with Parity Technologies’ Substrate framework produces and finalizes blocks by combining two consensus algorithms: BABE and GRANDPA. BABE produces blocks, and GRANDPA finalizes them. In our case, block production had halted, so the issue was related to BABE.
For the uninitiated, BABE produces blocks in epochs. It expects at least one block to be produced during every epoch, though usually many more are produced. These epochs depend on a universal parameter we all know and use: time.
In our case, since all our validators had been stopped for too long, we had skipped one or more epochs, and thus had broken the assumptions of the consensus algorithm.
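To make the time dependence concrete, here is a minimal Python sketch of how a wall-clock timestamp maps to a slot and an epoch, and how a long outage can leave whole epochs without a block. It is an illustration only, not Substrate code: the function names, the 6-second slot duration, and the 600-slot epoch length are assumptions chosen for the example, not Nodle's actual parameters.

```python
# Illustrative sketch only. Names and parameter values are assumptions
# made for this example; they are not taken from Substrate or Nodle.

SLOT_DURATION_MS = 6_000   # assumed: one slot every 6 seconds
EPOCH_LENGTH_SLOTS = 600   # assumed: 600 slots per epoch (one hour)


def slot_at(unix_ms: int, slot_duration_ms: int = SLOT_DURATION_MS) -> int:
    """Slot number that a wall-clock timestamp (in milliseconds) falls into."""
    return unix_ms // slot_duration_ms


def epoch_index(slot: int, genesis_slot: int,
                epoch_length: int = EPOCH_LENGTH_SLOTS) -> int:
    """Index of the epoch containing `slot`, counted from the genesis slot."""
    return (slot - genesis_slot) // epoch_length


def epochs_skipped(last_block_slot: int, now_slot: int, genesis_slot: int,
                   epoch_length: int = EPOCH_LENGTH_SLOTS) -> int:
    """Number of whole epochs between the last block and now with no block."""
    gap = (epoch_index(now_slot, genesis_slot, epoch_length)
           - epoch_index(last_block_slot, genesis_slot, epoch_length))
    # The epoch of the last block and the current epoch are not counted:
    # only the epochs strictly in between are guaranteed to be empty.
    return max(0, gap - 1)
```

With these assumed values an epoch lasts one hour. If the last block landed 30 minutes after genesis (slot 300) and the validators only come back at the three-hour mark (slot 1800), `epochs_skipped(300, 1800, 0)` reports two empty epochs, which is the kind of gap described above.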
Then, From the Ashes, the Chain Returned
We found one prior occurrence of a similar issue, on the Kusama network. It happened in January 2020 and was detailed in this blog post by Gavin Wood. The option Parity had chosen was to revert the chain by a few blocks and create a sort of ‘time machine’ on the servers. Unfortunately, we were unable to work out exactly what this required or how best to do it.
Another solution could have been to revert the chain by a few blocks and set all the validators’ clocks back to make them think they were running in the past. This was possible in our case since the Nodle Chain still runs as a ‘Proof of Authority’ network whose nodes we control. However, we deemed this option impractical, as we still did not understand everything it required. Moreover, we were not confident that the chain would restart even after these efforts.
After talking to a few people from Parity and other teams from the Substrate Builder Program, we came up with an alternative and simpler plan. We would fork the network. This involved the following:
- We would write some runtime migration code to make sure that any scheduled item was updated to the correct block numbers (for instance, we had to recompute some vesting grant schedules).
- We would duplicate the state of the main network and create a new chain spec for the new network.
- We would test the new chain spec to make sure its state was identical to the stalled network’s state.
- We would issue an upgrade to the Nodle Node software to include the chain spec for the new network.
- We would stop our validators and restart them on the forked network.
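The first step above, rebasing scheduled items, can be sketched as follows. This Python sketch is illustrative only: the real migration was runtime code, and the `ScheduledItem` type, the `rebase_schedule` function, and the rule itself (the forked chain restarts counting from block 0, so each pending block number is shifted down by the height at which the old chain stalled) are assumptions made for the example.

```python
# Illustrative sketch only. The types, names, and the rebasing rule are
# assumptions for this example, not the actual Nodle runtime migration.
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ScheduledItem:
    name: str      # hypothetical label, e.g. a vesting grant identifier
    at_block: int  # block number at which the item is due


def rebase_schedule(items: List[ScheduledItem],
                    fork_height: int) -> List[ScheduledItem]:
    """Shift pending items so they fire after the same number of blocks
    on a chain that restarts at block 0 (assumed behaviour of the fork)."""
    rebased = []
    for item in items:
        if item.at_block <= fork_height:
            # Already due on the old chain: nothing left to schedule here.
            continue
        rebased.append(replace(item, at_block=item.at_block - fork_height))
    return rebased
```

For example, with the old chain stalled at block 1000, an item due at block 1500 would be rescheduled to block 500 on the new chain, while an item that was due at block 100 is dropped from the pending list. Multi-period schedules such as vesting grants need more care than this single-block model shows.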
Finally—Phoenix is (Re)born
This is exactly what we did. We duplicated our chain’s state using a script called fork-off-substrate (which we had to modify slightly). We pushed a few pull requests and Docker containers. We also kept a branch with our working changes for reference.
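Once the state has been duplicated, checking that the copy matches the original can be sketched as a key-by-key comparison of two storage dumps. The Python below is a minimal illustration under stated assumptions: it treats each dump as a plain mapping from hex storage keys to hex values, and the `diff_state` function and its `ignored_prefixes` parameter (for entries expected to differ on a new chain, such as consensus bookkeeping) are hypothetical names, not part of fork-off-substrate.

```python
# Illustrative sketch only. `diff_state` and `ignored_prefixes` are
# hypothetical; dumps are assumed to be {hex_key: hex_value} mappings.
from typing import Dict, Iterable, List


def diff_state(old: Dict[str, str], new: Dict[str, str],
               ignored_prefixes: Iterable[str] = ()) -> List[str]:
    """List the old-chain storage entries that are missing or changed in
    the new chain's state, skipping keys under an ignored prefix."""
    prefixes = tuple(ignored_prefixes)
    problems = []
    for key, value in old.items():
        if key.startswith(prefixes) and prefixes:
            continue  # expected to differ on the forked chain
        if key not in new:
            problems.append(f"missing {key}")
        elif new[key] != value:
            problems.append(f"changed {key}")
    return sorted(problems)
```

An empty result means every non-ignored entry of the old state is present, unchanged, in the new one. The comparison is deliberately one-directional: entries that exist only in the new state (for example, freshly initialized consensus data) are not reported.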
We then turned our validators off and back on with the updated node, while keeping the old chain data on some servers (in case we need to check it again). The network was now running smoothly on top of the old data. We simply had to register the new validators again and we were done.
We decided to name this patched version of our node “Phoenix,” in reference to the process of destroying the previous network and then restarting it from what it had left behind.
What Does This Mean for Nodle Cash Holders?
Since we duplicated the previous chain’s data, all balances and transactions have been preserved. We made sure they are not affected by the changes. If anything, Nodle Cash holders should keep an eye out for updates to the wallet software that make it even more stable by using reinforced nodes.
What Does This Mean for Nodle Chain Node Operators?
Nodle node operators will need to update their nodes to the latest version available on GitHub so that they can synchronize with the new fork. Command-line options have been preserved and do not need to be changed, so this should be a simple update for most operators. One thing to keep in mind is that the network ID has changed, and thus the chain data will be stored in a new subfolder named after the new ID.