BandChain Laozi Mainnet Upgrade — Post-Mortem

Sorawit Suriyakarn
Band Protocol
Jul 14, 2021

TL;DR: Our oracle remains active and secure, no partners were affected, and no funds were affected. The Mainnet launch of BandChain Phase 2 will have a revised timeline as the team eliminates the issue moving forward.

Summary

On Tuesday, July 13, at 11:02 UTC, the Band Protocol team, along with the public validators running BandChain, halted the chain and began preparing for the Laozi Mainnet upgrade.

When the new Laozi chain was due to start at 17:02 UTC on July 13, several validator nodes failed to connect to their peers. As a result, the new chain did not have enough voting power online to keep producing blocks.

After discussion with the validators, the core development team decided to retry the migration process using a smaller set of peers. Still, certain nodes ran into a similar issue as before, along with excessively high memory usage on some of the validator node instances.

After attempting multiple potential solutions (detailed below) that were discussed with the validators, the core development team decided that continuing the migration process would be imprudent. Given the unexpected errors that occurred, the team reached a soft consensus with the other validators to roll back to the GuanYu version. This decision was made to ensure the safety and continued operation of our network and of our users' funds.

Throughout the entire process, our oracle remained active and secure, no partners were affected, and no funds were affected. BandChain and our oracle continue to operate normally, and we will announce a new date for the Phase 2 upgrade.

Detailed Timeline

July 13th, 2021:

  • 11:02 UTC: BandChain GuanYu network halted at block 7486289 to begin the process of exporting the current state to the new genesis file for the Laozi upgrade.
  • 13:18 UTC: The Laozi genesis file was agreed upon by the majority of validators. Laozi nodes were getting ready to produce the first block at genesis time.
  • 17:02 UTC: We reached the genesis time. Each node tried to process the genesis file and sign the first block. The first block was produced within 1 minute of the genesis time, but the network could not reach consensus in the next round. Less than 2/3 of the voting power had come online.
  • 17:21 UTC: The core Band development team spoke with the validators and found that many of the nodes had run into memory issues caused by the large number of peer connections and the size of the blockchain state.
  • 17:42 UTC: We opened a soft-consensus vote with the validators on whether to roll back to GuanYu immediately or to try relaunching Laozi with fresh peers and a fresh address book.
  • 17:49 UTC: The validators agreed to try relaunching the Laozi network with different blockchain configurations.
  • 18:00 UTC: The Laozi relaunch genesis file was confirmed.
  • 18:20 UTC: Some nodes were still experiencing memory issues. Validators attempted to add swap space, which may help prevent out-of-memory failures.
  • 19:04 UTC: Two hours past the intended genesis time, the core team and the validators agreed that the risk of upgrading to Laozi at this point was too high. The core team and validators proceeded with the rollback plan as stated in the upgrade proposal.
  • 19:58 UTC: GuanYu Mainnet resumed and the rollback was successful.
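
The stall at 17:02 UTC follows from the quorum rule used by Tendermint-style consensus: a block can only be committed when validators holding strictly more than 2/3 of the total voting power take part. The sketch below illustrates that rule only. The validator names and power values are hypothetical and do not reflect the actual BandChain validator set.

```python
def has_quorum(online_power, total_power):
    """Return True if online voting power is strictly more than 2/3 of the total.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if total_power <= 0:
        raise ValueError("total_power must be positive")
    return 3 * online_power > 2 * total_power


# Hypothetical validator set: name -> (voting power, is the node online?)
validators = {
    "val-a": (40, True),
    "val-b": (25, True),
    "val-c": (20, False),  # e.g. could not connect to peers
    "val-d": (15, False),  # e.g. ran out of memory
}

total = sum(power for power, _ in validators.values())
online = sum(power for power, up in validators.values() if up)

print(f"online voting power: {online}/{total}")
print(f"quorum reached: {has_quorum(online, total)}")
```

In this made-up set, the two online validators hold most of the voting power, yet the chain still cannot make progress because their combined share does not exceed 2/3.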

Course of Action and Next Steps

While the operation and security of the Band Protocol oracle feeds were not affected, the migration issue revealed several previously unidentified oversights in testing, coordination, and some internal processes. The team acted quickly with our validators to determine the best course of action: a rollback to the GuanYu version.

Throughout the process, two main issues were identified:

  1. Differences between the testnet environment and the production Mainnet
  2. Consequently, unexpected requirements for the upgrade process

While extensive testing was carried out ahead of the migration through two Laozi testnets, neither sufficiently simulated the Mainnet environment. Moving forward, we will perform a fully simulated export and migration with all validators, from the full Mainnet state to a temporary Laozi main network. This will allow us to address and resolve the issues faced during the migration process, as well as to catch any other issues that might be present.

An equally important point is to ensure that all parties involved are aware of the specific requirements and caveats associated with the upgrade. To that end, the core team will be revisiting and testing every component involved in the migration to ensure that every validator has a complete understanding of each process and its requirements. This includes further inspecting the effect of the GuanYu Mainnet state size and the Stargate implementation details on the hardware requirements of the nodes, as well as any other impacts they may have.
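
As one illustration of what checking a hardware requirement ahead of time could look like, the sketch below parses Linux `/proc/meminfo`-style text and compares available memory plus free swap against a required amount. It is a rough pre-flight check under stated assumptions, not part of any BandChain tooling: the 16 GiB requirement and the sample values are hypothetical, and swap is far slower than RAM, so counting it as headroom is optimistic.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_headroom_kib(fields):
    """Available RAM plus free swap, in KiB (missing fields count as 0)."""
    return fields.get("MemAvailable", 0) + fields.get("SwapFree", 0)


def meets_requirement(fields, required_kib):
    """Return True if the headroom covers the required amount."""
    return memory_headroom_kib(fields) >= required_kib


# Hypothetical sample; on a Linux node, read the real /proc/meminfo instead.
SAMPLE = """MemTotal:       16384000 kB
MemAvailable:    4096000 kB
SwapTotal:       8192000 kB
SwapFree:        8192000 kB
"""

REQUIRED_KIB = 16 * 1024 * 1024  # hypothetical requirement of 16 GiB

fields = parse_meminfo(SAMPLE)
print(f"headroom: {memory_headroom_kib(fields)} KiB")
print(f"meets requirement: {meets_requirement(fields, REQUIRED_KIB)}")
```

A check like this only covers one dimension of the requirements. It says nothing about how memory usage grows with state size or peer count, which is what the simulated migration is meant to reveal.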

Closing

To close off, we would like to reassure all parties that no funds were affected, no one using our oracle data was affected, and our oracle has remained active and secure through the migration and rollback process.

Our team is investigating the issues closely and following the next steps outlined above so that Band Protocol emerges stronger and more resilient.

Going forward, we will revise the timeline for our chain and share more details with the community in the coming weeks.
