The second upgrade of Regen’s public testnet, codenamed “Amazonas”, happened on Thursday, Sep 26, around 13:00 UTC. It went flawlessly, and within two and a half minutes the testnet was producing blocks again with the new binary. A big thanks to all the validators who prepared for this and were online for the switch (and those who were offline, watching cosmosd do the work for them). It was a great upgrade experience all around, and for those interested in the details, we dig into them in the article below.
This “Amazonas” upgrade builds on the previous “El Choco” upgrade, and especially on the lessons learned from the difficulties we hit there. We fixed a number of items, made the module more robust, and provided a better automation method, which we detailed in the Amazonas upgrade guide. One of the biggest failure modes in “El Choco” was people upgrading one block early; to prevent that, the Amazonas binary refuses to run if launched too early.
We created a watcher daemon called cosmosd that can manage multiple versions of the xrnd (or gaiad, etc.) binary and switch it out automatically when the upgrade time comes. We published a pre-built v0.5.0 xrnd image on GitHub, as well as cosmosd, and many validators downloaded them ahead of time (with many more building from source). A number also set up cosmosd with both the v0.4.1 and v0.5.0 binaries beforehand, letting it run in v0.4.1 mode for days before the actual upgrade.
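To make the switching idea concrete, here is a minimal simulation of the behaviour described above. It is only a sketch and not the actual cosmosd code: the two binaries are stand-in Python functions, the exception names are hypothetical, and the only real value is the upgrade height from the proposal.

```python
UPGRADE_HEIGHT = 1722050  # last block the old binary may process (from the proposal)


class UpgradeNeeded(Exception):
    """Raised by the old binary when asked to go past the upgrade height."""


class TooEarly(Exception):
    """Raised by the new binary when started at or before the upgrade height."""


def old_binary(height):
    # Hypothetical stand-in for xrnd v0.4.1: it halts instead of forking.
    if height > UPGRADE_HEIGHT:
        raise UpgradeNeeded(height)
    return "v0.4.1"


def new_binary(height):
    # Hypothetical stand-in for xrnd v0.5.0: it refuses to run too early.
    if height <= UPGRADE_HEIGHT:
        raise TooEarly(height)
    return "v0.5.0"


def run_watcher(first_height, last_height):
    """Process blocks in order, swapping binaries once when the old one halts."""
    current = old_binary
    processed = []
    for height in range(first_height, last_height + 1):
        try:
            version = current(height)
        except UpgradeNeeded:
            current = new_binary       # the automatic switch
            version = current(height)  # retry the same block with the new binary
        processed.append((height, version))
    return processed


if __name__ == "__main__":
    for height, version in run_watcher(1722049, 1722052):
        print(height, version)
```

The point of the design is that neither binary can process a block it was not built for, so the watcher only has to react to a halt rather than guess the right moment itself.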
The day before, Regen Network CTO Aaron Craelius was busy checking which validators were active and had pledged to be online at the upgrade time, as there was worry about low turnout, which would freeze the network. A few redelegations to some of the committed validators also made the network more robust against low turnout (an issue much more common in testnets than on a real mainnet).
The days before were spent getting everything in order to avoid any last-minute issues, and it seems the lessons learned paid off.
We set the upgrade height in the proposal to 1722050, which is the last block valid with the El Choco binary. This was calculated to land close to 13:00 UTC on Thursday, Sep. 26, 2019. A few hours earlier, we started checking block heights and giving updates on the DVD channel, so people could be ready if the upgrade came an hour earlier or later.
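For anyone planning a similar height-based upgrade, the estimate can be sketched as below. This is an illustration rather than the calculation we actually ran: the function name and the starting observation are hypothetical, while the 5.6 second average block time and the target time come from this upgrade.

```python
from datetime import datetime, timezone


def estimate_upgrade_height(current_height, current_time, target_time, avg_block_seconds):
    """Estimate the block height the chain will reach at target_time."""
    remaining = (target_time - current_time).total_seconds()
    if remaining < 0:
        raise ValueError("target_time is in the past")
    # Block times drift, so this is only an estimate; round to the nearest block.
    return current_height + round(remaining / avg_block_seconds)


if __name__ == "__main__":
    # Hypothetical observation: height 1700000 seen at 02:42 UTC the day before.
    observed_at = datetime(2019, 9, 25, 2, 42, 0, tzinfo=timezone.utc)
    target = datetime(2019, 9, 26, 13, 0, 0, tzinfo=timezone.utc)
    print(estimate_upgrade_height(1700000, observed_at, target, 5.6))
```

Because the real block time varies, it is worth re-running the estimate as the date approaches and telling validators how far the expected time has drifted.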
It turns out the calculations worked out and the network was remarkably consistent over the last week. Block 1722051 was minted by Tendermint at 13:00:42 UTC, and the El Choco release refused to process it. At that point, people calmly (over the next minute or two) either manually switched out the binaries or let cosmosd do it for them. In less than two and a half minutes, we had 72% of the voting power online with Amazonas; these validators had all successfully processed block 1722051 and were able to mint block 1722052 at 13:03:09 UTC. That is 2 minutes, 27 seconds between blocks, certainly more than the average of 5.6 seconds, but quite fast for coordinating a breaking change that even included a simple migration step.
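The numbers above are easy to check. The short sketch below recomputes the gap between the two block timestamps and tests the online voting power against the more-than-two-thirds threshold that Tendermint needs to commit a block. The helper names are made up for this example.

```python
from datetime import datetime, timezone
from fractions import Fraction

# Timestamps reported above (UTC, Sep 26, 2019).
BLOCK_1722051 = datetime(2019, 9, 26, 13, 0, 42, tzinfo=timezone.utc)
BLOCK_1722052 = datetime(2019, 9, 26, 13, 3, 9, tzinfo=timezone.utc)


def gap_seconds(earlier, later):
    """Whole seconds elapsed between two block timestamps."""
    return int((later - earlier).total_seconds())


def can_commit(online_power):
    """Tendermint commits a block only with strictly more than 2/3 of voting power."""
    return online_power > Fraction(2, 3)


if __name__ == "__main__":
    minutes, seconds = divmod(gap_seconds(BLOCK_1722051, BLOCK_1722052), 60)
    print(f"{minutes} min {seconds} s between blocks")
    print("72% can commit:", can_commit(Fraction(72, 100)))
```

This is why the redelegations before the upgrade mattered: with 72% online there was only a modest margin over the two-thirds threshold.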
A few brave people trusted cosmosd fully and were even away from the keyboard. Easy2Stake took a coffee break then, as I (jokingly) suggested in my upgrade guide. Alex Novacovschi was out of the office and just wrote in from his smartphone to ask someone to check whether his validator was still online after the upgrade height. Both validators made a fully automated switch and didn’t miss a single block. A great feature if the next upgrade happens to fall at 4am in your timezone.
We were also happy that the anti-fork protection code, together with using block height instead of block time, had the desired effect: no one processed a block with the wrong version and ended up with a bad AppHash (which required lots of manual recovery the first time). The only issue was that a few people were offline at the time, but within an hour or so all validators were upgraded and signing blocks, which was far higher responsiveness than we expected given chat participation leading up to the upgrade.
The main learning here is that robust software along with clear communication is essential to provide an easy upgrade experience. Ideally the software will refuse to do the wrong thing, and even try to do the right thing for you, failing if it hits any issues.
Providing pre-built binaries and clear documentation on the build and deployment process a few days in advance, along with engaging with the validators in the communication channels, also helped everyone be prepared and calm when the upgrade time came. Of course, this being the second time also meant we were all better prepared and less nervous.
There were still a few minor issues, like using $HOME in the service script and trouble building with go 1.13 (we failed to document that it must be 1.12 and not newer), but most of these were resolved before the upgrade, and the rest shortly after. This just highlights the utility of crystal-clear upgrade instructions and samples.
But most of all, this showed us that a blockchain-enforced height for binary switching, along with automation software to assist in the operations-related tasks, provides a very smooth approach to running upgrades. This was by far the easiest and quickest upgrade based on a modern cosmos-sdk, and the approach is designed to be accessible for any other zone built on the cosmos-sdk. (Irisnet has also done amazing work and demonstrated a zero downtime upgrade on their testnet, but this only works on their fork of the cosmos-sdk and requires some deep changes to both code and engineering practices.)
We were also happy that the new cosmosd daemon had no errors in its first public run. It shows that a design-first and test-first approach can also be applied to DevOps. Even with extensive unit tests and a dry run on our nodes, it was very nice to see it work “in the wild”.
The Road Ahead
Having now performed two public testnet upgrades with this module, and having refined it to the point where it demonstrates its robustness, we are entering final review and discussions with the cosmos core team to merge this functionality into master. We hope this makes it into the next v0.38.0 release, so it can be available for all cosmos zones, and eventually the cosmos hub as well. (You need to have the upgrade module enabled in the pre-upgrade binary for this to work, so rollout on mainnets may take some months.)
The only open feature we discussed was how to handle aborting the upgrade if the migration step fails. Currently, we call an arbitrary migration handler at the beginning of the first post-upgrade block with the new binary. But watching other events shows us we need a backup plan to roll back if this doesn’t work (rather than just getting stuck). There is a design document in progress, but the general idea is that the old binary can be configured to treat this upgrade as a “no op” and carry on in the old mode, either by making an alternate binary (a bugfix release off of the pre-upgrade binary with a no-op handler registered) or by adding a command-line flag to the old binary to skip an upgrade. If the new binary fails to start, nothing will be written to disk, and all validators can just revert to an alternate binary or the original binary with a flag, and then carry on as if nothing happened, with no snapshots needed.
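Since the design document is still in progress, the following is only a rough sketch of the idea and not the final design. The plan name "amazonas", the function names, and the dictionary-based state are all hypothetical. It models the two escape hatches (a registered no-op handler, or a skip flag) and the property that a failed migration leaves the stored state untouched.

```python
class UpgradeRequired(Exception):
    """The binary has no handler for the scheduled upgrade and must halt."""


def first_post_upgrade_block(plan_name, handlers, state, skip_upgrade=False):
    """Return the state to continue with at the first post-upgrade block."""
    if skip_upgrade:
        # Option 2: old binary started with a (hypothetical) skip flag.
        return state
    handler = handlers.get(plan_name)
    if handler is None:
        # An unmodified old binary halts here rather than risking a fork.
        raise UpgradeRequired(plan_name)
    # Run the migration on a copy, so a failure writes nothing back.
    return handler(dict(state))


def migrate(state):
    # Example migration shipped in the new binary.
    state["version"] = 2
    return state


def noop(state):
    # Option 1: bugfix release of the old binary with a no-op handler.
    return state


def broken(state):
    # A migration that fails halfway through.
    state["version"] = 99
    raise RuntimeError("migration failed")


if __name__ == "__main__":
    stored = {"version": 1}
    print(first_post_upgrade_block("amazonas", {"amazonas": migrate}, stored))
    print(first_post_upgrade_block("amazonas", {"amazonas": noop}, stored))
    print(first_post_upgrade_block("amazonas", {}, stored, skip_upgrade=True))
```

In this model, a validator whose migration raised an error still holds the original state and can restart with either the no-op handler or the skip flag.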
We look forward to finalizing this functionality in a way that supports all chains in the cosmos ecosystem, and invite anyone interested to join us in our developer and validator group, or join the regen testnet to see this in practice.