The El Choco Experience
Looking at El Choco upgrade: the good, the bad, and the future
The best news is it worked! And we were up and running again in less than 2 hours.
Block 953627 was produced at 15:00:09 UTC on August 6th, 2019, and all validators halted on BeginBlock. This block was never executed with the v0.4.0 software, and no validators started the next consensus round. Around two hours later, over 70% of the validators were properly synced on the v0.4.1 software, and block 953627 was finally executed, opening the door to minting 953628 and beyond. There was no chain restart, and the header AppHash shows that the v0.4.0 software did not execute any block after 953626. Block 953628 was minted almost exactly two hours later, at 17:00:03 UTC.
Many validators found the experience quite painless, far less painful than the manual halt, state-export, and chain-reset procedure. In fact, within 15 minutes we had 54% of the stake prevoting on block 953628 with the properly upgraded v0.4.1 software. Considering it was the middle of August and the event conflicted with another testnet upgrade, leading to low turnout, this was quite an impressive turnaround time.
The bad news is that a number of validators fell out of consensus during the upgrade. They upgraded a few seconds before the others and executed block 953626 with v0.4.1, which led to a different AppHash, causing Tendermint to complain and stop the node. We had not anticipated this, and as we rapidly worked to build recovery tools and sync a “proper” app state, a number of validators looked on in frustration at stuck nodes. The truth is we were not prepared for this case.
In a deeper post-mortem, I discovered that block 953626 was actually timestamped at 15:00:03 UTC. I am not sure how this block got executed, but it was still running an ancient version of the upgrade module (the pre-upgrade code) and may have had a small time-comparison issue. We had not detected this in dry runs and should investigate whether it still exists in the current code. In any case, many validators who punctually switched binaries at exactly 15:00:00 executed block 953626 with v0.4.1, while all those who waited for the blockchain to auto-halt executed that block with v0.4.0. I want to apologize to all of you who made the extra effort to be punctual and ended up with a headache.
If it weren’t for this issue, the upgrade would have been complete with over 70% voting on the upgraded chain within 15 minutes. But it also taught us a lot, so we can make sure this is much more robust in all future versions. And I want to give a big thanks to everyone who stuck around and helped out with tech support as we worked to help those validators recover and get back in sync.
While we had tested the upgrade code and performed a few one- and two-node dry runs, this was the first time we tried this code “in the wild”. It was a bit of a rough experience, but we learned a lot, and are working hard to make sure this all goes better in the future. Besides details like debugging the block-time issue, here are some of the main lessons:
Upgrading early is as dangerous as upgrading late
While the code did contain checks to ensure that we never run the old software after the upgrade plan kicks in, we had no checks to ensure that we never run the new software before the upgrade plan kicks in. This is almost as important, and the future version of the upgrade module will have this check implemented, along with better detection and error messages.
We need better automation tooling
We tried to enable automation hooks by calling out to a shell script in a subprocess. This script was supposed to copy over the new binary and kill the parent, so that systemd would restart it with the new binary. This worked fine on dev machines, but we quickly realized that almost every validator ran xrnd as an unprivileged user while only root could overwrite the xrnd binary, so the subprocess simply lacked the permission to do so (besides the whole approach being error-prone).
We are now designing a more capable watcher daemon, or supervisor, which will launch the xrnd process. This watcher will detect when an upgrade is needed and have write access to update the binaries, while running the xrnd process itself with lower privileges. Not only does this get over the permission hurdle, it lets us standardize how the watcher organizes its subdirectories (just as xrnd manages its data and config directories) and automate the upgrade within this system far more easily.
We need better recovery tooling
We were not ready for nodes falling out of consensus early. It took us a while to detect what the issue was, and far too long to recover from it. We wanted to roll back the app state one block, re-run with the old software until it halted, and then manually upgrade. Since Tendermint will never let a node run more than one block out of consensus, and we keep 20+ blocks of history in the iavl store, we can always go back to the “safety of consensus” and try again. It is a bit tricky to do this while managing the many iavl substores of the RootMultiStore, but the design is straightforward, and I wrote a prototype in about an hour. We will build better tooling and documentation for handling any such case that arises in future upgrades.
Once again, I want to deeply thank everyone who participated in this first real-world upgrade experiment. We have shown that in-place upgrades are a viable approach and potentially easier than the dump-state-and-reset-chain approach. We also discovered that we need much better tooling and checks to make this approach production-ready.
A much-improved upgrade module and associated tooling are a major focus of ours in the next few months, and we hope to merge much of this functionality into cosmos-sdk to share with other chains. Our main internal goal is to have this functionality battle-tested and very robust before our mainnet launch, allowing us to iterate quickly and update the testnet with new functionality needed by the various ecological projects we will be supporting.