Desmos September upgrade postmortem

Riccardo Montagnin
Desmos Network
Published Sep 29, 2020

Sharing is caring.

It’s for this reason that today we would like to share with everyone what happened during the Desmos September upgrade. In particular, we are going to describe how this on-chain upgrade made our chain unusable and how we plan to fix the problems we are now facing, in the hope that everyone else can learn from our mistakes 🤝.

Upgrade procedure ⏫

The on-chain upgrade was planned to happen on September 25 at 08:00 UTC.

Among the other features, we also included a new module called relationships, which allows users to “follow” each other by creating one-directional links between them. It’s important to mention this because we will refer to it again later.

Once the time arrived, the upgrade procedure started as planned.

The last block produced before the upgrade was block 513,573 (timestamp: 24 Sept at 08:00:09 UTC). The first block produced after the upgrade was block 513,574 (timestamp: 24 Sept at 08:54:10 UTC).

As you can see, the chain was down for 54 minutes and 1 second. This downtime was not planned: it was caused by a bug in the upgrade procedure. Our team immediately identified the bug and pushed a fix, along with a script that validators could run to apply it within a few seconds.
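As a quick sanity check, the downtime figure is simply the difference between the two block timestamps reported above. A minimal Python sketch, using those timestamps as given (the year 2020 is taken from the publication date and is only there so the datetimes are complete):

```python
from datetime import datetime, timezone

# Block timestamps as reported in this postmortem (UTC).
last_block_before = datetime(2020, 9, 24, 8, 0, 9, tzinfo=timezone.utc)   # block 513,573
first_block_after = datetime(2020, 9, 24, 8, 54, 10, tzinfo=timezone.utc)  # block 513,574

# Subtracting two aware datetimes yields a timedelta.
downtime = first_block_after - last_block_before

# Split the total number of seconds into whole minutes and leftover seconds.
minutes, seconds = divmod(int(downtime.total_seconds()), 60)
print(f"Downtime: {minutes} min {seconds} s")  # Downtime: 54 min 1 s
```

The result, 3,241 seconds, matches the 54 minutes and 1 second stated above.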

Once the script was released, all the validators that were coordinating with our team on Discord ran it. This allowed them to fix their nodes and restart the chain properly 🥳.

What happened after ⚠️

Everyone thought that the upgrade had been successful. The chain was producing new blocks properly and no one had reported any issues with it.

Then, one validator tried querying some data from the chain. It was not possible. All queries returned the same error:

invalid request: failed to load state at height X; version does not exist (latest height: X)

We began investigating immediately. Since this was very similar to an error message that the Cosmos SDK returns on other occasions, our team contacted the Cosmos SDK and Tendermint development teams.

We started by reviewing the migration code our team wrote for the September upgrade, but found nothing wrong there. We then gave the Tendermint team access to one of our validator nodes so that they could run a script to inspect the node's local storage for anything wrong.

It turned out that the bug was in the Cosmos SDK itself. Apparently, the x/upgrade module had not been tested to check whether it correctly handles the registration of a new module during an upgrade (read more about this here).

FAQ

What is this causing? 🐛

Since this bug prevents the node databases from loading properly, no query can succeed.

As a result, neither our faucet nor Mooncake works. Also, anyone trying to query the chain or perform a new transaction will see the same error shown above.

The chain is producing blocks, but it is neither writable nor readable.

How do you plan to fix this? 🔨

Currently, the Cosmos and Tendermint teams are working together to find a solution. You can see this from the discussion on the issue our team opened.

Once they find a solution, we will wait for a new Cosmos SDK release that includes the fix. Then, we will release a new Desmos version and plan a new on-chain upgrade to apply it.

How long will it take to solve this? ⏱️

We don't know yet. It all depends on how long it takes the Cosmos team to find a solution to this problem. Once the fix is available in a new Cosmos SDK release, it will probably take us one or two days to release a new Desmos version and create the upgrade proposal.

We hope to solve this by the end of this week, but we cannot guarantee it.

Will this impact the Desmos validators program rewards? 💰

No, this will not impact the scores of the Desmos validators program.

Since this is our fault, we are not going to penalize validators. For this reason, we decided to exclude from our rewards calculation all the blocks that have been, and will be, produced until this issue is solved. All validators will have an equal chance of earning their rewards.

We hope to have shed some light on what is currently going on inside our chain. We are working very hard to find a solution to the problem as soon as we can and make sure everyone can start using Desmos properly again as soon as possible.

We would like to thank once again all the validators who supported us on Discord during the upgrade procedure and helped us identify the bug.

Also, thanks to everyone who tries every day to become a new validator of the chain. We are truly sorry about the current situation, and we will work to make sure you can become part of the chain as well 🙏.

We will update our community through Twitter, Telegram and Discord, so make sure to join those communities if you want to stay updated.

Failure is instructive. The person who really thinks learns quite as much from his failures as from his successes.

— John Dewey

To know more about Desmos and stay updated, please follow:

Telegram | Discord | Twitter | Instagram | Website | GitHub
