Bluzelle Triggers and Helps to Resolve Important Bug in Tendermint Consensus Engine

Bluzelle has built a Decentralized NoSQL database using Cosmos+Tendermint blockchain technology

Neeraj Murarka
The Blueprint by Bluzelle
4 min read · Jul 13, 2020


We launched a validator uptime competition in early June 2020 and completed it at the end of the month. It was a very successful event, with over 220 validators competing for more than three weeks to achieve the best possible uptime and voting participation. We learned a great deal from the experience: as a result of some of the incidents that occurred, we built a new validator explorer and redesigned our security architecture. One of the more serious incidents is discussed here.

In the first week of the competition, the entire blockchain ground to a halt due to an error, described briefly as follows:

The error appeared to be non-recoverable. Attempts to restart validators were unsuccessful. A certain block apparently could not complete the Tendermint consensus process. Validators that had not been manually stopped were stuck, unable to make progress, and restarting a validator produced a runtime error, with the validator process terminating immediately. This was clearly a serious situation.

We were using the latest stable version of Tendermint. As far as we know, not all projects were on this version; some of the larger blockchains (including Cosmos' own HUB-3) were still running an earlier one. We also had a large group of more than 200 nodes testing our network. All in all, there was no clear reason why we specifically triggered this issue.

I reset the chain (with a new genesis) and restarted it, instructing the validators to reconfigure themselves with the new genesis and restart their nodes. Unfortunately, within a few hours, the newly restarted chain also ground to a halt with the same error, albeit at a much earlier block.

At this point, the situation was quite worrisome. I had neither an explanation nor a solution for a problem that kept manifesting itself. Without a solution in hand, I decided to reset and restart the network once again, fully aware that the problem could very likely arise again. I also filed an extremely detailed bug report with the Tendermint team:

The team asked for detailed logs, but by that point it was not possible to provide them, as we had restarted the nodes and any useful logs from the incident had been purged.

Fortunately, the issue did not arise again for about three weeks. Then, on the evening of June 27th, the chain once again ground to a halt with the very same error.

This time, knowing the Tendermint team would need "forensics", I kept my validators running and submitted the output of the /dump_consensus_state RPC endpoint to the Tendermint team.

Looking at the last_commit field made it clear that one of the signatures was for a different block than all the others: proposers were including any signatures they had seen in commits, not just those for the correct block.

The highlighted signature was for another block, with a timestamp a month earlier than the others
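A rough sketch of how such an outlier signature could be spotted programmatically. The structure below is a simplified, hypothetical version of the `last_commit` data returned by /dump_consensus_state (the real payload is much larger and nested), and the validator names and timestamps are illustrative only:

```python
from datetime import datetime

# Hypothetical, simplified shape of the last_commit field from
# Tendermint's /dump_consensus_state output; for illustration only.
last_commit = {
    "signatures": [
        {"validator": "val-1", "timestamp": "2020-06-27T21:04:11+00:00"},
        {"validator": "val-2", "timestamp": "2020-06-27T21:04:12+00:00"},
        {"validator": "val-3", "timestamp": "2020-05-27T20:59:03+00:00"},  # a month off
        {"validator": "val-4", "timestamp": "2020-06-27T21:04:10+00:00"},
    ]
}

def find_outlier_signatures(commit, max_skew_seconds=3600):
    """Flag signatures whose timestamp is far from the median of the commit.

    A signature dated far from the rest very likely belongs to a
    different block, which is exactly what the dump revealed here.
    """
    times = [datetime.fromisoformat(s["timestamp"]) for s in commit["signatures"]]
    median = sorted(times)[len(times) // 2]
    return [
        sig["validator"]
        for sig, t in zip(commit["signatures"], times)
        if abs((t - median).total_seconds()) > max_skew_seconds
    ]

print(find_outlier_signatures(last_commit))  # -> ['val-3']
```

In the real incident this comparison was done by eye, but the principle is the same: one signature's timestamp stood a month apart from the rest of the commit.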

This error was triggered by restarting a testnet without changing the chain ID: correct block proposers would inadvertently include signatures for the wrong block whenever they saw such signatures, and as a result these commits would fail validation. Every subsequently proposed block would then be considered invalid, and the network would halt.
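One way to avoid this trigger is to guarantee a fresh chain ID on every reset. A minimal sketch, assuming a genesis file with the usual `chain_id` and `genesis_time` fields (the base ID and naming scheme here are hypothetical, not Bluzelle's actual convention):

```python
from datetime import datetime, timezone

def reset_genesis(genesis: dict, base_id: str = "bluzelle-testnet") -> dict:
    """Return a copy of a genesis document with a unique chain_id.

    Suffixing the chain_id with a UTC date stamp ensures signatures
    from the old chain can never be mistaken for votes on the new one.
    """
    now = datetime.now(timezone.utc)
    fresh = dict(genesis)
    fresh["chain_id"] = f"{base_id}-{now.strftime('%Y%m%d')}"
    fresh["genesis_time"] = now.isoformat()
    return fresh

old = {"chain_id": "bluzelle-testnet", "genesis_time": "2020-06-01T00:00:00Z"}
new = reset_genesis(old)
print(new["chain_id"])  # e.g. bluzelle-testnet-20200627
```

Any scheme works as long as the new ID never collides with a previous one; a date stamp or incrementing suffix is the simplest.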

This situation nonetheless revealed a vulnerability that an attacker could exploit: a malicious validator could sign incorrect blocks, and the block proposer would blindly include those signatures in the next commit, making the block invalid and halting consensus permanently, since no correct validator would vote for an invalid block.

The Tendermint team walked through this issue in detail during their recent Dev Session. You can find out more from 9:20 onwards.

Thanks to the Tendermint team's quick action, the problem was reproduced and its cause identified within just 48 hours.

Several recommendations were made to prevent the problem from recurring. Furthermore, a solution is being developed both to recover a stalled chain and to prevent a stall like this from occurring in the first place.

Although the network "crash" was a very stressful event and a headache for our validators to deal with, it led to a great outcome. We gained valuable experience and a better understanding of how Tendermint operates. The specific diagnosis and troubleshooting processes with Cosmos+Tendermint were new to me. Working with the Tendermint team to resolve the matter was also a great experience; they offered quick and helpful steps to remedy the situation.

Probably most importantly, I am proud to say that Bluzelle was able to identify and help resolve a serious problem that will hopefully lead to valuable refinements in Cosmos+Tendermint best practices. It has already resulted in a fix, with a new version of Cosmos+Tendermint rolled out several days ago. These results are a big win for the entire ecosystem.
