Published in TgradeFinance

Tgrade Testnet-3

What we learned and what comes next

Photo by Xavier von Erlach on Unsplash

The plan for Testnet-3 was to launch in stealth mode in December to make sure everything was running as expected, and then in early January to invite a small number of validators to join the network. This would help us test the code in the wild, gather feedback and fix bugs before making a major testnet announcement. We were very pleased with the response and the support we received from the whole community around Tgrade.

We were overwhelmed with the response from our validator community: almost 400 validators took part. This really helped us figure out issues, clarify how things work with external nodes, and improve the process and tools for running a public blockchain.

The Oversight Community has been busy testing the governance tools: adding and removing members, and punishing (slashing and jailing) validators.

The Trusted Circle Trading challenge had a good response, with people asking to join and busy trading.

We found and fixed some issues with the block explorer, along with some UI issues. The testnet also really helped in tracking down a few other problems, like poor UX when many validators don’t fit in the active set.

We have also been upgrading the network with smart contract deployments and testing the validator voting, which has been working as expected.

We spent much of the last two weeks analysing these bugs and rolling out patches.

Testnet-3 metrics: 988 addresses created, 6992 transactions, average block time 5.6s

Hot patching governance

Through the process of running Testnet-3 we found some bugs and missing features in the governance contracts. Since we implemented Proof of Engagement as a series of smart contracts, we wanted to take advantage of the migration capability to “hot patch” these core components, not only to fix the issues, but to gain experience with this process on a live system.

We had configured the validator voting contract as “admin” of all the core contracts, allowing it to migrate them to new wasm code upon a successful vote. In the last week of January we uploaded a new Community Pool contract that fixed a critical double-spend issue, and a new version of the Oversight Community voting contract that included a new proposal type to “unjail” a validator early. Both proposals passed the validator governance, and upon executing them we were happy to find that the contracts kept working, with the bonus of the improved functionality.

Stop a minute to let this sink in… we were able to roll out fixes to the core on-chain governance in a matter of 2 or 3 days, and without any validator needing to update binaries or run a single script. This is cutting edge stuff and a world first in Cosmos.

Hot Swapping Consensus

There was also a lot of feedback about validators falling out of the active set, and we wanted to improve the handling of large validator sets by raising the limit to 125 and adding a query to quickly show whether your validator is in the active set. After seeing how well this worked for governance, we decided to deploy some changes to the valset contract, the one that updates the validator set at every block.

Some time had passed, and we had accumulated about 4 different PRs on the main branch related to valset: changing the validator set during migration, adding a new proposal to change it anytime, flagging active validators, and adding pagination to validator lists. We had tried to keep these non-breaking and backported them all to a branch that was supposed to be compatible with the running chain. The API looked the same, so we decided to roll this out quickly on the heels of the other successes, just increasing the validator set to 125 to start.

This is when hubris struck… rather than double checking all the diffs or dry running this, we went right to the testnet. The proposal passed, and at 9:24am UTC on February 2nd we executed it. The app looked like it kept working, but rather than 125, the validator list showed 0.

Photo by National Cancer Institute on Unsplash

What did the doctor discover with Testnet-3?

Although the app looked live and we could see all the proposals, engagement points and markets, we were unable to submit a transaction. Then our DevOps guy told us the chain had stopped making blocks and asked if we knew what had happened. It looked like we had bricked the network.

While trying our queries, we found that one failed while parsing OperatorInfo. Digging into the diffs, we found we had indeed added a field to the state that was not in the on-chain data, so parsing failed. That meant any attempt to load the operator (mapping Tendermint keys to tgrade addresses) would fail, which is exactly what happens when calculating the next validator set. Rather than returning an error (which would have crashed a node), this code filters out errors, so it returned an empty validator set.
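The failure mode can be reproduced in a few lines. The following is a minimal Python sketch, not the actual contract code: the record layout, the field name `active_validator` and the helper names are hypothetical, chosen only to illustrate how a newly required field breaks parsing of old records, and how filtering out the errors turns that into an empty set.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class OperatorInfoV2:
    """New-code view of an operator record (field names are hypothetical)."""
    pubkey: str
    active_validator: bool  # added by the new code, absent from old stored data


def parse_operator(raw: str) -> OperatorInfoV2:
    """Strict parse: raises KeyError if a required field is missing."""
    data = json.loads(raw)
    return OperatorInfoV2(
        pubkey=data["pubkey"],
        active_validator=data["active_validator"],
    )


def next_validator_set(stored_records: List[str]) -> List[OperatorInfoV2]:
    """Mimics the problematic pattern: records that fail to parse are dropped."""
    validators = []
    for raw in stored_records:
        try:
            validators.append(parse_operator(raw))
        except KeyError:
            continue  # error is filtered out instead of being reported
    return validators


# State written by the old code: no "active_validator" field.
old_state = ['{"pubkey": "val1"}', '{"pubkey": "val2"}', '{"pubkey": "val3"}']

print(len(next_validator_set(old_state)))  # prints 0: every record fails to parse
```

Both the old and the new code pass their own tests in this picture; the problem only appears when the new parser meets records written by the old one.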

The empty set was passed to Tendermint, which happily accepted it, and all the nodes kept running, but no one was able to make blocks. While this is an interesting state to place a blockchain in, it is not very useful if you want to interact with it.
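One kind of guard against this particular case could look like the sketch below. It is an illustration only, written in Python for brevity; the function name and error type are hypothetical and are not taken from the Tgrade codebase or from Tendermint.

```python
from typing import List


class EmptyValidatorSetError(Exception):
    """Raised when an update would leave the chain with no validators."""


def checked_validator_update(current: List[str], proposed: List[str]) -> List[str]:
    """Return the proposed set, or fail loudly if it is empty.

    Hypothetical guard: an empty set can never produce blocks, so it is
    treated as an error instead of being passed on to consensus.
    """
    if not proposed:
        raise EmptyValidatorSetError(
            f"refusing to replace {len(current)} validators with an empty set"
        )
    return proposed
```

Whether the caller should halt on this error or catch it and keep the previous validator set is a design decision; the point of the sketch is only that an empty result is surfaced rather than silently accepted.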

While there was extensive testing in place around the code, and both the old code and new code worked properly, the new code did not work on the old data. In our excitement to upgrade the testnet, we did not spot the breaking change. The testnet environment is exactly where we should be finding bugs, as it typically runs for a longer time and tends to be patched more than the internal or staging testnets.

We are reviewing our processes to ensure this will never happen on mainnet: how we review and flag API- and state-breaking changes, how we test migrations before rolling them out on a live network, and what tooling we need to recover quickly from such cases before mainnet. Of course, we will also provide some safety measures for this particular case.
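As an illustration of what testing a migration against existing data can mean, here is a minimal Python sketch of a dry run: every record in a snapshot of the old state is fed to the new parser before anything is deployed. The snapshot keys, the field name `active_validator` and all function names are hypothetical, and this is not a description of our actual tooling.

```python
import json
from typing import Callable, Dict, List


def dry_run_migration(snapshot: Dict[str, str],
                      parse_new: Callable[[str], object]) -> List[str]:
    """Return the keys of all stored records the new code cannot parse."""
    failures = []
    for key, raw in snapshot.items():
        try:
            parse_new(raw)
        except (KeyError, ValueError):
            failures.append(key)
    return failures


def parse_strict(raw: str) -> dict:
    """New parser that requires the newly added field."""
    data = json.loads(raw)
    return {"pubkey": data["pubkey"],
            "active_validator": data["active_validator"]}


def parse_with_default(raw: str) -> dict:
    """New parser that falls back to a default when the field is missing."""
    data = json.loads(raw)
    return {"pubkey": data["pubkey"],
            "active_validator": data.get("active_validator", False)}


# Snapshot of state written by the old code (hypothetical keys and layout).
snapshot = {
    "operator/val1": '{"pubkey": "val1"}',
    "operator/val2": '{"pubkey": "val2"}',
}

print(dry_run_migration(snapshot, parse_strict))        # ['operator/val1', 'operator/val2']
print(dry_run_migration(snapshot, parse_with_default))  # []
```

A non-empty list from the dry run flags a state-breaking change before it reaches a live network; giving the new field a default is one way to make the same data parse again.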

What comes next?

Photo by Nick Fewings on Unsplash

We are going to take some time to plan the next testnet, which will be called patchnet.

Since the launch of testnet-3 we have been making some major changes to the codebase, and rather than ship a fixed version of testnet-3 we want to review what has been built, what is being built and what makes sense to release as a testnet. In particular, we want to include an upgrade to Cosmos SDK 0.45 and the latest wasmd to get more experience with those changes in a public environment.

Our initial thoughts are to release patchnet as a semi-public testnet where we invite 10 or so validators to help us run the network. The whole community will be able to try out the application, and use T-Market and Trusted Circles. The Oversight Community can continue to test out the governance process. And we can improve our QA process for rolling out important updates.

patchnet is planned as a late ♥️ Valentine present to the community on 15th February.

Our objectives for patchnet are to continue testing in a realistic environment, practice deploying new releases, and test out the new Tgrade app features.

patchnet will be succeeded by the dryrunnet, and we will be letting everyone know about that once we have firm plans in place.

Martin Worner

Growing Tgrade, a business focussed, public blockchain, which solves real world issues.