Dev Update #116-#117

THORChain Dev Update for Weeks 116–117, 08 Nov — 21 Nov; Consensus Failure Details, Back Pay Processed, MCCN Updates, Community Updates

THORChain
Nov 23, 2021

Summary

The network halted due to a consensus failure and is now running again. Backpay was processed and the churn was delayed. This update also covers MCCN releases and community updates.

Consensus Halt

On the 13th of November, THORChain entered a consensus halt: nodes did not agree on the state of the chain in a given block. This may be the first time a 2/3 consensus failure has affected many or all validators on a Cosmos BFT chain.

The root cause was that nodes hit an exception in a line of code that returned an error inside an iteration over a map, and so produced different states. The error occurred because a pool had been purged while a node was still holding it, and that node then tried to leave. The fix was to replace “return” with “continue” and to use a slice instead of a map: https://gitlab.com/thorchain/thornode/-/merge_requests/1995

All uses of maps have since been removed: https://gitlab.com/thorchain/thornode/-/merge_requests/1997
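
The class of bug described above can be illustrated in Go, where the iteration order over a map is not specified, so an early return inside a map loop can leave different nodes with different partial results. The following is a minimal sketch only: `Pool`, `handle`, and `ProcessPools` are hypothetical names for illustration and are not taken from the thornode codebase (see the merge request above for the actual change).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Pool is a hypothetical stand-in for a pool record.
type Pool struct {
	Asset  string
	Purged bool
}

var errPurged = errors.New("pool was purged")

// handle is a hypothetical per-pool step that fails for purged pools.
func handle(p Pool) error {
	if p.Purged {
		return errPurged
	}
	return nil
}

// ProcessPools visits pools from a slice in a fixed (sorted) order and
// skips failing pools with continue, so every node that runs it on the
// same input ends up with the same result.
func ProcessPools(pools []Pool) (processed, skipped []string) {
	ordered := make([]Pool, len(pools))
	copy(ordered, pools)
	sort.Slice(ordered, func(i, j int) bool { return ordered[i].Asset < ordered[j].Asset })

	for _, p := range ordered {
		if err := handle(p); err != nil {
			skipped = append(skipped, p.Asset)
			continue // a return here would abandon the remaining pools
		}
		processed = append(processed, p.Asset)
	}
	return processed, skipped
}

func main() {
	pools := []Pool{
		{Asset: "ETH.ETH"},
		{Asset: "BTC.BTC"},
		{Asset: "BNB.OLD", Purged: true}, // hypothetical purged pool
	}
	processed, skipped := ProcessPools(pools)
	fmt.Println("processed:", processed)
	fmt.Println("skipped:", skipped)
}
```

Sorting a slice before iterating trades a little extra work for an order that is identical on every node, which is the property a replicated state machine needs.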

However, an update could not be implemented while the chain was halted. On the 14th of November, three options were created to restart the chain and resolve the consensus failure.

1) Rollback: in recent weeks, the Cosmos/Tendermint dev teams have been working on the ability to roll back a single block to resolve consensus failures like this one. Unfortunately, this feature is only halfway implemented and not yet fully functional. THORChain core developers had been working on the other half, as it would be a good tool not only for THORChain but for the larger Cosmos community. Core developers had been working with Cosmos developers to finish this work, but it has not been fruitful so far.

2) Resync: each node resets its THORNode data directory and resyncs just the chain data. Some nodes have their own snapshots to restore from, some may restore from a downloadable snapshot, and others can resync from scratch. This has been experimented with and found to be a successful means of recovering the chain.

3) Soft Fork: the old chain effectively dies, and critical data (but not all data) is exported in full and imported into a new chain with the same validator set (and TSS keys). This is the most “nuclear” option.

While option one was preferred because it offered less risk and could be quicker, after many hours of trying, the THORChain core developers and the Cosmos developer team were unable to get the new rollback feature functional. The effort was time-boxed so as not to spend too long spinning wheels on this approach.

THORChain core developers agreed that the safest and most efficient way to get the chain back up was a resync (option 2). Each node would pull down the patched code (v0.74.0), resync, and start producing blocks again. The NineRealms team put together a careful plan for node operators: https://www.notion.so/THORChain-Recovery-8c241fd9524f486698083e3f81866ce4.

On the 16th of November, the NineRealms team delivered the plan, node operators followed it, and the resync started.

Option 2 revealed two additional complications.

  1. 2/3 of updated nodes need to be selected to commit the next block, and validators are selected randomly from the entire validator set. If one of the randomly selected nodes has not been patched and re-synced, the commit attempt fails. Nothing could be done except wait for node operators, who are independent and geographically distributed, to update.
  2. Once the nodes were re-synced, there were outstanding voting rounds to complete. Each round per block gets its own consensus. 1056 consensus rounds of voting needed to occur for each node when it joined, and the time for each round increased as more nodes were re-synced. The process worked on MockNet (testing with 4 nodes), which did not have a consensus failure, but conditions are very different on Chaosnet, which did have a consensus failure and has 38–40 nodes updating at different times.
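
To give a feel for the first complication: Tendermint-style BFT consensus commits a block only when validators holding more than 2/3 of the voting power sign it. The sketch below computes how many validators that is, assuming every validator has equal voting power (an assumption made here for illustration). `QuorumSize` is a hypothetical helper, not a thornode or Tendermint function.

```go
package main

import "fmt"

// QuorumSize returns the smallest number of validators whose combined
// voting power strictly exceeds 2/3 of the total, assuming every
// validator has equal voting power (an illustrative assumption).
func QuorumSize(validators int) int {
	if validators <= 0 {
		return 0
	}
	// Integer division floors 2n/3; adding one makes it strictly greater.
	return validators*2/3 + 1
}

func main() {
	for _, n := range []int{4, 38, 40} {
		fmt.Printf("%d validators: %d must sign\n", n, QuorumSize(n))
	}
}
```

Under that assumption, a 4-node MockNet needs 3 signers, while a network of 38 to 40 nodes needs 26 to 27 patched and re-synced nodes before a block can commit.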

On the 18th of November, in coordination with the Tendermint team, the core team were able to figure out how to speed things up. The chain restarted on the 18th of November.

On the 19th of November, all nodes caught up to the tip of the external chains and normal trading could resume.

A full post-mortem of the incident is available here: https://gitlab.com/thorchain/thornode/-/issues/1169

NineRealms and Core Developers apologised to Node Operators for rushing the restart. There was a lot to juggle, but all are persevering to deliver. Slash Points and Jail are annoying, but mean little in a stable, churning network. Slashed Bonds are serious, and devs will look into all of them.

NineRealms and the core team greatly appreciate the assistance given by the Cosmos and Tendermint development teams, as well as the community as a whole.

Churn, Shard, and Backpay

A churn was planned, the first since the 13th of July, that would allow back pay for LPs for the last halt and allow churning to return to normal.

The first churn is expected to be uneventful; the following churn is where some magic happens. This is where (for the first time in THORChain history) the network will have more than one Asgard (TSS) vault. This ability was added to MCCN as a mechanism to remove TSS as a limiter/bottleneck on the number of nodes the network can support. This helps more capital bond into the network, creating more security, and pushes the network further towards decentralisation.

Note: some node operators have been slashed in the last few months. A subsequent form will be provided for these NOs to get a refund and help devs debug the slash event.

This is a major milestone in the effort taken by the community in response to incidents during this past summer. The network is stronger and more resilient, and ready for the next stage of growth.

As a result of the consensus halt, churn and shard have been delayed until the 29th of November, but Backpay was processed as below.

Chain Halt Refunds

* Guided: https://pastebin.com/aXJSUtAj

* Processed: https://pastebin.com/pGrShUSC

* Pool donations: https://pastebin.com/RfTgdLZe

Node Slash Event Reports

* Please fill this form: https://forms.gle/QJeBsAxwXwPLzMG89

Due to the delayed churn, all standby operators will get 50% of the reward of an active node to stay on standby until the next churn.

MCCN Updates

THORNODE UPDATE 0.73.0

1) [BUG] Update token_list.json with the RUNE ERC20 token address on testnet. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1987
2) [BUG] ETH chain transaction ran out of gas, causing the outbound not to be sent out. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1986
3) [ADD] Rename lite nodes to vault nodes. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1984
4) [BUG] ILP calculation should take into account the block height at which the pool became available. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1983
5) [BUG] ILP calculation bug with synths. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1980
6) [ADD] Prevent the network from removing chains. PR: https://gitlab.com/thorchain/thornode/-/merge_requests/1978

release: https://gitlab.com/thorchain/thornode/-/tags/v0.73.0

THORNODE UPDATE 0.74.0

1) [bugfix] maps are evil. https://gitlab.com/thorchain/thornode/-/merge_requests/1995

release: https://gitlab.com/thorchain/thornode/-/tags/v0.74.0

THORNODE UPDATE 0.74.1

1) [ADD] Bypass double vote block slash logic. https://gitlab.com/thorchain/thornode/-/merge_requests/1996

release: https://gitlab.com/thorchain/thornode/-/tags/v0.74.1

THORNODE Disk Size Increase

Node operators: at the moment the thornode volume capacity is 300G and the chain data is at 240G, so 80% of the disk is filled. It is time to size up the thornode disk so it has more room to grow.
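
The arithmetic behind that recommendation can be sketched as follows. `UsagePercent` and `MinCapacityGB` are hypothetical helpers, and the 60% target in the example is an arbitrary illustration, not an official recommendation.

```go
package main

import "fmt"

// UsagePercent returns how full a volume is, as a whole percentage
// (rounded down). Sizes are in gigabytes.
func UsagePercent(usedGB, capacityGB int) int {
	if capacityGB <= 0 {
		return 0
	}
	return usedGB * 100 / capacityGB
}

// MinCapacityGB returns the smallest capacity (rounded up) that keeps
// the current data at or below targetPercent of the volume.
func MinCapacityGB(usedGB, targetPercent int) int {
	if targetPercent <= 0 {
		return 0
	}
	return (usedGB*100 + targetPercent - 1) / targetPercent
}

func main() {
	fmt.Println("current usage %:", UsagePercent(240, 300))
	// 60 is a hypothetical target fill level chosen for illustration.
	fmt.Println("capacity for 60% fill (GB):", MinCapacityGB(240, 60))
}
```

With 240G of chain data, a 300G volume is 80% full, and keeping usage at a hypothetical 60% would need a volume of at least 400G.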

Community Updates

THORNoob

Flipside

New Analytics Dashboard produced by Flipside!

THORChain Monitoring bot

■ A new type of alert that signals when THORChain stops producing new blocks
■ A new type of notification in case the block production rate is significantly lower or higher than the standard (WIP)
■ THORChain Block time estimation + chart
■ Small improvements for the deployment process
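
The first three items come down to simple checks on block heights and timestamps. The sketch below is one way such checks could look; it is not the monitoring bot's actual implementation, and `IsStalled`, `AvgBlockSeconds`, and the thresholds used are all hypothetical.

```go
package main

import "fmt"

// IsStalled reports whether more than maxGapSeconds have passed since
// the last block was seen. Times are Unix timestamps in seconds.
func IsStalled(lastBlockUnix, nowUnix, maxGapSeconds int64) bool {
	return nowUnix-lastBlockUnix > maxGapSeconds
}

// AvgBlockSeconds estimates the mean block time from two samples of
// (height, Unix time). It returns false if the height did not advance.
func AvgBlockSeconds(h1, t1, h2, t2 int64) (float64, bool) {
	if h2 <= h1 {
		return 0, false
	}
	return float64(t2-t1) / float64(h2-h1), true
}

func main() {
	// Hypothetical samples: 10 blocks produced in 60 seconds.
	avg, ok := AvgBlockSeconds(100, 1000, 110, 1060)
	fmt.Println("average block seconds:", avg, ok)

	// Hypothetical threshold: alert after 60 seconds without a block.
	fmt.Println("stalled:", IsStalled(1060, 1200, 60))
}
```

A rate alert could then compare the estimated block time against an expected value and fire when the ratio leaves a chosen band.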

THORGuards Chrome Extension

Created a Chrome browser extension that displays the rarity rank of THORGuards NFTs directly on the OpenSea website.

Download/code/instructions:
https://github.com/tirinox/thorguards-rarity-chrome-ext

block42 Dev Report

Brokkr — preparing for Synths Activation!

- Add a Testnet Rune Faucet to the Brokkr App on Testnet.
- Add trading limitations (consider Funds Cap & Synth Cap).
- Add parent chain labels to the assets.
- Fix ERC20 dynamic decimals.
- Minor fixes on the UI.
- Minor fixes on the XDEFI integration.

DevOps Weekly Update (11/1–11/22)

cluster-launcher

- Update digital ocean kubernetes cluster slug
- Update dependencies on Digital Ocean
- Update EKS module
- Update dependencies on AWS
- Update dependencies on Linode
- Hetzner bare metal [WIP]

THORmon

DevOps
— Ingress config changes
— Fullnode management

Frontend
— Integration new Testnet
— Chaosnet halt and restart

Backend
— Integration new Testnet
— Chaosnet halt and restart

Full list of Community Projects at https://docs.thorchain.org/ecosystem. Reach out if you want to get on the list.

Bridge status:

Want to see bridges built quicker? Get involved!

How to bridge to THORChain? This is a serious undertaking; a dev should be sponsored for 6–12 months:

  1. Read https://gitlab.com/thorchain/thornode/-/blob/develop/docs/newchain.md and https://docs.thorchain.org/chain-clients/overview
  2. Implement the Chain Client https://gitlab.com/thorchain/thornode/-/tree/develop/bifrost/pkg/chainclients
  3. Add to Node Launcher https://gitlab.com/thorchain/devops/node-launcher
  4. Add to XChainJs https://github.com/xchainjs/xchainjs-lib
  5. Launch on Mocknet — demo to the community
  6. Launch on Testnet, stabilise. It must run successfully for a few weeks with no issues.
  7. Launch on Mainnet, stabilise
  8. Maintain the chain client, be on deck for hard forks, client updates and more.

Next Milestones

  • Churn and Asgard Sharding
  • Activation of Synths
  • ETH v3 Router Upgrade
  • THORNames
  • Vault Nodes (formerly Lite Nodes)
  • New Bridges

Community

To keep up to date, please monitor community channels, particularly Telegram and Twitter.
