Postmortem of Stuck Vault #mca3
A summary of why Vault #mca3 was stuck for 35 hours, how the network recovered, and things to consider for THORNode Operators.
Vault #mca3 became stuck during the last iteration of the funds migration to Vault #x3ph. Approximately $100k in funds were left over after already migrating around $33m.
For 35 hours the network failed to sign any TSS transactions. The team reached out to some node operators for logs, but couldn’t find anything of concern.
The team initially suspected the recent increase in the node cap from 31 to 36 nodes, and released an update (V15.1) that increased the TSS timeout from 20 to 30 seconds. After the update was released, many node operators found they were unable to update because their Binance Chain nodes had halted on a consensus error. These node operators resorted to resetting their Binance Chain nodes.
Once enough Binance Nodes had been reset, the network resumed sending funds. The team’s best assessment is that the network became stuck due to more than 1/3rd of THORNodes having halted Binance Nodes.
THORChain continued to process swaps
THORChain uses “asynchronous liquidity delegation”, where up to 50% of the network’s funds are moved from the main TSS vault (Asgard) into the hot vaults of each THORNode (yggdrasil vaults). This allows THORNodes to sign small outgoing transactions (typically swaps and refunds) almost instantaneously, without requiring resource-intensive and slow TSS. Large transactions that cannot be sent from yggdrasil are instead put in a queue and signed and sent from Asgard.
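The routing described above can be sketched as follows. This is an illustrative sketch only; the threshold, balance figure, and function names are hypothetical assumptions, not THORNode’s actual implementation.

```python
# Illustrative sketch of asynchronous liquidity delegation routing.
# The balance figure and names are hypothetical, not THORNode's code.

YGGDRASIL_BALANCE = 50_000  # hypothetical funds held in one node's hot vault

def route_outbound(amount: int, yggdrasil_balance: int) -> str:
    """Decide whether an outbound tx can be signed instantly from a
    yggdrasil hot vault, or must queue for the slower Asgard TSS vault."""
    if amount <= yggdrasil_balance:
        return "yggdrasil"    # small swap/refund: signed almost instantly
    return "asgard-queue"     # large tx: waits for resource-intensive TSS

assert route_outbound(1_000, YGGDRASIL_BALANCE) == "yggdrasil"
assert route_outbound(500_000, YGGDRASIL_BALANCE) == "asgard-queue"
```

This also shows why the outage drained the network slowly rather than instantly: small outbounds kept flowing from yggdrasil while everything routed to the Asgard queue sat unsigned.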
Throughout the event, swaps continued to be processed from yggdrasil vaults, so the team did not disable frontends or halt swaps, since the network was still available. Some users with large swaps began reporting that their swaps weren’t being fulfilled, and they were asked to wait.
However, since yggdrasil vaults are funded by Asgard, the network slowly began running out of available funds and the outbound queue began to grow.
Considerations For Operators
THORChain has three consensus points:
- Consensus on witness events (loss of consensus means THORChain stops witnessing external events)
- Consensus on state machine block production (loss of consensus means a chain halt)
- Consensus on TSS vault key-signing (loss of consensus means outbound TSS transactions will stop)
If 1/3rd or more of the network goes offline, THORChain loses consensus and halts. In this case, since the Binance Chain nodes of many THORNodes had crashed, the Bifröst module of the stack was failing to witness inbound events, and then failing to sign outbound transactions.
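The 1/3rd threshold is standard Byzantine fault tolerance arithmetic, and a quick check shows how tight the margin was at the node counts mentioned above:

```python
# BFT arithmetic for the consensus thresholds above.
# With n nodes, consensus requires more than 2/3 of nodes participating,
# so the network tolerates at most floor((n - 1) / 3) failed nodes.

def max_tolerated_failures(n: int) -> int:
    return (n - 1) // 3

# At the then-current cap of 36 nodes, 12 halted Binance Chain nodes
# were enough to stall TSS signing:
assert max_tolerated_failures(36) == 11
# At the previous cap of 31 nodes, the margin was barely different:
assert max_tolerated_failures(31) == 10
```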
Binance Chain is a fast chain, producing blocks every 500ms, and uses a lot of custom logic in its state machine. It is also a relatively new chain. For whatever reason, some nodes hit a consensus error at a certain height and could not restart. The Binance Chain team is aware of this issue and advises simply restarting the nodes. There are other mitigations to consider as well, such as increasing resource allocation in case the issue is memory-related.
Considerations For Frontends
The team and community are still establishing best practice for building on THORChain. Since the network is leaderless, there is no central server that can tell frontends when the system is safe and online. Frontends need to work this out themselves, such as by using the Byzantine Module to connect to multiple nodes at the same time in order to determine the correct vault.
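The multi-node approach boils down to a supermajority vote over the answers returned by several nodes. A minimal sketch of that idea, with made-up vault addresses and a 2/3 quorum as illustrative assumptions (this is not the Byzantine Module’s actual API):

```python
# Sketch of determining the correct vault by querying several nodes
# and accepting only a supermajority answer, in the spirit of the
# Byzantine Module. Addresses and the quorum fraction are assumptions.
from collections import Counter
from typing import List, Optional

def majority_vault(answers: List[str],
                   quorum_fraction: float = 2 / 3) -> Optional[str]:
    """Return the vault address reported by a supermajority of nodes,
    or None if the nodes disagree too much to trust any answer."""
    if not answers:
        return None
    vault, count = Counter(answers).most_common(1)[0]
    return vault if count / len(answers) >= quorum_fraction else None

# Three of four nodes agree: safe to use that vault.
assert majority_vault(["bnb1aaa", "bnb1aaa", "bnb1aaa", "bnb1zzz"]) == "bnb1aaa"
# No supermajority: the frontend should refuse to show a deposit address.
assert majority_vault(["bnb1aaa", "bnb1zzz", "bnb1qqq"]) is None
```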
As for availability, it seems for now the best indication of whether the network is having any TSS issues is the outbound transaction queue. If a backlog is appearing, it is an indication that the network is having an issue and users should be aware.
This can be queried from http://18.104.22.168:1317/thorchain/queue, and the suggestion is that any queue length higher than 10 is an indication of an issue. Frontends should advise their users not to swap large amounts, and to be patient, if the queue is growing.
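A minimal health check along these lines might look as follows. The JSON shape of the queue response is an assumption here; check the actual node API before relying on it.

```python
# Minimal queue-health check against a THORNode API. The "outbound"
# field in the JSON response is an assumption, not a documented shape.
import json
from urllib.request import urlopen

QUEUE_URL = "http://18.104.22.168:1317/thorchain/queue"
THRESHOLD = 10  # suggested alert level from this postmortem

def outbound_backlog(url: str = QUEUE_URL) -> int:
    """Fetch the outbound queue length from a node's API."""
    with urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    return int(data.get("outbound", 0))

def network_healthy(backlog: int, threshold: int = THRESHOLD) -> bool:
    """True if the backlog is within the suggested safe range."""
    return backlog <= threshold

assert network_healthy(3)
assert not network_healthy(42)
```

A frontend could poll this and warn users (or hide large-swap UI) whenever `network_healthy` returns False.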
The network recovered through the team releasing a patch and node operators tending to their nodes. The patch itself had no effect; rather, the imperative to check their nodes led operators to find and fix the Binance Chain issue.
The solutions are:
- Better alerts when a Chain Client crashes
- Better auto-management of a crashed Chain Client (auto-restart)
- Higher resources allocated to Binance Chain
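The auto-restart item above amounts to a small supervision policy: restart a dead chain client, but stop and alert if it is crash-looping, since a crash loop (like the Binance Chain consensus error in this incident) needs manual intervention. A hedged sketch of that decision logic, with a made-up restart budget:

```python
# Sketch of crash-handling policy for a chain client. The restart
# budget is a hypothetical value; real deployments would use systemd,
# Docker healthchecks, or Kubernetes restart policies.

def should_restart(is_alive: bool, restarts_last_hour: int,
                   max_restarts: int = 5) -> bool:
    """Restart a dead client, but back off if it keeps crashing.
    A crash loop suggests something like the Binance Chain consensus
    error, which needs a manual reset rather than another restart."""
    return (not is_alive) and restarts_last_hour < max_restarts

assert should_restart(False, 0)           # dead once: restart it
assert not should_restart(True, 0)        # alive: leave it alone
assert not should_restart(False, 5)       # crash loop: alert the operator
```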
The team will work with the community to address all three.
In the future, the primary chains will be ETH and BTC, which are far more stable, produce slower blocks, and are thus easier to sync. This appears to be a problem specific to Binance Chain.
To keep up to date, please monitor community channels, particularly Telegram and Twitter: