Kintsugi Runtime Upgrade 1.18.0 Migration Post-Mortem

Dominik Harz
Interlay
Aug 10, 2022

tl;dr

  • The Kintsugi network bridge functionality was halted from August 8, 2022 13:17 UTC to August 9, 2022 08:39 UTC due to a bug in the 1.18.0 runtime upgrade migration code.
  • We identified the root cause of the issue and proposed a fix to restore the Vaults to their original state via a governance proposal.
  • The bridge experienced downtime, but no funds were lost as a result of the faulty storage migration.

Vaults v6 Migration Issue

On July 29th, referendum 52 to remove theft reporting was started. On August 8th, this referendum was approved and enacted.

Because of the removed functionality, several items were no longer required in the vault struct, so these were removed. A storage migration was run to convert the old format to the new one. Unfortunately, this migration contained a bug.

A migration consists of two parts: (1) a definition of the old format, and (2) a conversion function to convert this to the new format. The problem was that the old format was specified incorrectly: it omitted the secure_collateral_threshold field that was added in the previous migration.

This caused the migration function to misinterpret the old data: when it thought it was reading the to_be_issued field, it was actually reading secure_collateral_threshold. All the data read afterwards was therefore incorrect. As a result, various fields of the new version of the vault struct were set incorrectly. In most cases, this resulted in the issued_tokens field being set higher than it should have been.
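The effect of leaving a field out of the old-format definition can be illustrated with a toy model. The sketch below is not the runtime's actual codec or Vault layout: it assumes a simplified positional encoding (a one-byte flag for the optional threshold followed by 8-byte little-endian integers), and the field widths and sample values are made up for the example. What it shares with the real situation is that values are stored back to back without field names, so a decoder that skips one field reads everything after it from the wrong offset.

```python
import struct

def encode_vault(threshold, to_be_issued, issued_tokens, to_be_redeemed):
    """Encode the fields back to back; only values are stored, never field names."""
    if threshold is None:
        option = b"\x00"                                  # flag byte: no custom threshold
    else:
        option = b"\x01" + struct.pack("<Q", threshold)   # flag byte followed by the value
    return option + struct.pack("<QQQ", to_be_issued, issued_tokens, to_be_redeemed)

def decode_correct(data):
    """Old-format definition that includes the optional threshold field."""
    offset = 1
    threshold = None
    if data[0] == 1:
        (threshold,) = struct.unpack_from("<Q", data, offset)
        offset += 8
    to_be_issued, issued_tokens, to_be_redeemed = struct.unpack_from("<QQQ", data, offset)
    return {
        "secure_collateral_threshold": threshold,
        "to_be_issued": to_be_issued,
        "issued_tokens": issued_tokens,
        "to_be_redeemed": to_be_redeemed,
    }

def decode_missing_field(data):
    """Faulty old-format definition: the threshold field is omitted,
    so reading to_be_issued starts where the threshold actually lives."""
    to_be_issued, issued_tokens, to_be_redeemed = struct.unpack_from("<QQQ", data, 0)
    return {
        "to_be_issued": to_be_issued,
        "issued_tokens": issued_tokens,
        "to_be_redeemed": to_be_redeemed,
    }

# A vault without a custom threshold that has issued 5 units (made-up values).
raw = encode_vault(None, 0, 5, 0)
good = decode_correct(raw)
bad = decode_missing_field(raw)
print(good["issued_tokens"], bad["issued_tokens"])
```

In this toy layout the faulty decoder is off by a single byte, so it reports 1280 issued units instead of 5. The exact distortion in the real migration depended on the real field types, but the mechanism is the same.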

Impact

Because the data in some Vaults indicated that they had issued many times more than they could back with their collateral, some Vaults were marked as undercollateralized. In fact, the off-chain worker automatically tried to liquidate these Vaults. Fortunately, the incorrect Vault data was inconsistent with data stored in other pallets, which caused these liquidations to fail. Because they appeared undercollateralized, these Vaults were also unable to receive new issue requests.
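To see why an inflated issued_tokens value makes a Vault look undercollateralized, consider a simplified ratio check. This is a sketch under stated assumptions: the 150% threshold, the function names, and the sample amounts are hypothetical, and the real parachain logic involves exchange rates and several distinct thresholds.

```python
from fractions import Fraction

def collateralization(collateral_value, issued_tokens):
    """Collateral value divided by issued tokens, both expressed in the same unit.
    Returns None when nothing has been issued."""
    if issued_tokens == 0:
        return None
    return Fraction(collateral_value) / Fraction(issued_tokens)

def is_undercollateralized(collateral_value, issued_tokens, threshold):
    """True when the Vault's ratio falls below the given threshold."""
    ratio = collateralization(collateral_value, issued_tokens)
    return ratio is not None and ratio < threshold

HYPOTHETICAL_THRESHOLD = Fraction(3, 2)  # 150%, chosen only for this example

# Same collateral, correct vs. inflated issued_tokens (made-up numbers).
before = is_undercollateralized(10, 5, HYPOTHETICAL_THRESHOLD)     # ratio 2
after = is_undercollateralized(10, 1280, HYPOTHETICAL_THRESHOLD)   # ratio 1/128
print(before, after)
```

The collateral did not change at all; only the denominator did, which is why the Vaults were flagged even though no funds had moved.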

Detection, Mitigation and Fix

Shortly after the runtime upgrade was enacted, one of our Vaults reported an unusually low collateralization rate to the Interlay team via a private message. We immediately started investigating.

Shortly after the issue was discovered, we stopped the Kintsugi oracles¹. This caused the parachain to go into the error state, preventing any new issue or redeem requests.

While waiting for the oracle to time out, we investigated the root cause of the issue. When we found it, we wrote a script to construct a call that would force the storage back into the correct state. The correct state was determined by querying the storage prior to the runtime upgrade and applying the intended migration function.
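The approach of deriving the correct state from a pre-upgrade snapshot can be sketched as follows. This is an illustration only, not the script that was used: the field name theft_only_field, the Vault ids, and the dictionary representation are all hypothetical stand-ins, since the real script worked against on-chain storage.

```python
# Hypothetical name for a field that only theft reporting needed.
REMOVED_FIELDS = {"theft_only_field"}

def intended_migration(old_vault):
    """What the migration was meant to do in this toy model:
    drop the removed fields and keep every other value unchanged."""
    return {name: value for name, value in old_vault.items()
            if name not in REMOVED_FIELDS}

def build_storage_overwrites(pre_upgrade_snapshot):
    """Map each Vault id to the value its storage entry should hold
    after a correct upgrade."""
    return {vault_id: intended_migration(vault)
            for vault_id, vault in pre_upgrade_snapshot.items()}

# Made-up snapshot taken from the state just before the runtime upgrade.
snapshot = {
    "vault-a": {"to_be_issued": 0, "issued_tokens": 5, "theft_only_field": "x"},
    "vault-b": {"to_be_issued": 1, "issued_tokens": 7, "theft_only_field": "y"},
}
overwrites = build_storage_overwrites(snapshot)
print(overwrites)
```

Because the input is the untouched pre-upgrade state, the result does not depend on anything the faulty migration wrote.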

Timeline

  • 7 Aug 2022 ~08:00 UTC: Referendum 52 to remove theft reporting passed governance vote
  • 8 Aug 2022 ~10:00 UTC: Runtime upgrade 1.18.0 was enacted
  • 8 Aug 2022 ~11:00 UTC: Migration from runtime upgrade 1.18.0 was performed, approximately one hour after enactment
  • 8 Aug 2022 ~11:45 UTC: We received a report about low collateralization rate
  • 8 Aug 2022 ~11:50 UTC: We identified a problem with the upgrade shortly thereafter
  • 8 Aug 2022 ~16:00 UTC: We wrote a manual migration and submitted proposal 58
  • 8 Aug 2022 16:30 UTC: The proposal was fast-tracked to referendum 56
  • 8 Aug 2022 ~20:00 UTC: The referendum was executed and the fix was verified
  • 9 Aug 2022 08:39 UTC: The oracle was restarted, after which normal operation resumed

Fix Verification

We verified the fix in the following ways:

  1. Overwriting the Vault storage was first done on the Kintsugi testnet to ensure that the chain would not stop producing blocks or be affected in other ways.
  2. We compared the Kintsugi testnet state against the pre-migration state of Kintsugi with the migration applied as originally intended, checking that each Vault struct matched the desired post-upgrade Vault struct. Code here: https://github.com/interlay/interbtc-api/pull/449/
  3. We also validated some invariants, namely that the total issuance of kBTC matched the total across all Vaults and that each amount matched the expected stake in the rewards pallet.
  4. We verified that encoding and decoding on the parachain would be done correctly.
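The comparison in step 2 and the issuance invariant in step 3 can be sketched roughly like this. It is a simplified illustration rather than the code in the linked pull request: Vaults are modelled as plain dictionaries, and the ids and amounts are invented.

```python
def mismatched_vaults(expected, actual):
    """Return the ids of Vaults whose struct differs from the expected one,
    including Vaults present on only one side."""
    all_ids = set(expected) | set(actual)
    return sorted(vid for vid in all_ids if expected.get(vid) != actual.get(vid))

def issuance_matches(total_issuance, vaults):
    """Invariant: the total issuance equals the sum of issued_tokens over all Vaults."""
    return total_issuance == sum(v["issued_tokens"] for v in vaults.values())

# Made-up data: the expected post-upgrade state vs. a state with one corrupted Vault.
expected_state = {"vault-a": {"issued_tokens": 5}, "vault-b": {"issued_tokens": 7}}
corrupted_state = {"vault-a": {"issued_tokens": 1280}, "vault-b": {"issued_tokens": 7}}

print(mismatched_vaults(expected_state, corrupted_state))
print(issuance_matches(12, expected_state), issuance_matches(12, corrupted_state))
```

The two checks are complementary: the struct comparison pinpoints which Vault is wrong, while the invariant catches inconsistencies even when no expected state is available to compare against.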

After the fix was applied and before resuming oracle operation, we double-checked that it had been applied correctly, using an adapted version of the code listed in point 2 above.

Furthermore, we checked if there were any changes to the vaults that were done in the time between the runtime upgrade and our fix, because our fix would overwrite any changes. Specifically, we verified that no requests affecting the Vault struct were made after the runtime upgrade had been applied.

We found that no issues, redeems, or replaces had happened in that period.

We also adapted the script from step 2 once more, this time to compare the Vaults just after the runtime upgrade with the Vaults just prior to the execution of the fix.

As it turned out, there was one change: Vault a3cS7bP56bj11Yrfxt3TZFGjo96R7eJH6WUNYBxg1dx55jCJm had set a custom secure collateral threshold, which was cleared by the execution of our fix. We let the Vault operator know in the vault-lounge Discord channel, and they confirmed through a private message that they had received the message.
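A field-level diff between two snapshots is enough to find changes like this one. The sketch below assumes Vaults are available as dictionaries keyed by id; the id, field values, and function name are hypothetical and not taken from the actual script.

```python
def changed_fields(after_upgrade, before_fix):
    """Return {vault_id: {field: (old, new)}} for every field whose value
    differs between the two snapshots."""
    changes = {}
    for vault_id in sorted(set(after_upgrade) | set(before_fix)):
        old = after_upgrade.get(vault_id, {})
        new = before_fix.get(vault_id, {})
        diff = {field: (old.get(field), new.get(field))
                for field in sorted(set(old) | set(new))
                if old.get(field) != new.get(field)}
        if diff:
            changes[vault_id] = diff
    return changes

# Made-up snapshots: one Vault set a custom threshold between upgrade and fix.
just_after_upgrade = {
    "vault-a": {"issued_tokens": 1280, "secure_collateral_threshold": None},
    "vault-b": {"issued_tokens": 7, "secure_collateral_threshold": None},
}
just_before_fix = {
    "vault-a": {"issued_tokens": 1280, "secure_collateral_threshold": 260},
    "vault-b": {"issued_tokens": 7, "secure_collateral_threshold": None},
}
changes = changed_fields(just_after_upgrade, just_before_fix)
print(changes)
```

Any entry in the result is a change that a full storage overwrite would silently undo, so each one needs to be reapplied or communicated to the affected Vault.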

Next Steps

Ultimately the consequences of this bug were fortunately limited, but it’s not hard to imagine that things could have gone much worse. We have put in place extra safeguards to prevent cases like this from happening again in the future.

As a matter of course, we now use try-runtime. We had unit tests for the migration, but these tests have to make certain assumptions (in this case, about the format of the old vault struct). When these assumptions do not hold, the tests become useless. The try-runtime functionality allows us to check the migration on actual live data. This particular bug would have been caught if we had compared the result of running the migration on any arbitrary live Vault with what we expected.

Furthermore, we are adding integrity checks that run in try-runtime so that we can catch unexpected errors.

One additional minor mitigation is to append new fields only to the end of the struct rather than inserting them in the middle. This would have reduced the impact of this particular bug.
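A toy example shows why field position matters. It assumes a simplified positional encoding (8-byte little-endian integers plus a one-byte flag for the optional field), which is not the runtime's real codec, and the values are invented. When the newest field sits at the end, a decoder that forgets about it still reads every earlier field from the right offset.

```python
import struct

def encode_with_appended_field(to_be_issued, issued_tokens, to_be_redeemed, threshold):
    """Toy layout where the newest (optional) field is written last."""
    body = struct.pack("<QQQ", to_be_issued, issued_tokens, to_be_redeemed)
    if threshold is None:
        option = b"\x00"                                  # flag byte: no value
    else:
        option = b"\x01" + struct.pack("<Q", threshold)   # flag byte followed by the value
    return body + option

def decode_without_last_field(data):
    """Old-format definition that forgot the appended field.
    The earlier fields still sit at their expected offsets."""
    to_be_issued, issued_tokens, to_be_redeemed = struct.unpack_from("<QQQ", data, 0)
    return {
        "to_be_issued": to_be_issued,
        "issued_tokens": issued_tokens,
        "to_be_redeemed": to_be_redeemed,
    }

raw_appended = encode_with_appended_field(0, 5, 0, 260)
decoded = decode_without_last_field(raw_appended)
print(decoded)
```

The forgotten field is still lost in this scenario, so this limits the damage rather than preventing the bug: the amounts that drive collateralization would have stayed intact.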

Finally, we will be setting up better monitoring to alert us immediately when unexpected events happen, or when the parachain is in an unexpected state.

¹an emergency measure possible on Kintsugi due to currently only 1 oracle being operational (in the future, specifically on Interlay, decentralized oracle operation will make such emergency measures slower and more difficult to coordinate).

About Interlay

We envision a future where blockchains seamlessly connect and interact: anyone can use any digital asset on any blockchain, trustless and without limitations. Interlay works with Bitcoin, Ethereum, Polkadot, Cosmos, and others to expand interoperability, capital efficiency, and openness.

Our flagship product is interBTC — Bitcoin on any blockchain. A 1:1 Bitcoin-backed asset, fully collateralized, interoperable, and censorship-resistant, realizing the true free nature of BTC and decentralized finance.

