Kintsugi Slot Duration Bug (2022–12–01) Post-Mortem
The Technical Rundown
- The 1.20.0 runtime upgrade was enacted on Kintsugi at block 1,983,993 [2022–12–01 09:32:18 (+UTC)] which contained this change
- Collators failed to produce blocks due to this panic: “Timestamp slot must match `CurrentSlot`”
- After the root cause was identified, the team coordinated with Parity on a fix using this migration
- The issue will be fixed via a Kusama governance proposal and a codeSubstitution for Kintsugi collators at block 1,983,993
This document contains a detailed timeline of what happened, a post-mortem for the technical issues encountered, findings of our investigations, and steps to be taken to prevent such an issue from recurring or, if prevention is not realistic, steps to be taken to mitigate any impact.
13/10/21: Kintsugi chain was started with a slot duration of 6s being set in the genesis
- Register: https://kusama.subscan.io/extrinsic/9021523-2
- Leased: https://kusama.subscan.io/extrinsic/9634925-0
11/02/22: Upgrade to polkadot-v0.9.16 failed
- Kintsugi collators running 1.17.1 stopped producing blocks. Not all collators were updated at the same time, so block production was not stopped on Kintsugi
- Error: “rejected: too far in the future”
- Caused by the change in slot duration from 6s to 12s here related to this change which updated the client service to load the slot duration from this runtime constant instead of the database / genesis
- Fix required hard-coding 6s in the collator node client for Kintsugi here
- All Kintsugi collators continue to produce blocks
- Migration was identified by Parity to upgrade runtime at a later date but was not performed due to perceived severity
16/11/22: Security advisory by Parity
- Issue on Aura for Parachains that might affect block production mechanism where a malicious block producer could potentially build a block
- Parity asking all parachains team that use Aura and have decentralized their collators to do a runtime upgrade as soon as possible with these code changes
01/12/22: Kintsugi enactment of runtime 1.20.0 failed that includes the security advisory
- All Kintsugi collators stopped producing blocks
- Error: “Timestamp slot must match `CurrentSlot`”
- Caused by panic in hook introduced here (due to security advisory by Parity)
- Slot duration here was fetched from the timestamp pallet, yielding a 12 second slot duration rather than the 6 seconds used elsewhere.
- Hence, the expected slot was exactly half of the actual slot, resulting in the panic.
- Fix required aforementioned migration to update `CurrentSlot` and update node client for Kintsugi to use 12s slot duration here
02/12/22: Fix tested and Kusama governance proposal
- Built 1.20.0-hotfix to include fix
- Built 1.20.0-hotfix+sub to include code substitution for collators at block 1,983,993
- Preimage submitted
- Whitelisted governance proposal up for voting: https://kusama.polkassembly.io/referenda/8
- Incorrect genesis configuration of Kintsugi to have 6s block-times.
- Failed 1.20.0 runtime upgrade due to replacing the `OnTimestampSet` hook with `Aura` (change made based on security advisory by Parity) which has a failing assertion.
Incorrect genesis configuration of Kintsugi to have 6s block-times
We encountered an error during the upgrade from polkadot-v0.9.15 to polkadot-v0.9.16 in early 2022 which prevented Kintsugi from producing new blocks. At genesis, Kintsugi was misconfigured with a slot duration of 6 seconds instead of the expected 12 seconds for parachains. In the upgrade to Kintsugi 1.1.0 this change was introduced to update the slot duration to 12 seconds which caused block syncing to fail for nodes using 1.7.1 because of this change — previously the 6s variable in the node service was read from genesis.
Fortunately, collators running 1.16.1 were able to continue block production since they did not yet have this change to the node service. We included this fix in the 1.17.4 release to force Kintsugi collators to use the old value of 6s instead. As this did not require a change in the runtime, collators were able to switch to this patch release without governance. This was required since otherwise block production would have failed on Kusama due to other breaking client-side changes.
The issue did not cause a stop of block production of Kintsugi.
Since February 2022, the faulty 6 seconds genesis configuration has been a known issue and was on the list to be fixed via a migration suggested by Parity: https://hackmd.io/@XuVYQ1rUQjGv8uRzzsdzuw/HkduIX4y5
By itself, the incorrect genesis configuration did not cause the stop of block production at block 1,983,993. The missing migration did not cause issues for 14 runtime upgrades on Kintsugi: https://kintsugi.subscan.io/event?module=parachainsystem&event=validationfunctionapplied
Failed 1.20.0 runtime upgrade due to replacing the `OnTimestampSet` hook with `Aura`
The Kintsugi network stopped at block 1,983,993 due to the incorrect genesis configuration and the missing migration which led to a miscalculation of the slot duration when upgrading to release 1.20.0. To be more specific, there is a panic in runtime 1.20.0 when a collator tries to produce a block:
[Parachain] panicked at ‘Timestamp slot must match `CurrentSlot`’
This is triggered when the collator submits an inherent to set the timestamp which calls the hook here. In the Aura implementation this checks the following:
- CurrentSlot == Timestamp / (MinimumPeriod * 2)
Let’s plug-in some values from the last successful Kintsugi block:
- CurrentSlot = 278314523
- Timestamp = 1669887138551
- MinimumPeriod = 6000
Therefore the assertion does not hold and the hook panics:
- 278314523 != 1669887138551 / 12000 (~139157261)
In summary, the Kintsugi team did not run the slot duration migration prior to 1.20.0 as suggested by Parity so the temporary fix remained in place. With the 1.20.0 runtime upgrade, we included a recommended change to prevent malicious block producers from potentially building a block.
Advisory by Parity on Wed, Nov 16 2022
> Hey all (@room)!
> Wanted to bring to your attention an issue on Aura for Parachains that might affect your block production mechanism (a malicious block producer could potentially build a block). The probability of this happening is still low, and it will potentially affect you if: (1) you are using Aura on your parachain; and (2) you don’t control all your collators.
> How to solve this
> You need to do a runtime upgrade on your parachian, to include this commit.
> Please do let me know if you have any doubts! If you could be potentially affected by this, we recommend doing the upgrade as soon as possible.
The recommended change replaced the `OnTimestampSet` hook with `Aura`. Due to the missing migration, a check in the Aura hook correctly panicked due to an incorrect slot duration calculation which prevented Kintsugi from producing blocks.
With the missing migration and using Aura for timestamps, Kintsugi collators were not able to produce new blocks as the relay-chain validation would reject any new blocks with the updated runtime. Before the 1.20.0 runtime upgrade, the Aura current slot would have to be migrated as done in the fix here: https://github.com/interlay/interbtc/pull/792
Unfortunately, this issue was not caught during the release process, code reviews, and testnet upgrades that were done prior to the runtime upgrade on KIntsugi. It also would not have been caught by the try-runtime testing approach as this issue is specific to collator producing blocks and the runtime migrations as such were not an issue. The issue would have been caught by using try-runtime in combination with the `follow-chain` subcommand.
The issue is specific to the Kintsugi configuration and was hence not caught by the testing process on testnet. We are not aware of any other parachains being affected by this.
There are two parts to solve this issue:
- Updating the Kintsugi runtime on Kusama: Collators are unable to produce blocks. Even by using `codeSubstitutes` Kintsugi cannot recover alone due to the bug described above. Hence, a new wasm needs to be set via Kusama governance on the relay-chain.
- Updating all collators on Kintsugi to use the patched wasm: Updating the Kintsugi runtime on Kusama needs to be executed together with providing all collators in the Kintsugi network a patched wasm to be used via the `codeSubstitutes` approach.
Updating the Kintsugi runtime on Kusama
The runtime requires the Aura slot duration migration. In essence, the migration reads the `current_slot` and updates it with the `new_slot` as follows:
- old_slot_duration = 6s
- new_slot_duration = 12s
- timestamp = current_slot * old_slot_duration
- new_slot = timestamp / new_slot_duration
A new wasm runtime was generated from the PR (https://github.com/interlay/interbtc/pull/792) and released as 1.20.0-hotfix: https://github.com/interlay/interbtc/releases/tag/1.20.0-hotfix
The new wasm runtime is submitted as a governance proposal to Kusama to update the current Kintsugi runtime via `paras.forceSetCurrentCode(2092, wasm)`.
Updating all collators on Kintsugi to use the patched wasm
Updating the runtime via Kusama governance itself is not sufficient. Collators still have the current, non-functional wasm and will only automatically update the wasm code when a new block comes in. Hence, to recover the collators, we need to manually set a runtime code for block 1,983,993. This is achieved by adding the block number of the current Kintsugi block with the hex-encoded wasm that was submitted to Kusama governance to the `parachain/res/kintsugi.json` res file that is used for the
` - chain=kintsugi` argument when starting a collator:
The change was made in PR https://github.com/interlay/interbtc/pull/793 and released to collators via https://github.com/interlay/interbtc/releases/tag/1.20.0-hotfix+sub.
Verifying the solution
We took multiple avenues in verifying the solution:
- Code review: The solution in PR https://github.com/interlay/interbtc/pull/792 was reviewed internally and together with the help of Parity.
- Unit test: The PR includes a unit test for the collator migration.
- Manual testing: The most time consuming tests were to get an environment up and running to replicate the initial issue and verify the tests. In the end, we use polkadot-launch to spin up local test environments. While the tool is deprecated, it still worked well with polkadot-0.9.31 and our parachain releases. With polkadot-launch, we spun up a rococo-local chain and verified the fix against a kintsugi-testnet instance that was modified to have 6 second block times. The fix was verified by two different persons working on different computers.
Manual testing steps
Reproducing the breaking issue
- Checkout parachain 1.19.1
- Change all 12000 values to 6000 in https://github.com/interlay/interbtc/blob/63b485b2e8105eeaf20d960ed0a89443d2567d2a/parachain/src/service.rs
- Compile (always with — release — features rococo-native). Call the binary interbtc-1.19.1–6-sec-no-aura
- Launch using polkadot-launch docs/rococo-local-persistent.json (json config here)
- Connect to polkadotjs localhost on port 9988
- Confirm that it’s producing blocks.
- Confirm that aura.currentslot = 278,320,992 (or close to that)
- Enable Aura timestamp hook: replace () by Aura here https://github.com/interlay/interbtc/blob/e06f426dc754b3cfce388884c3c9c8dc82afca7f/parachain/runtime/testnet-kintsugi/src/lib.rs#L302
- Compile. Do runtime upgrade to target/release/wbuild/testnet-kintsugi-runtime-parachain/testnet_kintsugi_runtime_parachain.compact.compressed.wasm
- Observe “Timestamp slot must match CurrentSlot” panic
Reproducing the fix
- Observe the arguments that the parachain node was called with cat /proc/$(pidof interbtc-1.19.1–6-sec-no-aura)/cmdline | xargs -0 echo
- Kill the parachain: killall interbtc-1.19.1–6-sec-no-aura
- Cherry-pick the fix: https://github.com/interlay/interbtc/pull/792
- Modify this line to the current blocknumber + 1: https://github.com/interlay/interbtc/pull/792/files#diff-a9dfd0024688aa1540849f682c4950a284737ab5d36da3bc18e10f584709dcf7R275
- Compile. We’ll call the compiled node interbtc-1.19.1–12-sec
- On the relay chain (port 9944), call paras.set_current_code(2000, fixed_with_migration), where fixed_with_migration is the newly compiled wasm blob.
- Checkout a clean 1.19.1 & compile (such that we have 12s in the service.rs file)
- Generate chainspec: interbtc-parachain build-spec — chain rococo-local-2000 > rococo-local-2000-substituted.json
- Modify rococo-local-2000-substituted.json: change codeSubstitutes to e.g. “28”: “0x123123123….”, where the hex is the wasm blob is the same fixed_with_migration from before. You can get the hexencoding of that blob by running xxd -p ./fixed_with_migration | tr -d ‘\n’. Don’t forget to add the 0x prefix. NOTE: block number must be current highest block number, NOT incremented by 1
- Start interbtc-1.19.1–12-sec using the arguments obtained above, but replace — chain rococo-local-2000-substituted.json
Preventing this in the future
To be more cautious with governance proposals from now on, we will integrate the following checks in our release process:
- Integrate try-runtime into the CI to run automatically for each release
- Use follow-chain to ensure collators still produce blocks after a runtime upgrade
- Whenever useful, check proposals that don’t upgrade the runtime against forks of mainnet by using Chopsticks (this cannot check collator node logic)
The results of the runtime upgrade tests will be shared as part of the governance proposals or discussion post for each runtime upgrade so that they can be independently verified ahead of the referendum enactment.
Separately, Interlay believes that parachains need to be self-sovereign. That means:
- Each parachain should have a way to recover itself from not producing blocks.
- Relay-chain governance should not play a part in parachain recovery.
There is an ongoing discussion post in the Polkadot forum (https://forum.polkadot.network/t/how-to-recover-a-parachain) and some form of technical implementation will be required.