Kava 5 Launch Post-Mortem

Kevin Davis
Mar 5, 2021

The Kava 5 launch was rolled back after a bug was discovered by the Kava core developer team shortly after launch. In this post-mortem, I’ll document exactly what happened, explain the root cause, describe the steps we are taking as an organization and community to prevent this in the future, and offer some commentary from my perspective as the Head of Engineering at Kava Labs.

Timeline of Events

Note: Events took place from March 5th to March 6th, 2021. All times are UTC.

  • 13:00 Kava-4 halts on schedule. Migration to kava-5 begins.
  • 15:05 Kava-5 chain launches when a 2/3 majority of staking power comes online.
  • 15:40 While conducting QA tests on the newly launched chain, Kava engineers observe unexpected values in some users’ HARD claim objects.
  • 16:00 While searching for root causes, Kava engineers observe anomalous HARD claim objects for some users.
  • 16:05 After reviewing available options, and confirming all user funds are safe, the decision is made to use the safety committee to shut down the chain.
  • 16:20 Kava-5 chain is halted by a software upgrade proposal from the safety committee.
  • 17:00 Root cause of bug is determined. Decision is made that it is best to recommend a rollback to the kava-4 feature set. Engineers begin formalizing a plan to revert the chain.
  • 18:30 Recommended plan for rollback is publicly communicated. A new version of kava is released with rollback instructions.
  • 06:30 (March 6th) Kava-6 launches.

Root Cause

As part of the developer team’s internal audit of HARD protocol, this PR was introduced to fix a bug in the accounting logic of HARD protocol liquidations. During the code review, it was pointed out that safe subtraction should be used to avoid negative coin amounts, which cause a panic in the cosmos-sdk sdk.Coins object. The fix to that issue was incorrect:

Quite obviously, this subtraction of coin sets results in the disjoint elements of borrowedCoins (the denominations present in borrowedCoins but absent from the set being subtracted) being completely dropped from the calculation. This miscalculation of total borrowed and supplied coins then had the downstream effect of producing inaccurate reward calculations for HARD Protocol claim objects, which was caught immediately after the Kava 5 launch.
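To illustrate the class of error described above, here is a minimal Go sketch. It is not the code from the PR: it uses a plain map as a simplified stand-in for sdk.Coins, the function names lossySafeSub and safeSub are hypothetical, and the denominations and amounts are made up. The lossy version builds its result only from the denominations being subtracted, so every other denomination silently disappears. The second version starts from a copy of the original set.

```go
package main

import "fmt"

// coins maps a denomination to an amount. It is a simplified,
// hypothetical stand-in for sdk.Coins, used only for illustration.
type coins map[string]int64

// lossySafeSub shows the failure mode: the result is built only from the
// denominations in b, so anything in a that b does not mention is dropped.
func lossySafeSub(a, b coins) coins {
	out := coins{}
	for denom, amt := range b {
		if diff := a[denom] - amt; diff > 0 {
			out[denom] = diff // negative or zero results are clamped away
		}
	}
	return out
}

// safeSub starts from a copy of a, so denominations that b does not
// touch carry over unchanged. Results are clamped at zero per denomination.
func safeSub(a, b coins) coins {
	out := coins{}
	for denom, amt := range a {
		out[denom] = amt
	}
	for denom, amt := range b {
		diff := out[denom] - amt
		if diff <= 0 {
			delete(out, denom) // clamp at zero and drop the empty entry
			continue
		}
		out[denom] = diff
	}
	return out
}

func main() {
	borrowed := coins{"bnb": 100, "usdx": 500}
	repaid := coins{"bnb": 30}

	fmt.Println(lossySafeSub(borrowed, repaid)) // map[bnb:70]
	fmt.Println(safeSub(borrowed, repaid))      // map[bnb:70 usdx:500]
}
```

In the lossy output the 500 usdx vanishes from the running total even though nothing was subtracted from it, which is the kind of undercount that would then feed into downstream reward calculations.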

Bug Bounty

All addresses that attempted (successfully or unsuccessfully) to claim HARD rewards on kava-5 will receive a 250 HARD bug bounty.

Analysis

Because Kava has a unique liquidity incentive architecture, it’s worth breaking down what exactly a “HARD claim” is and what it represents. In HARD protocol, users earn claims on HARD tokens proportional to the amount of liquidity they provide. The claim object represents the balance of HARD tokens the user would be entitled to if they claimed with the longest vesting period (1 year). At any time, a user can claim their HARD token balance using a MsgClaimReward transaction. When claiming, the user declares how long the HARD tokens will be vesting, either 1 month or 1 year, and the tokens are transferred from the hard module account to the user’s vesting account. Because all HARD claims are vesting, users cannot immediately transfer the tokens they receive from a MsgClaimReward transaction.

A few observations:

  • No inflationary HARD tokens were created or could have been created as a result of the bug. Users were accumulating anomalous claims on the hard module account balance. Because the hard module account balance is fixed, it would have been impossible to pay these claims out.
  • Users who did successfully claim anomalous HARD token balances could not have spent them. By design, time-locked rewards prevent scenarios where tokens paid out because of a bug in reward calculations can be withdrawn immediately.
  • The safety committee worked as intended. The goal of the safety committee is to give Kava an ‘escape hatch’ for high severity and/or actively exploited bugs that potentially put user funds or the security of the chain at risk. Pretty much the only time this is useful is right after upgrades, or immediately after an exploit is observed. Most importantly, while it provides a method to pause the chain, it does not provide a method to bring it back online. That is up to the validators and is done in coordination with all Kava governance participants. Kava Labs and the greater Kava governance will always be transparent in their decision-making process, and are always open to feedback about how to be better stewards of the platform.
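The first two observations can be made concrete with a small Go sketch. Everything in it is a hypothetical stand-in rather than Kava's actual types: moduleAccount, lockedGrant and payClaim are illustrative names, vesting is simplified to a single unlock time expressed in Unix seconds, and the amounts are arbitrary. It shows that a payout can only move tokens out of a fixed balance, and that whatever is paid out stays unspendable until the lock expires.

```go
package main

import (
	"errors"
	"fmt"
)

// moduleAccount is a hypothetical stand-in for the hard module account,
// which holds the fixed balance that rewards are paid from.
type moduleAccount struct {
	balance int64
}

// lockedGrant is a hypothetical, simplified stand-in for tokens sitting
// in a user's vesting account: everything unlocks at a single time.
type lockedGrant struct {
	amount   int64
	unlockAt int64 // Unix seconds
}

var errExceedsBalance = errors.New("claim exceeds module account balance")

// payClaim transfers amount from the module account into a time-locked
// grant. It never mints: a claim larger than the balance is rejected.
func payClaim(m *moduleAccount, amount, now, vestingSeconds int64) (lockedGrant, error) {
	if amount <= 0 {
		return lockedGrant{}, errors.New("claim amount must be positive")
	}
	if amount > m.balance {
		return lockedGrant{}, errExceedsBalance
	}
	m.balance -= amount
	return lockedGrant{amount: amount, unlockAt: now + vestingSeconds}, nil
}

// spendable reports how much of the grant can be transferred at time at.
func (g lockedGrant) spendable(at int64) int64 {
	if at < g.unlockAt {
		return 0
	}
	return g.amount
}

func main() {
	const month = 30 * 24 * 60 * 60 // one month of vesting, in seconds
	var now int64 = 1_000_000       // arbitrary starting timestamp
	m := &moduleAccount{balance: 1000}

	// An inflated claim larger than the fixed balance cannot be paid.
	_, err := payClaim(m, 5000, now, month)
	fmt.Println(err) // claim exceeds module account balance

	// An inflated claim that does fit is paid, but remains locked.
	g, _ := payClaim(m, 800, now, month)
	fmt.Println(g.spendable(now+3600), m.balance) // 0 200
}
```

In this simplified model, an hour after the claim nothing is spendable, so halting the chain inside the lock window leaves time to correct a bad payout before any of it can move.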

Moving Forward

The first thing that jumps out to me when I see a simple arithmetic error escape code review during an audit is that the audit process was rushed. This was the first time we incorporated an audit from an outside team member (as opposed to an outside auditing firm) into our release cycle, so it seems likely that we under-budgeted time for reviewing the audit fixes that came up. We will mitigate that going forward, both by budgeting more time upfront and by incorporating an audit completion check into the process, where we decide whether the allotted time was insufficient and should be extended. In this case, I think we simply stuck too rigidly to our timelines and should have taken an additional week to address audit reviews and PRs.

The second thing that has come out of this, and previous development cycles, is how crucial it is to remain committed to code freeze deadlines, and how release schedules need to be updated if the core state-machine code becomes “un-frozen” for any reason. In this case, we had an incentivized testnet running concurrently with the audit. An additional incentivized testnet after the audit and fixes were merged would almost certainly have caught this bug, and is something we will incorporate into future release cycles.

Additionally, this launch highlighted the importance of pre-communicated, well-planned launch reboot procedures. While communications remained strong during the downtime, a better pre-planning process could have avoided some of the additional downtime incurred while planning the exact sequence of the relaunch. Going forward, we will have a ‘rollback’ playbook for each launch that we will communicate so that launch members have a routine to follow in the case of launch rollbacks.

Thoughts

Building decentralized finance applications presents many engineering challenges. We are constantly learning from our mistakes, evolving best practices, and looking at the work of industry leaders to see what can improve. We are also a growing team and still learning how to scale our engineering operations to the size of our ambitions. While this launch was ultimately unsuccessful, we have made many improvements to the product development and release cycle that I am extremely proud of and excited to keep building on. This will not deter us: we will learn from our mistakes and remain as hungry as ever to keep building the future of DeFi.

Stay in touch!

Disclaimer: This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making investment decisions.

Kevin Davis

I work on interledger and other cool tech in the blockchain and interoperability space for @kava-labs.