Post Mortem of Incident on August 5th, 2022

Andy @ Yield Protocol
Yield Protocol
Published in
10 min readAug 9, 2022

Intro

On August 5th 2022, during deployment of our new pool contracts, a bug was discovered which put ~$206,000 of liquidity provider funds at risk. The issue was discovered immediately by a core developer while going through a post-deploy checklist. A war room was convened, parts of the protocol were paused, and the at-risk funds were secured on behalf of affected users. All funds are now 100% SAFU.

We are fortunate that our incident management system held up, proper procedures were followed, and we were able to quickly resolve the situation. While the amount of funds at risk was small (<2% of protocol TVL) the knowledge we gained was great. We’ve learned valuable lessons to enhance our development process, tighten our security procedures, and level-up our team with live war room experience. The Yield Protocol is now stronger and safer than ever. It is our hope that by sharing this post-mortem, the community can also benefit.

What Happened

YieldSpace markets are a core mechanism of Yield Protocol. They enable borrowing and lending via the trading of fyTokens — tokens that represent the payment of an underlying asset on a future date. YieldSpace markets are designed to manage their funds independent of Yield Protocol governance, and Yield Protocol governance can’t access their funds.

As part of our new Series to be launched on August 5th, we were introducing a new version of YieldSpace which we call YieldSpaceTV (Tokenized Vaults). These new pools are designed to be integrated with yield-bearing vaults (Compound cTokens, Yearn yvTokens, Euler eTokens, etc.) to provide extra returns to liquidity providers. In certain cases, there isn’t a suitable yield-bearing vault to integrate with, and we built a variant of the Pool that is intended to directly use an underlying asset which we call PoolNonTv.

This PoolNonTv version had a very simple bug that escaped internal review, unit tests, extensive fork testing, and even a re-review by the auditors. The bug was accidentally triggered at 13:44 (all times given are UTC) by one of our developers going through a readiness checklist. The changes hadn’t been announced, but they were accessible in the frontend to anyone.

The bug immediately drained all funds from the new December USDC pool into the developer’s address. At 16:08 PM, while QA’ing our new features, it became obvious that something was wrong due to the rates being off. Soon it was found that the pool was empty, and the funds were in the developer account.

At 16:21, a war room was convened.

How Did It Happen

Here we describe the origin and the details of the bug.

Smart Contract Implementation

The YieldSpace-TV pools are expected to have slightly different implementations depending on the underlying Yield-Bearing vault. There is a base Pool.sol configuration that integrates with ERC4626 contracts, but non-4626 compliant tokens such as those of Yearn and Euler use a separate contract which inherits Pool.sol and overrides key functions. There is also a version of the pool that allows the TV pools to work with non-yield-bearing vault tokens.

Like most contracts in the YieldProtocol, the pools work without pulling tokens. Instead they expect tokens to be transferred in beforehand, and track balances to know about the amounts transferred. As usual in this pattern, there are functions to remove any untracked amounts. The functions in YieldSpaceTV are retrieveBase, retrieveShares and retrieveFYToken.

In v2, when someone executes a buyFYToken call, they are expected to transfer base into the pool, buy, and then call retrieveBase to recover any unused base tokens. For backwards compatibility reasons this behavior was preserved in TV.

The new pools are not expected to regularly hold any base tokens. They hold shares and fyToken. To facilitate usage, they accept base tokens that will be immediately converted into shares tokens. Any unused shares will be unwrapped into base. Users are expected to call retrieveBase to get these leftover tokens, but for certain dust amounts, its not worth the gas.

As such, retrieveBase simply gives all base to the caller. For 4626, Yearn and Euler pools this is correct behavior.

Note the comment on the right, experience shows that lines with such comments are more likely to contain a bug.

In order to use the pool contract with non-yield-bearing tokens (such as ordinary DAI), the PoolNonTv implementation uses the same terminology (base and shares), but in actuality the two are the same. When the Pool contract logic calls wrap for example, the overridden function in the PoolNonTv contract basically does nothing. When the _getCurrentSharePrice is called, the PoolNonTv contract returns 1.

The constructor in Pool.sol uses the overridden _getBaseAsset function to find the base contract from shares.

PoolNonTv simply takes base as if it would be shares in the constructor, and returns base when asked to do so. Unfortunately, the PoolNonTv implementation failed to override retrieveBase with a function that behaved like retrieveShares, and so this function allowed anyone to extract all base reserves.

Audit

The codebase was audited by ABDK Consulting, who were thorough as usual. Their audit was on commit a8f47b0 and flagged retrieveBase as containing one of the two critical issues in the code base.

The implementation in the commit that went for audit mistakenly had the same code in retrieveBase as in retrieveShares, which makes the function useless, and in some context would lead to funds being stuck in the contract.

The function was hastily replaced by the version that now we know to be faulty due to the inheritance from PoolNonTv:

The fixes for critical and major issues, including this one, were submitted for re-audit, and no flags were raised.

Trigger

When repaying all debt from a vault before maturity, the best option is to buy fyToken at a discount in a pool to pay the exact debt. It does so with the repayVault function, which at the end calls pool.retrieveBase to recover any leftover tokens. To trigger the payout, Bruce just had to repay all debt from a USDC pool and was surprised to obtain an oversized payout.

Actions Taken

At 16:21, once it was noticed that assets were where they shouldn’t, a War Room was convened, it was recorded for future reference.

The war room began with Alberto and Allan. After a brief discussion we agreed that Alberto would work on isolating the Ladle on Arbitrum to stop most interactions happening.

Allan began assembling the team in the war room. Bruce joined immediately and Andy and Marco shortly thereafter. Richie was not online, and was discovered to not be on our PagerDuty, so Allan emailed him and he joined a bit later. During the war room, we followed the procedures laid out in our documented incident response plan: https://docs.yieldprotocol.com/#/operations/emergency_procedure

Alberto didn’t have the plan hash to isolate the Ladle on Arbitrum, so instead opened the EmergencyBrake on etherscan and got the events. It was expected that the first one would isolate the Ladle because that is the first orchestration, from memory. That isolation plan was executed using Etherscan and a developer account, and Alberto made a note to confirm that the Ladle was actually isolated at 16:30 (8 minutes after the War Room starting). This was not checked until much later.

Once the Ladle was believed to be isolated, Alberto started investigating the issue to determine if other actions were needed, like isolating the Ladle on mainnet. Alberto was hesitant to do so because restoring an isolated contract requires a governance action. On Arbitrum there is zero delay on the Timelock, so that could be done easily, but on mainnet it’s two days and the impact would be greater.

Because of an oversight, not all contracts were verified, which made the investigation more difficult. Regardless, the investigation quickly revealed that any call to retrieveBase in any PoolNonTv contract would lead to the base reserves being immediately drained.

This meant that, in addition to the USDC pool on Arbitrum, the DAI pool on Arbitrum and the FRAX pool on mainnet were at risk, but that the DAI, USDC and ETH pools on mainnet, with over 98% of the protocol’s TVL, were safe. The Frax pool was examined and found not to contain any significant funds.

At that point, the frontend was not showing the December pools anymore, so the risk of the affected Dai pool being inadvertently drained was diminished. But still, anyone could do a simple call on etherscan to trigger it. We quickly assessed that the DAI pool had around $50K TVL. We decided to drain the DAI pool on Arbitrum, and do nothing on mainnet since the FRAX pool had no significant funds.

We didn’t believe we could directly call retrieveBase on Etherscan to drain the DAI pool, because that might be front-run by a generalized bot, but we could repeat the process that led to draining the USDC pool. This process runs in two separate transactions:

  • On the first transaction, a vault is created and assets are borrowed, creating debt.
  • On the second transaction, all the debt is repaid, triggering the withdrawal of funds. Only the owner of the vault created in the first transaction can do this, and we didn’t think that generalized bots would be smart enough to try a set of transactions. We also had the assurance of the first draining event and Bruce successfully obtaining the assets.

A governance proposal was put together to bring the Ladle back, testing it on a Tenderly fork. While that happened, Allan tried to execute the whitehat, to confirm the Ladle was actually isolated. It was.

In trying to retrieve the Dai funds, we struggled a bit getting the signatures in gnosis safe. We then had issues where the frontend didn’t call ladle.repayVault at the first attempt, so Bruce pushed a code change and, after a second attempt, Allan successfully retrieved the Dai funds.

Funds were now safe. Because of the small dollar amounts, we decided to keep the funds in developer wallets to avoid a mistake in the heat of the moment. We spend some additional time determining that other pools were not at risk, and decided to pause any deployment of December pools. We then prepared an incident announcement, had a debrief session and then disbanded the War Room.

Impact and Resolution

The funds are now safe, and the situation can be restored to normal.

The broken pools will be disconnected from the Ladle on arbitrum and mainnet, by executing the corresponding add* functions on the Ladle. The broken pools can still be used if wrapped in a flash loan, so disabling them is probably best. This can be maybe done by calling pool.setFees(10000) which would stop any trading function from returning any assets (and maybe even cause trades to revert).

New Pools and Strategies will be deployed. The pools will be initialized normally, but the strategies will be initialized with the recovered funds. The Timelock shall keep the strategy tokens, which can then be distributed to holders of the older strategy tokens, pro-rata to the new tokens minted to their old holdings.

The distribution can be done using disperse.app, with a simple loop on the Timelock, or with a permissioned Disperse contract that a developer account is authorized to use. The number of holders will determine the approach to take.

Lessons Learnt and Recommended Changes

Release Process

  • Deployment of critical releases should be done only when team is at high capacity. This deployment occurred when several team members were planning not to be available.
  • This is not the first time we deployed something on a Friday and ran into issues, but it will be the last. No more Friday deployments.

Defensive code development

  • Lines with comments are more likely to have bugs. Think about cases that might make the comment be false.
  • Audit fixes are more likely to have bugs, extra attention should be paid to those.
  • The PoolNonTv pools were not extensively tested because they were assumed to be simpler. In fact, it was hoped that NonTv pools would some day be phased out if we could find ways to earn yield on all base tokens. Risk and therefore attention should be determined by potential impact, and not perceived complexity.
  • On mainnet the Frax pool doesn’t have funds which made it difficult to conduct tests on forks, which is something we do a lot of. The arbitrum roll was barely tested because we didn’t think it had significant differences with mainnet and also because of the relative insignificance of the funds. Combined these two factors left the PoolNonTv contracts barely tested on forks.
  • Inheritance hides potential errors. retrieveBase was not unit-tested on PoolNonTv contracts as it was not present there. Fully flattened contracts also do not seem to be the answer since it would add thousands of SLOC and make it difficult to review and maintain. But something must be done, as similar errors might happen with custom versions of the Join contract. At a minimum, whenever making a change to something that affects multiple “versions” (or modules as we have called them), it is important to consider the impact on each version carefully as well as writing tests for each.

Emergency Process

  • The EmergencyBrake needs to be refactored for ease of use. We already knew that.
  • The delay on the Timelock stopped us from isolating the Ladle on mainnet, potentially exposing the larger part of the TVL when we still didn’t know it was safe. The multisig must be given direct access to restore plans on the EmergencyBrake.
  • Alerts are required on every asset-holding contract (Joins and Pools) to alert as soon as a certain threshold of the assets they hold leaves. This might require new features from Tenderly.

--

--