The incomplete gas patch and why it caused consensus failures

Simon Warta
Published in CosmWasm
Aug 12, 2024

On Thursday, August 8th, we released a patch for a Medium-severity vulnerability in CosmWasm called CWA-2024-004. The bug in question had the potential to delay block production and cause other unintended resource consumption in certain edge cases.

While severity is a global measurement across all users of CosmWasm, the fix is certainly more important for permissionless chains than for chains that only accept codes from pre-approved entities. So, although the issue was non-critical, we recommended that permissionless chains perform an upgrade in the week of the release.

The patch does two things:

  1. Increase the gas for breaking operations (loop, br, call, return, …) from 170 to 1610
  2. Reduce the gas for everything else from 170 to 115

The increase accounts for the additional metering operations that are performed for breaking operations. Put simply, a mispricing was fixed to realign with the target of 1 Teragas/second.
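Conceptually, the new pricing is a two-tier cost function over Wasm operators. The sketch below is illustrative only: the `Op` enum is a simplified stand-in, not the actual wasmer operator type used in cosmwasm-vm.

```rust
// Simplified stand-in for the Wasm operator type; the real code
// matches on wasmer's parsed operators.
enum Op {
    Loop,
    Br,
    Call,
    Return,
    I32Add, // stands in for "everything else"
}

// Gas cost per operator after CWA-2024-004: breaking operations
// (which trigger extra metering work) cost 1610, all others 115.
fn gas_cost(op: &Op) -> u64 {
    match op {
        Op::Loop | Op::Br | Op::Call | Op::Return => 1610,
        _ => 115,
    }
}

fn main() {
    assert_eq!(gas_cost(&Op::Loop), 1610);
    assert_eq!(gas_cost(&Op::I32Add), 115);
}
```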

This fix was well prepared, benchmarked, tested and backported to the 1.5, 2.0 and 2.1 release series of CosmWasm.

A couple of hours after the release, two teams reached out wondering why gas usage did not change after the upgrade, and why patched and unpatched nodes were still in consensus. Very strange. We checked a few obvious things, like whether gas_used is part of consensus in CometBFT and whether the gas changes were large enough to be visible in Cosmos SDK gas. Even for the very experienced team at Confio, there was no obvious way to debug this.

After some time we received the crucial hint from Sergey Golyshkin (Neutron team): the gas cost is part of the contract compile step and is only applied during code upload. This was it! All compiled contracts in the cache folder ~/.myd/wasm/wasm/cache continued to run with the old gas pricing instead of the new one. As a result, the patch only took effect for new codes or after re-compilation. The latter happens whenever the module cache is invalidated, e.g. by spinning up a new node from a snapshot without the cache directory, by manually deleting the folder, or by invalidating it in the cosmwasm-vm codebase. This is why we labeled the patch incomplete, not wrong. It was good, but this detail was missing.
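One way to force the re-compilation described above is to remove the module cache while the node is stopped. This is a hedged sketch: the path follows the pattern from the article, and the ".myd" home folder name differs per chain.

```shell
# Force recompilation of all cached modules (stop the node first).
# The home folder name (".myd" here) is chain-specific.
CACHE_DIR="$HOME/.myd/wasm/wasm/cache"
rm -rf "$CACHE_DIR"
# On the next start, the node recompiles every code with the
# gas pricing of the currently installed cosmwasm-vm.
```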

Once this was understood, we quickly released a set of follow-up patches the same day which invalidated all old caches, so that no matter which cache a node has, all nodes now run on the same logic.
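A common way to invalidate caches like this is to include a version discriminator in the cache path, so that bumping it makes old artifacts invisible and forces a rebuild. A minimal sketch of the idea (the constant name and path layout are illustrative, not the actual cosmwasm-vm code):

```rust
use std::path::PathBuf;

// Illustrative version discriminator; bumping this after a metering
// change makes nodes ignore (and rebuild) previously compiled modules.
const MODULE_CACHE_VERSION: &str = "v2";

// Compute where compiled modules for the current VM version live.
// Old directories (e.g. ".../v1") are simply never read again.
fn module_cache_dir(base: &str) -> PathBuf {
    PathBuf::from(base).join("modules").join(MODULE_CACHE_VERSION)
}

fn main() {
    let dir = module_cache_dir("/home/user/.myd/wasm/wasm/cache");
    assert!(dir.ends_with("modules/v2"));
}
```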

But why?

The patch we released worked and behaved perfectly fine at various levels of testing: unit and integration tests in the cosmwasm repo, as well as in Go land in wasmvm and wasmd. What all those tests have in common is that their caches do not live longer than the test itself. They start fresh, whereas a node in the wild does not.

We do not maintain testnets anymore. Running and maintaining a vanilla CosmWasm testnet was part of our 2024 proposal to the ICF, but it has not been prioritized. Applying the patch there first might have sped up the discovery, but it would have been no guarantee of success, as such networks typically carry little traffic.

In open source, security patches are usually released to the general public in one go, to give everyone equal access to the patch. Applying a patch to a public testnet first would give an attacker a significant head start to analyze it and craft an attack. Running a private testnet might help, but is not in scope right now. Also, maintaining a shadow stack in private repos, including cosmwasm-vm, wasmvm and wasmd, with a completely separate CI, is a lot of work. Maybe it's worth it, I don't know.

Lessons learned

  • Getting caching right is hard. We have solved similar problems before and will solve this one too, but there is a price to pay for enhanced execution speed.
  • Wasmer metering is baked into the module at compile time, so gas cost changes do not affect already-compiled modules.
  • There is a class of issues you just don't find in isolated unit tests. Higher-level tests are needed not only to exercise uncovered code but also to catch issues like this.
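The compile-time nature of metering can be illustrated with a toy model. This is purely illustrative, not wasmer's actual API: real wasmer injects metering instructions into the module during compilation, while this sketch just freezes a gas bill per instruction at "compile" time.

```rust
// Toy model of compile-time metering: gas costs are captured when a
// module is compiled, so a cached artifact keeps whatever prices were
// current at compile time, no matter how the cost function changes later.
#[derive(Clone, Copy)]
enum Instr {
    Br,  // a breaking operation
    Add, // everything else
}

struct CompiledModule {
    // Gas per instruction, frozen at compile time.
    gas: Vec<u64>,
}

fn compile(code: &[Instr], cost: fn(&Instr) -> u64) -> CompiledModule {
    CompiledModule {
        gas: code.iter().map(cost).collect(),
    }
}

fn old_cost(_: &Instr) -> u64 { 170 }
fn new_cost(i: &Instr) -> u64 {
    match i { Instr::Br => 1610, _ => 115 }
}

fn main() {
    let code = [Instr::Br, Instr::Add];
    let cached = compile(&code, old_cost); // compiled before the patch
    let fresh = compile(&code, new_cost); // compiled after the patch
    // Updating the cost function does nothing to the cached artifact:
    assert_eq!(cached.gas, vec![170, 170]);
    assert_eq!(fresh.gas, vec![1610, 115]);
}
```

This is exactly the situation the patched nodes were in: their "cached" modules kept charging the old prices until a recompilation produced a "fresh" artifact.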

To wrap up: yes, it was bad. But the problem was fully understood and patched the same day it surfaced, and it will only lead to a more robust solution from here. So, time to continue building.
