An inside perspective, written by @zanicar
This article provides an inside perspective on the release pipeline of Archway upgrades in light of security advisories, testing issues, and continual improvement. It covers our most recent planned upgrade to Cosmos SDK v0.50 and CosmWasm v0.51 (still at v0.50 at the onset).
Background
The protocol team at Phi Labs, core contributors to the Archway network, is constantly working on innovative and novel features, such as Callbacks and FeeGrants, that allow for some very interesting smart contract use cases. However, as with most things in life, we often face mundane or less exciting tasks, such as upgrades; specifically, in this case, the upgrade to Cosmos SDK v0.50 and CosmWasm v0.51. However, these ‘boring’ tasks are required on the path to exciting new features and, more often than not, result in some unscheduled ‘excitement’ when things don’t go exactly as expected…
Expecting the Unexpected
As experienced engineers, we know to expect the unexpected. For this reason, we have several mechanisms and processes in place to deal with unforeseen circumstances: automated testing, robust release processes, monitoring and control mechanisms, and so on. However, there may be times when these systems and processes interact to produce unexpected results.
For example, during a recent protocol release cycle, we already had a release candidate tagged for release to our public test network. However, we received a security advisory regarding a critical upstream dependency on the day we were scheduled to deploy this version. The required upgrade was consensus-breaking and vital to our mainnet. Thus, we had to tag and release a new mainnet version, and on this front, things went exactly according to our emergency release plan. However, it resulted in a version bump (see our blog post on blockchain versioning) and unforeseen consequences in our general release pipeline…
Emergent Behaviour
Our emergency release plan allows for coordinated binary swaps when critical security upgrades are in play. Thus, our mainnet went from Archway v7 to Archway v8… but on our testnet, our tagged release candidate had to be bumped to Archway v9. This initially resulted in some ambiguous conversations as we had to keep track of the version formerly known as v8 (to avoid this issue in the future, we are now using release code names, hence Archway 50/50). However, something went silently wrong with getting this new version deployed to our testnet…
Our robust automation tooling detected an issue with upgrading from v7 to v9 and silently reverted to v7. In our eagerness to get the release back on schedule, we didn’t notice this as the network booted back up and started producing blocks. We then conducted internal tests and notified our developer community about the upcoming release. At this point, our developer community noticed the discrepancy in network versions…
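One lesson here is that automation should assert the version it expects rather than quietly falling back. As a rough illustration (not our actual tooling), a post-upgrade guard could query the standard Cosmos SDK gRPC-gateway route and fail loudly on a mismatch; the expected version string below is a placeholder:

```typescript
// Hypothetical post-upgrade guard: explicitly verify the application version
// the node reports instead of assuming the binary swap succeeded.
// EXPECTED_VERSION is a placeholder value for illustration.
const EXPECTED_VERSION = "9.0.0";

async function assertAppVersion(restEndpoint: string): Promise<void> {
  // Standard Cosmos SDK gRPC-gateway endpoint for node/application info.
  const res = await fetch(`${restEndpoint}/cosmos/base/tendermint/v1beta1/node_info`);
  if (!res.ok) {
    throw new Error(`node_info query failed with HTTP ${res.status}`);
  }
  const body = await res.json();
  const version: string = body.application_version?.version ?? "";
  if (!version.startsWith(EXPECTED_VERSION)) {
    // Fail loudly so a silent revert cannot masquerade as a successful upgrade.
    throw new Error(`expected app v${EXPECTED_VERSION}, node reports "${version}"`);
  }
}
```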
Back to the Unexpected
This time, we confirmed the appropriate network version and updated our tooling and automation processes to be very explicit in this regard. At the protocol level, all tests were passing… but then we got reports that internal application tests were failing! Smart Contract queries returned success conditions but with zero content. Our indexer was not indexing any events post-upgrade and, likewise, was not reporting any issues or errors.
On the one hand, we uncovered a bug in the arch3.js library. The version-sniffing mechanism that determines which client to use (Tendermint37Client or Comet38Client) disconnects the inappropriate client. However, that disconnect cascaded into the shared HttpBatchClient, leading to misbehavior in the appropriate client. Ensuring that the HttpBatchClient is only created after version sniffing resolves this matter.
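In rough terms, the fix amounts to performing version sniffing over a connection the long-lived client does not share. The sketch below illustrates the idea using CosmJS primitives; the function name and the version check are illustrative, not arch3.js’s actual implementation:

```typescript
import {
  Comet38Client,
  HttpBatchClient,
  Tendermint37Client,
} from "@cosmjs/tendermint-rpc";

// Illustrative sketch: sniff the node version over a throwaway connection,
// then create the HttpBatchClient only for the client that will keep it.
async function connectWithSniffing(endpoint: string) {
  // A temporary, non-batching connection for version detection, so that
  // disconnecting it cannot tear down the transport of the real client.
  const probe = await Tendermint37Client.connect(endpoint);
  const version = (await probe.status()).nodeInfo.version;
  probe.disconnect();

  // Only now create the batch client that the long-lived client will own.
  const batchClient = new HttpBatchClient(endpoint, { batchSizeLimit: 20 });
  return version.startsWith("0.37.")
    ? Tendermint37Client.create(batchClient)
    : Comet38Client.create(batchClient);
}
```

The design point is simply that the HttpBatchClient is owned by exactly one client, so disconnecting the sniffing probe can no longer cascade into it.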
On the other hand, the issue with the indexer, albeit distinct, turned out to be very closely related. Our engineers highlighted that upstream dependencies should be confirmed to use the appropriate Comet38Client. They then uncovered an upstream dependency PR that addresses build-related issues with correct client detection based on the relevant Cosmos SDK version. An updated version of this dependency includes that PR and, with it, support for Comet38Client; upgrading to it resolved the indexer issue.
Conclusion and Key Takeaways
The key takeaway is that we expected these issues to be reported as errors and caught by our tests. However, because they failed silently, they managed to slip through the cracks. We have since addressed these cracks by adding regression and integration tests that confirm expected behaviors even when functional tests pass.
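To give a flavor of what such a test can look like (a minimal sketch using CosmJS; the contract address and query message are hypothetical), the assertion targets the content of the response, not merely its success:

```typescript
import { CosmWasmClient } from "@cosmjs/cosmwasm-stargate";
import { strict as assert } from "node:assert";

// Hypothetical regression test: a query that "succeeds" must also return
// actual content; an empty result is treated as a failure, not a pass.
async function checkQueryReturnsContent(rpcEndpoint: string, contract: string) {
  const client = await CosmWasmClient.connect(rpcEndpoint);
  // The query message below is illustrative; any known-good query works.
  const result = await client.queryContractSmart(contract, { config: {} });
  assert.ok(
    result !== null && Object.keys(result).length > 0,
    "query succeeded but returned zero content",
  );
}
```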
The Case for “Live” Testing
“Can automated testing prevent all issues on testnet?” Good question. This could easily be classified as a “final boss” question because strong opposing opinions may arise. But IMHO, the answer is “No”. We cannot possibly test for every case without perfect knowledge, and if we had perfect knowledge, we would not need tests in the first place; without perfect knowledge, unforeseen edge cases can always arise…
As the dust of the client selection issues settled and the general testnet environment was stable again, a more sinister issue reared its head. We noticed transaction failures across the board for services using our Guzzler Club product. For some inexplicable reason, these transactions ran out of gas (exceeding the limit set by our FeeGrant module). Something in the new consensus engine or the upgraded smart contract engine results in higher gas consumption, but only in some instances…
Our engineers revisited our contracts and applied further optimizations to reduce gas consumption to previous levels, effectively addressing the issue. However, at this time, we have not established the root cause, and several theories abound. The upgraded consensus engine and smart contract engine may result in different execution pathways than the previous version, which add to gas consumption. Thus, any previously optimized contracts may need to be revisited to ensure they are optimized for the latest version.
Performance testing and benchmarking may be able to detect this type of issue. The point is that we test for the things we expect might fail, for the things we expect might change, and for the ways we expect bad actors might attempt to exploit a system… but unknown and unforeseen issues may still crop up, and we had best be prepared to deal with them quickly and efficiently. Regardless of what automated testing may entail, we still need to test products, services, and applications in an environment that is as close to the live environment as possible… and for this reason, we deploy network upgrades to our testnet, for both ourselves and our developer community, before we deploy them to mainnet.
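For instance, a gas-regression probe could simulate a representative transaction after every upgrade and compare it against a recorded baseline. This is a hedged sketch, not our benchmarking suite; the baseline figure, tolerance, and execute message are all illustrative:

```typescript
import { SigningCosmWasmClient } from "@cosmjs/cosmwasm-stargate";

// Hypothetical gas-regression check. BASELINE_GAS would be recorded on the
// previous protocol version; the 5% tolerance is an arbitrary example.
const BASELINE_GAS = 180_000;
const TOLERANCE = 1.05;

async function checkGasRegression(
  client: SigningCosmWasmClient, // already connected with a funded signer
  sender: string,
  contract: string,
): Promise<void> {
  const msg = {
    typeUrl: "/cosmwasm.wasm.v1.MsgExecuteContract",
    value: {
      sender,
      contract,
      // Illustrative execute message; use a representative real workload.
      msg: new TextEncoder().encode(JSON.stringify({ increment: {} })),
      funds: [],
    },
  };
  // Simulate the transaction to obtain its gas usage without broadcasting.
  const gasUsed = await client.simulate(sender, [msg], "gas regression probe");
  if (gasUsed > BASELINE_GAS * TOLERANCE) {
    throw new Error(`gas regression: ${gasUsed} > baseline ${BASELINE_GAS}`);
  }
}
```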
What about an additional testnet?
Currently, we utilize two testnets before any release to mainnet. All protocol changes, updates, and upgrades are tested locally before progressing to our internal testnet, Titus. Titus runs on the same infrastructure specifications as our public testnet, Constantine, and our mainnet, Triomphe. However, it is configured with very short governance periods and other relevant parameters conducive to testing and faster iteration. Very importantly, it is also an unstable network, meaning state resets should be expected. Consequently, only the most common Smart Contracts and most basic states are typically present on this network, making it unsuitable for general state management, transition, and continuity testing.
Constantine is our public testnet where developers get to deploy and test their smart contracts. It attempts to be as close as reasonably possible to the stability and continuity of Triomphe, our mainnet. However, it is the only place where the above-mentioned state-related tests can reliably be conducted, which sometimes means it will experience issues. It is, after all, still a testnet.
We have been asked internally and by our developer community if we can produce an additional testnet to fill the gap between Titus and Constantine: a testnet that maintains some state, more than Titus but less than Constantine, to allow for efficient state testing. I advocated for this idea as it makes sense from a theoretical perspective. But we had to evaluate the cost-benefit of this endeavor to ensure its practicality…
On closer examination, it turns out that although theoretically sound, such a network is unfortunately impractical. First, it would require state, specifically smart contract state, which means those contracts would need to be deployed to the network. They would also need to operate and be upgraded to stay aligned with the state encountered on Constantine and Triomphe in order to remain relevant. Thus, the same developers and internal resources would be burdened with owning this task as additional overhead. In addition, when testing on this network is conducted, those same resources would have to take responsibility for identifying, reporting, and potentially rectifying any issues uncovered. This renders such a network impractical, as its purpose in the first place is to reduce the burden on development resources… instead, it would only shift the burden, with added overhead for both Phi Labs and our developer community.
Conclusion
At the time of publication, we will have dealt with not one, not two, but three distinct security advisories, each of which individually delayed the planned release of Archway 50/50, on top of the testing issues we uncovered. We have gained valuable experience and adapted our pipelines to incorporate the lessons learned, to reduce the burden and limit these types of disruptions in the future. However, we can safely conclude that there will always be events or circumstances that are not directly catered for, and developing the capability, preparedness, and processes to respond to them efficiently as a team is necessary. I am personally very grateful to be part of the team at Phi Labs, which most certainly has this capability and is continually working to improve it!
Link to the original post on @GitHub: https://github.com/orgs/archway-network/discussions/41