Preventing bad code slipping through the cracks: Compound Finance change management
On the 13th of October, Compound Finance deployed an updated contract containing a small software error that had a widely documented $178m effect on Compound’s ecosystem. DeFiSafety analysis puts this entirely down to errors in change management practices.
In short, change management is a complex skill that all updatable protocols must learn or else DeFi will suffer embarrassing incident after embarrassing incident.
This is the second in our series of articles that looks at recent DeFi incidents and discusses how better processes could have reduced the likelihood of the incident occurring. The first was our take on Iron Finance and Pause Control, where I used my aerospace software background and apply it to the painful (lest we forget expensive) lessons DeFi has still to learn.
Disclaimer: DeFiSafety is not fully familiar with the details of the Compound software incident. All our views are our own and are subject to change given new information. If you feel like this is an unfair judgement, please contact us on Telegram — we’d love to discuss it and update this article with you. This article talks about software processes in general and uses this Compound incident as an example. The Compound team did not respond to our requests for comment on this article and DeFiSafety apologizes in advance if any errors are made which will be corrected immediately should they be found. As always with any DeFiSafety communication, this is not financial advice.
Change Management
What is Change Management? It is the management and software process to plan, track and execute a change in updateable deployed code. Clearly, Compound had a process for managing the change using the DAO as the responsible entity as identified in their management summary, an RFP and a detailed software PR. This is about improving this process and reducing the likelihood of similar occurrences.
Testing — Unit and System
Any software changes should obviously be put through existing system tests, in addition to new ones being created for the new software specifications. In the case of Compound, this was documented. In retrospect, it is easy to criticize the testing because the bug clearly snuck past.
However, looking at the PR (before deployment) comprehensive, clearly documented and executed testing is obvious — as it should for a protocol as respected as Compound. There were both unit and system tests, alongside deployment on a test network where the community was asked to test the change independently.
In aerospace, it is critical that testing is always done by someone else than the coder. In the case of this Compound exploit, it appears the dev did the testing. This is a simple step that will massively improve the code: different eyes improve quality.
This moment presents an opportunity to push for comprehensive dynamic system test environments in DeFi. Andre Cronje famously said he “tests in prod” and this was because the huge dynamic DeFi environment is simply too complex to build into a test environment. This was true when yearn was born. Today, with the billions in DeFi, we believe it is both possible and necessary to start creating big tests environments. This recent Compound issue is a prime example, and these problems are only going to get more costly and problematic.
A large and dynamic up to date test environment is a big investment. This will likely too big an effort for a single big protocol but if the industry could come together, create and support such an environment then the whole of DeFi would greatly profit from this.
Documentation
Documentation is especially important after something goes wrong. It allows you to clearly see what you did, which enables analysis on what improvements can be made. Here, Compound maintained their high standards: the documentation is clear, well referenced and appears complete. The documentation meets our standards to the fullest extent, as we identified when we gave them a 96% score in our review.
Quality Reviews
It is easy to disparage a quality organization (and believe me in my time in aerospace I did it a lot), but it is important to remember they fulfill an essential role! They are an in-company independent check on the process and execution of a change. They routinely bring intelligent, knowledgeable, and most importantly independent eyes that have the job of finding mistakes. They are very useful, and their omission is a big red flag.
An aerospace software change would have a system requirement review, a code review a test review and a test result review at a minimum.
Since there is no hint of quality control people in the documentation of the change, I am guessing that there is not a quality organization within Compound. If this guess is true, then the error is at the highest level (which in this case, I guess is the DAO). There was a quality check run by Gaunlet that appeared to happen near the end of development and testing. This is a good step that should remain, but it does not replace a quality organization. A lean, skilled quality group (or person) would be money well spent for Compound — this is our first real improvement suggestion for a protocol as important as this.
Management
DeFi is justifiably proud of being lean & we love to critique traditional finance with its inflated staff numbers. However, as DeFi protocols manage values that routinely exceed millions (if not billions), more staff might be needed.
Looking at the documentation of the PR67 change, it seems that this change was close to a one man show. Another improvement this protocol might benefit from is a change manager, ensuring all the steps are followed, bringing in third parties and documenting the process.
Nevertheless, this suggestion must be qualified. A big disadvantage of adding a quality group is resolution of the arguments between quality, software and test. Management must remain the neutral third party making the final decisions– the DAO should be kept out of this for the decision making. Perhaps the management level exists at Compound, but this is not evident from documentation. Adding layers of management and quality may not feel good, but the bug in the code causing an issue as big as it did is proof that the testing team is far too lean.
Process
Change Management starts at the top, which in this case is a DAO. It starts with a documented process. For a quality team it requires a charter, a budget and a team. Once the teams are in place a change process details the steps required for all future software PRs. Once this is established, everyone else will follow the process because as is required for approval. When the next incident happens, the process will be improved again.
Avionics software drowns in process and cost which often deeply inhibits innovation. You can easily overdo necessary process. Nevertheless, given the process of this change’s implementation it seems that an addition of process is called for. As DeFi matures we will all have to work to stay lean.
Cost
Change Management takes time and people which adds up to cost. Generally, DeFi has enough money but spending it well through a DAO seems a weak point. When change management works well the extra effort seems wasteful because nothing happens, the changes just work. This is a good sign, but it lacks the sex appeal of other changes. Only the absence of change management grabs headlines.
Conclusion
Looking at this incident from a DeFiSafety perspective, I mostly see a need for more eyes on the change during the execution. Leaning on my avionics background I suggest things I know such as a quality organization and dedicated management. There can be better ways to achieve these goals: primarily through having more intelligent, independent people helping with the fix before it goes to audit.