Incident Postmortem: Slow Confirmation Popups
Beginning on Thursday, April 25, 2019, users interacting with dapps via MetaMask found some confirmation popups taking an extra minute to load. In this article, we’ll review why this happened, what we did about it, and what we’ll do differently to make this kind of issue less likely in the future.
TL;DR: We had a process oversight, cached less than we could have, and had a request timeout set too long. We recommend dapp developers register their app’s method names on the on-chain method registry.
Context of the Change
At MetaMask, we’ve long known that one of our most important duties is to make transactions safe and coherent for users to approve. Over our history, we’ve come a long way toward making it safer and easier to understand what you’re doing with MetaMask, but we still have a long way to go!
Earlier this year we introduced integration with the Parity on-chain method registry, allowing developers to verify their function names, so we could render them on the confirmation screen.
We chose this approach because it relied on the same trust infrastructure the user already has configured (Infura by default, or any other provider they chose). Through that provider, we could ask a smart contract registry that trustlessly verifies the method signatures submitted to it on-chain, allowing us to render a method name with a high degree of certainty that it was the name the developer used when writing the contract (hash collisions aside).
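For context, the lookup key in these registries is not the function name itself but the 4-byte method selector: the first four bytes of the transaction’s calldata, derived from the keccak-256 hash of the canonical function signature. A minimal sketch (the helper name is ours, not MetaMask’s actual code) of extracting that selector:

```javascript
// Extract the 4-byte method selector from a transaction's calldata.
// The selector is the first 4 bytes (8 hex characters) after the "0x"
// prefix, derived from the keccak-256 hash of the canonical signature,
// e.g. "transfer(address,uint256)" -> "0xa9059cbb".
function getMethodSelector(calldata) {
  if (
    typeof calldata !== 'string' ||
    !calldata.startsWith('0x') ||
    calldata.length < 10
  ) {
    return null; // not a contract interaction with a method call
  }
  return calldata.slice(0, 10).toLowerCase();
}

// Example: calldata for an ERC-20 transfer call.
const data =
  '0xa9059cbb000000000000000000000000abcdef0123456789abcdef0123456789abcdef01' +
  '0000000000000000000000000000000000000000000000000de0b6b3a7640000';
console.log(getMethodSelector(data)); // "0xa9059cbb"
```

It is this selector that gets checked against the registry to recover a human-readable name.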
This approach is far from perfect: It lacks translation, or really any rich interpretation of the transaction at all, but it is a nice and easy thing we knew we could add that gives context to an otherwise hard-to-decipher screen when interacting with contracts you’re not familiar with.
Despite publishing that integration, many developers have not registered their method names on that on-chain registry, and our MetaMetrics program further emphasized to us just how important the confirmation screen is to our user experience.
This led us to pursue a fallback centralized method registry lookup, leveraging 4byte.directory by the wonderful Piper Merriam. Our confirmation view would check the on-chain registry first, and if it was unavailable, it would try 4byte.
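The intended lookup order can be sketched roughly like this (the two lookup functions are hypothetical stand-ins for the real registry queries, not MetaMask’s actual implementation):

```javascript
// Hypothetical stand-in: in the real extension this would query the
// Parity on-chain method registry through the user's configured
// provider. Here we simulate a registry miss for this selector.
async function lookupOnChainRegistry(selector) {
  return null;
}

// Hypothetical stand-in for a hit against the centralized
// 4byte.directory fallback.
async function lookup4Byte(selector) {
  return 'transfer(address,uint256)';
}

// Check the trustless on-chain registry first; only fall back to the
// centralized directory when the on-chain lookup misses or fails.
async function resolveMethodName(selector) {
  try {
    const onChain = await lookupOnChainRegistry(selector);
    if (onChain) return onChain;
  } catch (_) {
    // On-chain lookup failed; fall through to the centralized fallback.
  }
  try {
    return await lookup4Byte(selector);
  } catch (_) {
    return null; // render a generic "Contract Interaction" label instead
  }
}

resolveMethodName('0xa9059cbb').then((name) => console.log(name));
// "transfer(address,uint256)"
```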
The Nature of the Problem
Despite this reasonable plan, we had a few issues with the way it was ultimately carried out.
First of all, the directory lookups blocked the loading of the confirmation screen, which we already know loads slower than ideal. This means that if those lookups were slow, the confirmation window would also appear slowly.
Second, while we had good error handling for this endpoint, the timeout period for the 4byte lookup was not customized, so it defaulted to one minute. If 4byte.directory was very slow to respond or threw errors, users could face up to a one-minute wait before seeing a generic “Contract Interaction” confirmation window.
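One way to bound a lookup like this is to race it against an explicit timeout so the UI can fall back quickly instead of waiting out the transport’s default. A rough sketch (the budget and helper names are illustrative, not our actual values):

```javascript
// Race a lookup promise against an explicit timeout; resolve with a
// fallback value instead of hanging for the default one-minute timeout.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Simulated slow 4byte lookup, far slower than our budget.
const slowLookup = new Promise((resolve) => {
  const t = setTimeout(() => resolve('transfer(address,uint256)'), 60000);
  t.unref(); // don't keep the process alive for the full minute
});

withTimeout(slowLookup, 250, 'Contract Interaction').then((label) => {
  console.log(label); // "Contract Interaction" after ~250ms, not a minute
});
```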
Third, we also added method name rendering to our transaction list, and these lookups were not locally cached, and so every user who loaded their transaction list essentially contributed to a DDoS of 4byte.directory during that time.
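A small in-memory cache keyed by selector would have kept each client to at most one outbound request per method name. A minimal sketch of the idea (not our actual implementation):

```javascript
// Memoize lookups per selector so repeated renders of the transaction
// list reuse the first result instead of re-hitting the directory.
function createCachedLookup(lookupFn) {
  const cache = new Map(); // selector -> Promise of method name
  return function cachedLookup(selector) {
    if (!cache.has(selector)) {
      // Cache the in-flight promise too, so concurrent renders of the
      // same selector share a single request.
      cache.set(selector, lookupFn(selector));
    }
    return cache.get(selector);
  };
}

// Example with a counting stub standing in for the 4byte request:
let requests = 0;
const lookup = createCachedLookup(async (selector) => {
  requests += 1;
  return 'transfer(address,uint256)';
});

Promise.all([lookup('0xa9059cbb'), lookup('0xa9059cbb')]).then(() => {
  console.log(requests); // 1: the second call hit the cache
});
```

Caching the promise rather than the resolved value also deduplicates concurrent requests for the same selector, which matters when a whole transaction list renders at once.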
Fourth, the change was reviewed and merged without coordinating with 4byte itself, so 4byte was unprepared for the coming onslaught of user requests.
Combine the DDoS from the transaction list with the long timeout on the confirmation’s lookup, and you have some users, those interacting with unregistered method names, getting one-minute confirmation windows.
Through this process, some people have asked us about the nature of our QA process. We have a thorough and documented process for a release, and we also rolled this release out gradually, starting with 3% of Chrome users on Tuesday, and didn’t roll out entirely until Thursday. But because the issue would only manifest when 4byte went down, it didn’t become visible until we had fully rolled out to production, Friday morning activity began, and the DDoS was in full effect.
Many users tweeted at us, and our support team regrets not having noticed the correlation between these reports. Taken individually, the issue resembled a variety of other issues, like users with slow internet connections, and so those users were helped in a more generic way.
We also received a number of reports on our GitHub, where users could pile on to common issues, making it much more obvious that the reports represented a new and widespread problem. This is ultimately what got our team’s attention first.
Our Response
Once we were alerted to the issue, we identified the cause very quickly, and had a revert fix PR ready within an hour, which we then promptly published to the Firefox and Chrome stores.
We had team members communicating with users on GitHub and via email and social media about the nature of the issue, and how they could work around it by installing the new fix manually if needed, instead of waiting for the update to auto-install.
Unfortunately, the Google Chrome store also automatically flagged this particular release (the fix) for a manual review, a procedure that was new to us. We had no idea how long that process would take, so we began working on alternative workarounds for scenarios where Google might take unacceptably long to accept our change.
We were prepared to heavily populate the on-chain registry ourselves, which would have let us avoid our 4byte.directory lookups, but around that time Google accepted our version update, which should now be live for all users.
Learnings and Next Steps
It has been a while since we had a severe usability issue like this, and so we’re taking the opportunity to learn as much about our process from this event as possible.
First of all, we’re enhancing our code review policy with requirements for pull requests that involve third-party APIs: coordinating with the API’s operators, appropriately caching those requests, and ensuring timeouts are chosen with good user experience in mind.
It was suggested that we invite the community to participate in our QA process, so we’re currently considering a community QA program in which we would publish “nightly” builds for a period before going to production, with bug bounties on each one.
We plan to make our confirmation windows faster to load with several strategies, including allowing non-critical data to load in-line (and not block the rest of the interface from loading). We’re also planning to add much more thorough caching for these method signatures, to reduce the number of outbound requests, and we’re looking at hosting a CDN cache of method signatures as well, so we do not overly strain community infrastructure.
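The non-blocking idea above amounts to rendering the confirmation immediately with a generic label and patching in the method name if and when the lookup resolves. A rough sketch (the render callback is a hypothetical placeholder for the UI update):

```javascript
// Show the confirmation right away with a generic label, then update
// the label whenever the (possibly slow) lookup resolves.
function renderConfirmation(lookupPromise, render) {
  render('Contract Interaction'); // never block the window on the network
  lookupPromise
    .then((name) => {
      if (name) render(name);
    })
    .catch(() => {
      // Lookup failed; the generic label is already on screen.
    });
}

// Example with a stub lookup resolving after a short delay:
const labels = [];
const lookup = new Promise((resolve) => {
  setTimeout(() => resolve('transfer(address,uint256)'), 50);
});
renderConfirmation(lookup, (label) => labels.push(label));
// labels is ["Contract Interaction"] immediately, and
// ["Contract Interaction", "transfer(address,uint256)"] ~50ms later.
```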
Finally, we’re also drafting a crisis response guide that includes a few tricks we came up with during this event to help us more effectively coordinate with the affected users on the many communication channels they reach out to us on.
If you’re a dapp developer, you can also improve your users’ experience by registering your function names on the on-chain registry as described in our docs here.
Conclusion
At MetaMask, it’s our solemn privilege to facilitate your use of the decentralized web, and as our user base has grown, so has our responsibility to bring you a solid experience. We’re proud of how long it’s been since a usability issue affected this many people, but we’ve only gotten this stable through rigorous discipline and commitment to quality at all levels. So we’re happy to take this opportunity to learn and grow, and we hope that by showing you how we deal with the issues we face, we maintain your trust that we’re here, taking your experience and safety seriously.