Plug Post Mortem — 06/22

This is a summary of the events that led to Plug’s downtime between June 8th and 9th, 2022.

Plug
Jun 9, 2022


On June 8th, 2022, development & usage of Plug were halted due to bugs in our system that were exposed by a flurry of outages spanning the Internet Computer & the AWS Lambdas that Plug’s cache uses to accelerate content delivery.

Rest assured that no assets or keys were ever compromised; only Plug’s interface was broken.

Before diving into the summary of events, we want to extend an apology to all affected. Our diligence with our fallback systems will improve as a result of this outage.

Summary

Issues began on the 8th of June when the Internet Computer (IC) finalization rate slowed, causing many calls to fail, including Plug’s queries for token balances. DFINITY was made aware of this issue and resolved it within half an hour. We initially assumed this was the main culprit, so we took no action while waiting for the IC to recover.

However, around the same time that DFINITY was resolving their downtime, Fleek’s AWS account was throttled due to phishing reports on some sites hosted on Fleek. Both Fleek & Plug fall under the PsychedelicDAO product studio, which at the time meant sharing the same endpoint environment.

Once the IC came back, and Plug didn’t, we realized that the issue must also be persisting throughout our own infra.

But wait, I thought Plug was supposed to be built on decentralized systems… why the use of AWS? You’re right, Plug does not rely on centralized infrastructure like AWS to operate. However, what Plug does use AWS for is caching data for faster load times.

Token & NFT Balances

Underneath our cache layer Plug uses DAB, a registry service on the IC, to fetch token & NFT balances directly from canisters on the IC.

While using only DAB would be slower because the calls have to route to multiple canisters on the IC, it’s a vital fallback that ensures users’ data is always available due to the decentralized & non-custodial nature of the IC.

This combination of centralized acceleration and decentralized layers for data-hardening usually works harmoniously. What ultimately broke Plug’s Token & NFT tabs was a bug that prevented the fallback, which fetches user data directly from the IC, from triggering when the centralized cache layer was down.
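To make the intended behavior concrete, here is a minimal sketch of a cache-first lookup that falls through to the on-chain source on any failure, including a cache that never answers. The names (`getBalances`, `fromCache`, `fromChain`, `withTimeout`) and the `Balance` shape are hypothetical and chosen for illustration; this is not Plug’s actual code, and the two sources are passed in as plain functions rather than tied to any real endpoint.

```typescript
// Hypothetical types and names for illustration; not Plug's real code.
type Balance = { symbol: string; amount: string };
type BalanceSource = (principal: string) => Promise<Balance[]>;

// Try the fast centralized cache first. On ANY failure (thrown error,
// rejected promise, or timeout) fall through to the slower on-chain source.
async function getBalances(
  principal: string,
  fromCache: BalanceSource,
  fromChain: BalanceSource,
  timeoutMs = 3000, // assumed default; tune for the real service
): Promise<Balance[]> {
  try {
    return await withTimeout(fromCache(principal), timeoutMs);
  } catch {
    // Deliberately broad: we do not try to enumerate cache failure modes.
    return fromChain(principal);
  }
}

// Reject if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out")), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

The broad `catch` is the design choice that matters here: the fallback runs no matter how the cache fails, so a new failure mode in the centralized layer cannot leave the interface stuck.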

Activity & CAP

Centralized endpoint downtime also caused activity in Plug tied to CAP to fail. This is because Plug uses a centralized layer to accelerate activity processing & loading times that would be quite difficult to achieve through the IC alone.

Calls Via Plug

Making calls to the Internet Computer through Plug also broke during this downtime. The culprit? Another fallback bug.

Calls to the Internet Computer made through Plug are passed through a proxy server before being forwarded to the IC’s gateway for execution. This proxy server is used to seed CAP activity as described in our Plug V0.4.2 release.

When this proxy server goes down, Plug should fall back to making the call directly to the IC’s gateway. This didn’t happen. The proxy server returned an error, caused by the same broken AWS endpoint, that was not accounted for in our fallback condition. As a result, the fallback never triggered and no calls went through.
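The failure mode described above, a fallback condition that misses one kind of error, can be illustrated with a short sketch. It assumes a hypothetical `CallResult` shape and two injected transports, `viaProxy` and `viaGateway`; the real proxy protocol is not described in this post, so treat every name here as an assumption. The point is that an error response from the proxy and a thrown network error both lead to the direct call.

```typescript
// Hypothetical shapes for illustration; the real proxy protocol may differ.
type CallResult = { ok: boolean; status: number; body: string };
type Transport = (payload: string) => Promise<CallResult>;

// Send a call through the proxy, falling back to the gateway unless the
// proxy clearly succeeded. Both failure styles are covered:
//   1. the proxy throws or rejects (e.g. connection refused)
//   2. the proxy answers, but with an error response (ok === false)
async function callCanister(
  payload: string,
  viaProxy: Transport,
  viaGateway: Transport,
): Promise<CallResult> {
  try {
    const res = await viaProxy(payload);
    if (res.ok) {
      return res;
    }
    // Error response from the proxy: fall through to the gateway.
  } catch {
    // Transport-level failure: fall through to the gateway.
  }
  return viaGateway(payload);
}
```

Falling back on everything except a clear success is simpler to reason about than listing known errors. A real implementation would also need to consider whether a call that failed at the proxy might already have been forwarded, so that state-changing calls are not executed twice.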

The AWS endpoints have since been restored and Plug is working again, but we aren’t satisfied. Let’s go over what our next steps are to ensure that Plug’s centralized layers are nothing but frills on top of our decentralized infra.

Next Steps

Plug should never rely on centralized infra. The first step we are taking is fixing the two fallback bugs identified in the previous section. Without proper fallbacks, we run the risk of being stuck at the centralized layer, as we were during this outage.

Making calls through Plug will also become more robust against gateway outages. We’ve added a fallback for the IC’s main gateway, should it fail, and will continue to add more as they come online.
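One way to keep adding gateways without rewriting the call path is to hold them in an ordered list and try each in turn. The sketch below assumes a hypothetical `Gateway` type with a `name` and a `send` function; it is an illustration of the idea, not a description of how Plug implements it.

```typescript
// Hypothetical gateway abstraction; names are for illustration only.
type Gateway = { name: string; send: (payload: string) => Promise<string> };

// Try each gateway in priority order and return the first successful reply.
// If every gateway fails, throw one error that lists each failure.
async function sendViaFirstAvailable(
  payload: string,
  gateways: Gateway[],
): Promise<string> {
  const failures: string[] = [];
  for (const gateway of gateways) {
    try {
      return await gateway.send(payload);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${gateway.name}: ${reason}`);
    }
  }
  throw new Error(`all gateways failed (${failures.join("; ")})`);
}
```

With this shape, bringing a new gateway online is a matter of appending one entry to the list.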

Plug will also soon be getting a network selector which will allow developers to choose to use their local replica instead of the IC’s mainnet to interface with canisters while developing.

Lastly, we’ll also be separating Fleek & PsychedelicDAO’s centralized endpoints, reducing the chance of collision should similar outages happen in the future.

Conclusion

This outage can be traced to bugs that resulted in a failure to fall back to our decentralized bedrock during a centralized outage.

Had proper fallbacks been in place, Plug may not have been as snappy and fast, but would’ve continued business as usual.

We apologize for any inconveniences this outage may have caused, and sincerely thank everyone for their patience. If you have any questions, concerns, or recommendations for further decentralization of our stack, hop into our Discord.

Discord | Twitter | Website | GitHub

Plug

Plug is an Internet Computer browser crypto wallet & authentication provider.