Beta Launch Postmortem: How We Broke Firestore and What We Learned

Calvin Koepke
Published in ADA Handle
4 min read · Nov 10, 2021

Our number one priority, in every internal conversation we have ever had, is to never compromise the rights of our users. Ever.

This means we will never put monetization, performance, scalability, or, in this case, brand protection ahead of our users, for as long as the project remains in our hands. In that same light, we feel it is important to provide full transparency into what happened this past Saturday and give you the opportunity to form your own opinions and criticisms.

What Happened

On November 6th, 2021 at 2PM UTC, we opened the Minting Portal on our website, which drew roughly 4,000 unique visitors and 16,000 page views.

Within about 3 minutes, we were getting reports of failed submissions for our SMS verification queue. The logs across our servers (which we were watching live) were going absolutely bananas.

To give context, in case you weren’t able to participate in the beta sale, the current flow of a phone number submission looks like this:

  1. A user submits their phone number.
  2. Our app fires a fetch request off to a serverless function.
  3. This function parses the submission.
  4. After getting the all-clear, it makes another fetch request to our NodeJS Express app, and this is where the magic happens: verifying the phone number and storing the data in Firestore.
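The steps above can be sketched as a small pipeline. This is an illustration only: names like parseSubmission, verifyAndStore, and handleSubmit are hypothetical, and the network hops (serverless function, Express app, Firestore) are stubbed as plain functions.

```javascript
// Hypothetical sketch of the submission flow described above.
// All names are illustrative, not ADA Handle's actual code, and
// the downstream services are mocked as local functions.

function parseSubmission(body) {
  // Step 3: the serverless function validates the payload shape.
  const phone = (body.phone || '').replace(/[^\d+]/g, '');
  if (!/^\+\d{10,15}$/.test(phone)) throw new Error('invalid phone number');
  return { phone };
}

async function verifyAndStore(submission) {
  // Step 4 (stubbed): the Express app would verify the number via SMS
  // and store the result in Firestore.
  return { phone: submission.phone, queued: true };
}

async function handleSubmit(body) {
  const submission = parseSubmission(body); // steps 2–3
  return verifyAndStore(submission);        // step 4
}
```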

A Contentious Issue

At this point, submissions to Firestore started to fail due to high concurrency among the requests. Our functions were not optimized to handle the contention that can arise from this (especially since internal testing with our community was limited to around 10–15 people).

“Contention” occurs when two or more transactions attempt to write to the same document at the same time, and the database must choose which one to accept and which to reject.
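The idea can be shown with a toy optimistic-concurrency example: two "transactions" read the same document version, then both try to commit, and the one that committed against a stale version is rejected. This is the general concept, not Firestore's internals.

```javascript
// Toy illustration of write contention. Two transactions read the same
// document version; the first commit wins, the second sees a stale
// version and is rejected.

const doc = { version: 0, count: 0 };

function commit(readVersion, newCount) {
  if (readVersion !== doc.version) {
    return { ok: false, reason: 'contention: stale read' };
  }
  doc.version += 1;
  doc.count = newCount;
  return { ok: true };
}

// Both transactions read version 0...
const readA = doc.version;
const readB = doc.version;

// ...then race to commit. A wins; B is rejected as contended.
const resultA = commit(readA, doc.count + 1);
const resultB = commit(readB, doc.count + 1);
```

In a real database the losing transaction would typically be retried, which is exactly the behavior described next.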

One thing we did not realize is that when a Firestore transaction fails due to contention, the client will automatically retry it up to 5 times.

You can imagine, then, that if 4,000 people are all trying to submit their phone number (and frantically resubmitting after every failure in an attempt to get in line), and each failed request is automatically retried up to 5 more times, we could easily reach the 315,000 write operations we recorded on the database. Add to that around 750k read requests, and our database was being put through a lot.
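A rough back-of-envelope check makes the number plausible. Assuming each contended transaction is attempted up to 6 times (1 initial try + 5 automatic retries), and that frantic users resubmitted on the order of a dozen times each (an illustrative figure, not a measurement), the totals land in the same range:

```javascript
// Back-of-envelope write count under stated assumptions.
// "manualResubmitsPerUser" is an assumed average for illustration only.

const users = 4000;
const attemptsPerTransaction = 1 + 5;  // initial write + 5 automatic retries
const manualResubmitsPerUser = 13;     // assumed, not measured

const totalWrites = users * attemptsPerTransaction * manualResubmitsPerUser;
// ≈ 312,000, the same order of magnitude as the ~315,000 writes recorded
```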

Still, Firestore should be able to handle this.

Fixing Firestore

Upon further research, we realized that the code inside each transaction was far more complex than Firestore is designed to handle. To remedy this, we separated out the internals of each transaction and simplified them so they would work with, rather than against, Firestore’s scaling model.
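The general shape of this fix follows Firestore's standard guidance: do slow work (parsing, verification) outside the transaction, and keep the transaction body to a minimal read-then-write so contended documents are held as briefly as possible. The sketch below makes an assumption about what the original code looked like, and runTransaction here is a local stand-in, not the real Firestore SDK.

```javascript
// Illustrative restructure: expensive work moved OUTSIDE the transaction,
// leaving only a single read and a single write inside it.
// runTransaction is a stand-in that just counts operations.

async function runTransaction(fn) {
  const tx = {
    reads: 0,
    writes: 0,
    get() { this.reads += 1; return { inQueue: false }; },
    set() { this.writes += 1; },
  };
  await fn(tx);
  return tx;
}

async function expensiveVerification(phone) {
  // e.g. carrier lookup / SMS dispatch, deliberately outside the transaction
  return { phone, verified: true };
}

async function enqueue(phone) {
  // Before: verification ran inside the transaction body, holding contended
  // documents for its full duration. After: the body is one read + one write.
  const verified = await expensiveVerification(phone);
  return runTransaction(async (tx) => {
    const existing = tx.get();               // single read
    if (!existing.inQueue) tx.set(verified); // single write
  });
}
```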

For confidence’s sake, we wrote several new load tests that replicated what went wrong. Once we could reproduce the issue reliably, we tested and implemented the solution above.

Since then, we have confirmed our simplified transactions are more compatible with Firestore.

Our Load Test Results

After multiple reliability tests (operating at an average of 2000 concurrent requests) and at least one successful test at 6000 concurrent requests, we are sure that our database can scale with these newly written transactions.
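A minimal version of such a concurrency test just fires N requests at once and counts the successes. The handler below is a stub; a real test would point at the deployed endpoint (for example with a dedicated load-testing tool), and the function names are hypothetical.

```javascript
// Minimal sketch of a concurrency load test: fire `concurrency` requests
// simultaneously and count successful responses. The handler is a stub
// standing in for the real submission endpoint.

async function handler() {
  return { status: 200 };
}

async function loadTest(concurrency) {
  const results = await Promise.all(
    Array.from({ length: concurrency }, () => handler())
  );
  const ok = results.filter((r) => r.status === 200).length;
  return { total: concurrency, ok };
}
```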

Internally, we will keep cranking that number up to identify our limits and determine how far our database remains efficient under load.

A Golden Egg (of a bug)

The performance issues that hammered Firestore actually helped us uncover a more fundamental flaw that likely would have gone unnoticed until after our beta launch. So, in this respect, we are again happy (albeit saddened by the inconvenient user experience) that the Beta Sale revealed these weak spots.

The fundamental flaw in our infrastructure allowed for someone in the beta sale to receive a duplicate refund. And while this particular duplication was on the refund front, any duplication in the database is a concerning sign.

ADA Handle, more than any other NFT project to date, lives and dies on the guarantee of non-duplication. This is something that cannot happen.

In light of this flaw, we spent the next 48 hours investigating and pinpointing where these issues were occurring. Since then, we have set up systems that detect duplication at multiple levels of the transaction process, plus an in-app alert system to counter any duplication that might occur in the future.
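One common guard against duplicate payouts of this kind is an idempotency key: each refund is recorded under a unique identifier, and a second attempt with the same identifier is refused. This is a generic sketch of that pattern, not ADA Handle's actual detection system; a production version would persist the keys in the database rather than in memory.

```javascript
// Generic idempotency-key guard against duplicate refunds.
// In production the processed-key set would live in durable storage
// (e.g. a Firestore document keyed by payment ID), not in memory.

const processedRefunds = new Set();

function issueRefund(paymentId) {
  if (processedRefunds.has(paymentId)) {
    return { ok: false, reason: 'duplicate refund blocked' };
  }
  processedRefunds.add(paymentId);
  return { ok: true };
}
```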

Conclusion

We were advised not to be transparent about some of these things with our community, as it might hurt our public image.

Still, we stand by our decision to always put our users first and decided to give full transparency into the inner workings of what happened.

We recognize the impact over the weekend was damaging, but again, this is what beta launch phases are for. To avoid future issues involving real money, we have committed to running only TESTNET launches until we are confident that our platform is fully secure and performant.

At that point in time, we’ll announce a new MAINNET launch date. As always, we thank the community for their continued patience and belief in our mission: easy Cardano addresses for everyone.
