That time my infra startup ruined Christmas for thousands of children

Taylor Hughes
5 min read · May 3, 2020


Hi, I’m Taylor Hughes. I’m an Engineering Manager at Facebook working on new products. In a previous life, I started a startup building developer tools and photo-sharing apps, and wrote web software at Google and YouTube.

An Amazon review that was my fault.

I woke up Christmas morning in December 2015 to a notification on my phone: My startup’s backend API was unresponsive.

Usually on Christmas I’d wake up, have a cup of coffee, eat some overcooked eggs and bacon, and sit with my family while my niece and nephew tore into a stack of presents — but this year was going to be different, and I learned a couple of things.

  1. First, never ever ever let your SDK crash your client’s app. Do whatever you can to prevent this from day 1. This is extremely important.
  2. Next, if you’re building an infra service that others depend on, reach out to your early users and find out if they have any important launches coming up. You might even ask if there is a specific date — like Christmas morning, for example — that is of any importance to them.

The following is a cautionary tale, which resulted in a very angry customer and a lot of sad kids.

Since leaving my corporate job in 2012, I had been on call essentially nonstop for almost four years: first for a series of backends for our apps, Cluster being the main one, and later for our in-app analytics and feature flags platform, LaunchKit.

Our startup was about three engineers at the time, and our in-app analytics service carried a proud “beta” label. We only had a couple hundred apps using our SDK, mostly from tinkerers who hadn’t shipped to the App Store yet.

Our downtime was never the result of scale; it was mostly small misconfiguration issues. Most of the outage notifications I got involved a stuck task queue or an unresponsive database, and I just had to ssh in quickly and kick something to get the service back up.

So when I logged into the Amazon AWS app on my phone on Christmas morning, I was not expecting the first graph I saw: There was a giant increase in traffic to our tracking service, starting just an hour or two before 9 a.m.

The spike also looked organic: it wasn’t a squared-off shelf or a single spike over a minute or two, but a gradual buildup from many distinct remote IPs over the course of the morning, followed by a cliff where our servers obviously could no longer handle it.

When the service died, we were seeing about 20 times the traffic from the previous day.

The very bad fix

Looking through the logs, it became obvious that the analytics product was getting slammed — all from a single SDK app ID I hadn’t heard of before. The app name had something to do with a race car.

Heart pounding, I needed to get the service back up — particularly for everybody who wasn’t that race car app, including our own apps and some paying customers using another service we built.

I decided to shut off the analytics product just for the race car app, and return an innocuous JSON dict — basically {"result": null}. That would remove all the database load and allow us to serve a bunch of very quick responses without doing any real work, and I’d increase capacity later. Easy peasy.

I pushed this null response at around 9:30 a.m., verified the main services were coming back to life, and went to go open Christmas presents with my 3-year-old nephew.

The “oh shit” moment

It took about an hour for the CTO of the race car app to figure out how to contact me.

He tried our support queue first, then DM’d our CEO on Twitter, and finally got ahold of me by sending a LinkedIn invitation: the most desperate of InMails, no doubt.

Turns out the null fix was Very Very Bad. Really, the worst. I’m filled with shame writing about this, even five years later.

The version of our SDK that the race car app had installed crashed the host app when it interpreted the NSNull API result as a dictionary. We had fixed this obviously terrible behavior months before, in a previous SDK update, but the race car app hadn’t updated to a newer version of our software.
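
To make the failure mode concrete, here is a minimal sketch of the crash pattern in Swift (the real SDK was Objective-C, so this is illustrative rather than our actual code): JSONSerialization represents JSON null as NSNull, and an unconditional cast of that value to a dictionary takes the whole host app down with it.

```swift
import Foundation

// Illustrative sketch only; the variable names are hypothetical.
// The point is the unconditional cast at the end.
let body = "{\"result\": null}".data(using: .utf8)!
let json = try! JSONSerialization.jsonObject(with: body) as! [String: Any]

// JSONSerialization maps JSON null to NSNull, not to a missing key.
let result = json["result"]   // Optional(<null>), i.e. NSNull

// An unconditional cast, like the old SDK's, crashes the host app here:
// "Could not cast value of type 'NSNull' to 'NSDictionary'".
let settings = result as! [String: Any]
print(settings)
```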

At this point I realized what the race car app actually was: It was a companion app, for a toy, for children. And I found the stream of Amazon reviews for it.

Children were opening their Christmas presents and trying to play with an awesome new race car toy, and we were doing the worst possible thing by crashing the app they needed. The toy didn’t work without it. The app crashed instantly.

Fuck.

The Amazon reviews for the product actually still reference the crashes on the third or fourth page — so I can look back at them even now and continue to feel deeply embarrassed.

The aftermath

I immediately pushed a better fix, something like {"response": {}}, which resolved the issue and still kept the load down on the service.

Later that day, I brought up more EC2 instances from the passenger seat of a car while my wife drove us across the entire state of Wisconsin for the rest of our Christmas Day activities. Over the following days and weeks, I pushed a dozen more fixes, including a bunch of autoscaling and resiliency work so this wouldn’t happen again.

But there’s nothing like crushing children’s dreams to make you think really hard about all the mistakes you made when designing your backend architecture.

Next time you’re building a service that other apps depend on, make sure you can’t cause their apps to crash — even if your service gets overloaded and you push an awful, tiny, innocuous fix.
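
On the SDK side, the concrete version of that lesson is to parse defensively from day one. Here is a minimal sketch of what that looks like in Swift; the function and field names are hypothetical, not LaunchKit’s real API:

```swift
import Foundation

// Hypothetical sketch of a defensive SDK parser: anything unexpected from
// the server, including a null "result", degrades to an empty config
// instead of crashing the host app.
func parseConfig(_ data: Data) -> [String: Any] {
    guard
        let json = try? JSONSerialization.jsonObject(with: data),
        let dict = json as? [String: Any],             // conditional cast, never as!
        let result = dict["result"] as? [String: Any]  // NSNull simply fails this cast
    else {
        return [:]  // no config is always safer than a crash
    }
    return result
}

// With this in place, that bad Christmas-morning response is harmless:
let bad = "{\"result\": null}".data(using: .utf8)!
print(parseConfig(bad))  // prints "[:]" instead of crashing
```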

Thanks for reading! If you enjoyed this article, I would really appreciate you giving me some claps or leaving a comment. Thanks to Chris, Riz, Noam, Brenden and Carolyn for reading early drafts of this.
