Getting down with downtime

Insights from the development team on what happens, and what we do, when our database goes down.

CryptoKitties
3 min read · Mar 15, 2018

Before you ask, the header image is supposed to look pixelated. It’s art, okay?

It’s exciting to create a product like CryptoKitties. But maintaining that product is crucial—and rarely convenient.

Here’s what happened

On March 12th, at around 9:30 PM PT, our website went down and members of our team woke up (don’t judge when some of us go to bed; sleep is important). Our investigation showed that our database had hit 100% CPU usage, which is highly abnormal. It also meant that every connection available on our database was allocated. The API couldn’t serve any requests; hence, no website.

Getting to the root of the problem

We knew pretty quickly what was wrong, but we didn’t know how to fix it. Our initial reaction was to reboot our database. Its CPU usage went down, and it seemed like the reboot might work… but then usage quickly went back up. The problem wasn’t solved, and we needed to dig deeper into its causes.

Artist approximation of the dev team digging into the issue.

A quick check on the statistics of our Postgres instance (our database system of choice) showed us that most of the hung connections were running the same SQL query. This query, used for displaying the Cattributes page, was hogging all the available resources. To remedy this, we decided to temporarily disable the Cattributes page. We needed to buy ourselves some time to figure out how to make that information available without re-triggering the CPU usage issue.
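For anyone curious what that kind of check can look like, here is a minimal Python sketch. It is an illustration, not the commands we ran: it assumes the statistics come from Postgres’s `pg_stat_activity` view, and the function name, the sample rows, and the `hypothetical_cattributes` table are all made up for the example. The idea is to pull the non-idle connections and count how many are running each query, so a single runaway query stands out.

```python
from collections import Counter

# pg_stat_activity is a real Postgres system view; the choice of columns
# and the "not idle" filter are assumptions made for this sketch.
ACTIVE_QUERIES_SQL = """
SELECT pid, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
"""


def top_queries(rows, limit=3):
    """Count connections per query text, most common first.

    `rows` is an iterable of (pid, state, query) tuples, the shape you
    would get by running ACTIVE_QUERIES_SQL through any Postgres driver.
    """
    counts = Counter(
        " ".join(query.split())  # collapse whitespace so identical queries group together
        for _pid, _state, query in rows
        if query
    )
    return counts.most_common(limit)


# Made-up rows standing in for a real result set.
sample_rows = [
    (101, "active", "SELECT * FROM hypothetical_cattributes"),
    (102, "active", "SELECT *  FROM hypothetical_cattributes"),
    (103, "active", "SELECT 1"),
]

for query, connections in top_queries(sample_rows):
    print(f"{connections:>3}  {query}")
```

If one query accounts for most of the rows, that query is a good first suspect, which is roughly the situation described above.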

To make matters worse, as we attempted to shift our database to our failovers, we found out that a networking failure in our database provider’s platform had left the failovers stuck in an infinite loop of restoring state and communicating with the main database, never actually becoming ready for service. This network error complicated an already complicated situation. We were running out of options, fast.

All the fixings

We struggled for about an hour trying to make our database come to life, but we called time of death around 12:00 AM. Taking a more drastic measure, we decided to bring up a whole new set of databases based on a recent backup. To do that, we had to reconfigure most of our existing infrastructure, leading to the extended downtime you all experienced.

Whew! That seemed to do the trick. We’re still not sure exactly why this happened, although our database service provider has assured us it won’t happen again. Luckily, when we re-enabled the Cattributes page, everything was working exactly as it should.

Why we’re talking about this

It’s exciting to build products, launch features, and show off new designs. But even more important than all of these shiny things is having a product that works. And making it work is a daily task, with a lot of less-than-shiny things we have to do to make sure it keeps working. It’s the kind of thankless work no one will see if you’re doing it right.

We’ve received emails from folks who want to get into blockchain development because of CryptoKitties. And that’s awesome! But we’d be doing them—and our community as a whole—a disservice if we only shared the good parts.

If blockchain is going to reach its potential, it’s crucial to share our successes and our challenges. Understanding breeds community, and we want to breed the best community we can.

Well, that and cats. We want to breed some cool cats too.

Published in CryptoKitties

Collect and breed digital cats with CryptoKitties, the world's most successful blockchain game: https://www.cryptokitties.co/
