Who Let The Cats Out (Meow, Meow, Meow-Meow)
As some of you may have noticed, something strange happened on CryptoKitties this past weekend. For about ninety minutes late Saturday afternoon our Gen 0 Kitties began multiplying less like cats and more like rabbits.
Due to a database replication error, Kitty Clock released over 400 copies of the same two cats at the rate of a few dozen per minute, rather than one Gen 0 Kitty with unique genetics every fifteen minutes, as intended.
Perhaps our old nemesis Doc Purr finally unlocked the secret to Kitty cloning technology, but if he did, he quickly lost control of it. We temporarily suspended Kitty Clock to stem the flow of clones and bought back as many as possible. We managed to collect 367 of the 405 released. The rest were bought by users and should make nice, if unintentional, limited edition mementos. We’ve also tweaked some things behind the scenes to prevent this same issue from repeating itself.
For now, we plan to re-release the clone cats we bought back into the wild once they’ve been fully rehabilitated.
The technical details
For those interested in the technical details, this is what went down. The CryptoKitties database is divided into a master and replicas. All updates are made to the master database and then duplicated in the replicas. On Saturday a hiccup in that replication process caused the job that releases Gen 0 cats to produce the same cat over and over.
While we’re not completely certain how this happened, we suspect it involves one of our endpoints that runs a particular query. The query in question has taken longer and longer to execute as the number of Kitties has grown. We don’t run the query on every request since we cache its results, but a job still runs every ten minutes to refresh those cached results. The problem is that the query now takes more than ten minutes to complete, so each new run started before the previous one had finished. Soon the jobs running it were holding every available connection on the database, which drove up CPU usage.
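For the curious, here is a minimal sketch of one common way to keep a scheduled cache refresh from piling up on itself: skip a run if the previous one is still in flight. This is not our actual code. The `CachedQuery` class and the `run_query` callable are hypothetical names used only for illustration, and the sketch assumes a single process.

```python
import threading


class CachedQuery:
    """Caches the result of a slow query and refuses to run overlapping refreshes."""

    def __init__(self, run_query):
        # run_query is a hypothetical callable that executes the slow query.
        self._run_query = run_query
        self._lock = threading.Lock()  # held while a refresh is in flight
        self._value = None
        self.skipped = 0  # number of refreshes skipped because one was running

    def refresh(self):
        # Non-blocking acquire: if a refresh is already running, skip this
        # run instead of queuing up and holding another database connection.
        if not self._lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self._value = self._run_query()
            return True
        finally:
            self._lock.release()

    def get(self):
        # Readers always get the last successfully cached result.
        return self._value
```

With a guard like this, a scheduler can call `refresh()` every ten minutes and a slow query occupies at most one connection at a time. A setup with several workers or clusters would need a shared lock (for example, in the database itself) rather than an in-process one.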
The CPU problem first arose on Friday. Our database hit 50% CPU usage, so we killed all pending queries and reduced the job's schedule from every ten minutes to once a day. The problem persisted the next day on our China cluster, where the job was still running every ten minutes. We restored service at 4:00 PM by killing all queries again (don’t worry, they felt no pain) and preventing the job from running the faulty query again.
Our best guess is that all this stressed the master database and prevented it from replicating properly. We tweaked some things behind the curtain to address these issues and also added more logs to the Gen 0 release so it will raise alarms if things go wonky.
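As a rough illustration of the kind of safeguard a release job can carry, the sketch below refuses to release a Gen 0 cat whose genes match a recently released one, and logs an error so that monitoring can raise an alarm. It is an assumption-laden example, not our production code: `Gen0ReleaseGuard`, `DuplicateGenesError`, and the `history_size` parameter are all hypothetical names.

```python
import logging
from collections import deque

logger = logging.getLogger("gen0_release")


class DuplicateGenesError(Exception):
    """Raised when a release would repeat recently released genes."""


class Gen0ReleaseGuard:
    """Remembers the genes of recent releases and rejects repeats."""

    def __init__(self, history_size=100):
        # Bounded history: the oldest entry is dropped once the deque is full.
        self._recent = deque(maxlen=history_size)

    def check_and_record(self, genes):
        # Refuse to release if these genes were released recently.
        if genes in self._recent:
            logger.error("Refusing to release duplicate Gen 0 genes: %s", genes)
            raise DuplicateGenesError(genes)
        self._recent.append(genes)
```

The release job would call `check_and_record(genes)` just before each release. A raised `DuplicateGenesError` halts that release, and the logged error gives alerting something to key on.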
As always, thanks for your patience and understanding when it comes to issues like this. We’re getting better every day, but we couldn’t do it without the help and support of you cool cats.