Why does our database fall out of sync?

CryptoKitties
4 min read · Mar 13, 2018

There’s nothing worse for a developer than the phrase “known bug”. It means we’ve identified a problem, know exactly what’s causing it and why, and still haven’t come up with a solution. It’s the Rubik’s cube with a missing colored square, the Gordian knot, the ball of yarn with three ends…

You may have noticed that once in a while, your transactions don’t appear in your transaction log, and nothing changes on the website. Then, the next day, everything is back to business as usual. This is because of a problem with Geth, and it’s something we’re working hard to solve.

What the heck is Geth?

Every active participant in the Ethereum network (what we call a node) currently has to choose between running Geth or Parity. You need one of these programs if you’re going to mine, submit blocks, or keep track of the network.

If it’s not mining, a node has one job: to talk to the other nodes on the network about the state of the blockchain. When you submit a transaction to the Ethereum blockchain (either through a wallet or other methods), the transaction is ultimately received and processed by computers running Geth or Parity; these programs serve as our gateway to the Ethereum network.

Parity and Geth can also notify clients of events that are emitted when smart contracts execute, which is something we needed. When we started CryptoKitties, we couldn’t find hosted solutions that supported event subscriptions, so we were forced to host our own nodes; we chose to use Geth to power our nodes. It’s the official choice of Ethereum; plus, Axiom Zen is very familiar with Go, the programming language Geth uses.
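
To make "event subscriptions" a little more concrete, here is a minimal sketch of the kind of request a client sends to a node to ask for contract events. It assumes the node exposes the standard JSON-RPC `eth_subscribe` method over a WebSocket connection; the helper name `build_log_subscription` and the placeholder address are hypothetical and only there for illustration.

```python
import json


def build_log_subscription(contract_address, request_id=1):
    """Build a JSON-RPC payload asking a node to push logs from one contract.

    The payload would be sent over an open WebSocket connection to the node;
    opening that connection is out of scope for this sketch.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_subscribe",
        # "logs" selects event logs; the filter narrows them to one contract.
        "params": ["logs", {"address": contract_address}],
    })


# Placeholder address (all zeros), not a real contract.
PLACEHOLDER_ADDRESS = "0x" + "00" * 20
payload = build_log_subscription(PLACEHOLDER_ADDRESS)
```

Once a node accepts a subscription like this, it pushes matching events to the client as new blocks arrive, which is why the node has to keep up with the chain for the events to be timely.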

If you have Geth, why do you need a database?

In the Ethereum world, you’re not supposed to read directly from the blockchain, because it causes an extra burden on the nodes.

So, as good Ethereum citizens, we have a database that stores information we gather from the blockchain. The website accesses the database instead of the network so it doesn’t slow things down. We run our own Geth nodes, monitor them for events and state changes, and store those events in a database.

This makes sure we don’t unnecessarily slow down the network. But if our Geth somehow gets behind, if it doesn’t update in real time, then we don’t capture those events, and our database doesn’t update. That means users submit transactions to the blockchain, but the website doesn’t reflect the current state of the blockchain (aka reality).
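
The relationship between events and the database can be sketched in a few lines. This is a toy model, not the actual CryptoKitties pipeline: the "database" is a dictionary, and the event fields `block`, `kitty_id`, and `to` are hypothetical names chosen for the example.

```python
def apply_events(db, events):
    """Apply ownership-transfer events to an in-memory stand-in for a database.

    db["owners"] maps a kitty id to its owner; db["last_block"] records the
    block of the most recent event that has been applied.
    """
    # Process events in block order so later transfers win.
    for event in sorted(events, key=lambda e: e["block"]):
        db["owners"][event["kitty_id"]] = event["to"]
        db["last_block"] = max(db["last_block"], event["block"])
    return db


def missed_blocks(db, chain_head):
    """Number of blocks between the last applied event and the chain head."""
    return max(0, chain_head - db["last_block"])


db = {"owners": {}, "last_block": 0}
apply_events(db, [{"block": 100, "kitty_id": 7, "to": "alice"}])
# If the node lags and never delivers a later transfer of kitty 7,
# the database keeps saying "alice" while the blockchain says otherwise.
```

The website only ever sees `db`, so any event the node fails to deliver is invisible to users until the node catches up or is replaced.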

How does Geth get behind?

In the early days, Geth’s job was pretty easy; you could turn even the cheapest laptop into an active mining node. These days, network usage has skyrocketed, which means Geth has a lot more to keep up with.

Think about it this way: In the early days, any laptop could run Geth and become a node on the Ethereum network. Today, you need high-performance SSDs, fast CPUs or GPUs, and the fastest internet you can find; and even then, Geth can’t always keep up with the traffic. It’s becoming the case that you practically need server infrastructure to run an efficient node, which is basically centralization, but that’s an article for another day!

Every once in a while, our Geth node gets overwhelmed and starts to fall behind. Once it does, there’s no hope: it never seems to be able to “catch up” again. The only solution is to start a new node, get it up to date, and kill the old node.

Luckily, Geth has a feature for getting a new node up to date quickly. It’s called a “fast sync”, and at the time of writing, it only uses 50 GB of data. For a “fast” sync, though, the process is incredibly slow. They’ve just released a newer fast sync option, which will mitigate the problem a little, but the old one takes even a beefy computer around 24 hours to bring a new node up to date. That leaves us scrambling with an out-of-date database for a full day.

Not an ideal solution.

So what are you doing about it?

We’ve set up a system of alerts and monitors that warn us when our Geth node gets out of date. That lets us move more quickly to boot up a new one, but it’s not a perfect solution. We don’t have a system that always keeps a fast-synced node ready to go, and even if we did, it would be a band-aid.
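
One way such a monitor could work is sketched below. It is an illustration under stated assumptions, not a description of the real alerting setup: it assumes the node serves the standard JSON-RPC API over HTTP (Geth's default port is 8545), and the helper names and the 120-second threshold are hypothetical choices made for the example.

```python
import json
import time
import urllib.request


def rpc_call(url, method, params):
    """Send one JSON-RPC request over HTTP and return its "result" field."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["result"]


def is_stale(timestamp_hex, now, max_age_seconds=120):
    """True if the node's newest block is older than the allowed age.

    Block timestamps come back as hex strings of Unix seconds. The default
    threshold is an assumption, not a recommended value.
    """
    return now - int(timestamp_hex, 16) > max_age_seconds


def blocks_behind(local_head_hex, reference_head_hex):
    """How far the local node's head trails a reference node's head."""
    return max(0, int(reference_head_hex, 16) - int(local_head_hex, 16))


def check_node(url="http://localhost:8545"):
    """Fetch the latest block from a node and report whether it looks stale."""
    block = rpc_call(url, "eth_getBlockByNumber", ["latest", False])
    return is_stale(block["timestamp"], time.time())
```

A scheduler could call `check_node()` every minute and raise an alert when it returns `True`. Comparing block numbers from two independent nodes with `blocks_behind` gives a second signal that does not depend on the local clock.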

Unfortunately, a more robust fix would involve fixing Geth itself. And while that’s something we would love to do (not just for the solution for ourselves, but for the good it would do the entire community), it’s not something we can prioritize right now. We’re exploring some exciting options — INFURA just added functionality for monitoring events, for instance, so we may be able to move away from needing our own nodes.

But for now, we’re stuck with a “known bug” — a thorn in our paw, but one that we now know how to fix. One Kitty at a time.

Collect and breed digital cats with CryptoKitties, the world’s most successful blockchain game — built on the Ethereum network.