Maintenance Post-mortem

Alex DiCarlo
Published in NEO Tracker
5 min read · Feb 28, 2018

On February 22nd we had planned downtime for NEO Tracker. We needed to resynchronize the database in order to fix a long-standing issue with token transfers. In short, transfers appeared successful on NEO Tracker when they had actually failed on the blockchain, leading to incorrect token balances on affected addresses. Rather than show out-of-date and missing data on NEO Tracker while we resynchronized, we chose to take down the website for 10 hours. We gave advance notice in case people expected that they might need to move funds during the downtime. We tested the resynchronization process locally and it took around 2 hours on a single machine, but given the additional latency of interacting with a remote database, we assumed it would take longer in production. We also wanted some leeway in case things went sideways, so we announced 10 hours for the maintenance. And boy, did things go sideways. So what happened?

Fire, Fire Everywhere

The hiccups began almost immediately when we started the maintenance. Here’s a brief list of what went wrong:

  1. The build system failed, so we couldn’t actually get a maintenance page up in a timely manner.
  2. NEO nodes failed to bootstrap, so we had to re-upload a backup database.
  3. NEO nodes failed to connect to peers due to a bug, so they struggled to stay up to date.
  4. After resynchronizing, the database was corrupted and we couldn’t sync new blocks.
  5. The backups of the database were also corrupted and it took time to find one that was not corrupted.
  6. Web servers couldn’t respond with the maintenance page because the database was down.
  7. Once we resynchronized, the database backup process failed and we couldn’t bring up the replica databases.
  8. SSL certificate renewal went haywire because the web servers were not responding, causing us to be rate-limited and unable to renew the certificate.
  9. Throughout all this, we were dealing with a 1 MB/s upload speed, meaning every new build locally and every backup or bootstrap data that we had to upload took excruciatingly long.

And that’s just the high level; each of those bullet points could be broken down into more individual issues that we faced at each step. We definitely proved Murphy’s law, “Anything that can go wrong will go wrong”, to be very, very true.
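Item 6 in the list is worth a small illustration. The following is only a minimal sketch in Python, not the actual NEO Tracker server code: `handle_request`, `MAINTENANCE_HTML` and the `query_database` callback are hypothetical names invented for this example. The idea it shows is that the maintenance page should be served without ever touching the database, and that a database failure should fall back to the same static page.

```python
from typing import Callable, Tuple

# Hypothetical static page; it is held in memory so serving it needs no database.
MAINTENANCE_HTML = "<html><body><h1>Down for maintenance</h1></body></html>"


def handle_request(
    maintenance_mode: bool,
    query_database: Callable[[], str],
) -> Tuple[int, str]:
    """Return (status_code, body) for a page request."""
    # Check the maintenance flag first, before any database access.
    if maintenance_mode:
        return 503, MAINTENANCE_HTML
    try:
        return 200, query_database()
    except Exception:
        # Database unreachable: fall back to the static page instead of failing.
        return 503, MAINTENANCE_HTML
```

With a handler shaped like this, a request made while the database is down still receives the maintenance page with a 503 status, rather than no response at all.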

Aftermath

We were down for almost 3 days, an unacceptable amount of time for a service that so many people have come to rely upon. We were humbled by the many supportive messages we received during the downtime and afterwards across social media. With that said, we know that we hurt not only your trust in NEO Tracker as an integral part of the NEO ecosystem, but in some ways hurt the NEO ecosystem’s reputation itself. We hope that you will give us another chance and let us earn that trust back.

We will post a full timeline for NEO Tracker development soon, but in the meantime, here are some concrete follow-ups to ensure that this never happens again.

  1. Migrate to a battle-tested continuous deployment system, CircleCI, so that we exercise our builds and deployments more frequently, eliminating the chance of surprise failures when we need to do an update.
  2. Bring up an entire mirror cluster of the NEO Tracker MainNet service under https://beta.neotracker.io where we can continuously deploy changes and verify everything works as expected prior to updating the production service.
  3. Double down on stability and reliability fixes across our infrastructure, in particular the database, by instrumenting our code with error reporting to Raven.
  4. Test and re-test our certificate renewal process to ensure that it behaves correctly under all scenarios.
  5. Test the beta cluster under various failure conditions by randomly taking out pieces of the infrastructure to verify that the website still functions normally for users.
  6. Add stability & reliability monitoring and metrics to quantify the impact we have and track our progress.

In addition, in the short-term we’re working through some urgent database stability issues, but we hope to have that resolved soon with minimal downtime.
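To make the certificate renewal follow-up more concrete, here is a rough sketch in Python of retrying a renewal with capped exponential backoff, so that repeated failures do not hammer the certificate authority and trigger rate limits. This is an illustration under assumptions, not our renewal code: `renewal_retry_delays`, `renew_with_backoff`, the `renew` and `sleep` callbacks, and the delay values are all hypothetical placeholders, and real rate limits depend on the certificate authority.

```python
from typing import Callable, List


def renewal_retry_delays(
    max_attempts: int,
    base_seconds: float = 60.0,
    cap_seconds: float = 3600.0,
) -> List[float]:
    """Delay to wait after each failed attempt: doubles each time, up to a cap."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(max_attempts)]


def renew_with_backoff(
    renew: Callable[[], bool],
    sleep: Callable[[float], None],
    max_attempts: int = 5,
) -> bool:
    """Call renew() until it succeeds or max_attempts is reached."""
    delays = renewal_retry_delays(max_attempts)
    for attempt, delay in enumerate(delays):
        if renew():
            return True
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(delay)
    return False
```

Passing `renew` and `sleep` in as callbacks keeps the retry logic testable without a real certificate authority or real waiting, which is the kind of "test and re-test" the follow-up calls for.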

How can you help?

We have a long list of items to improve NEO Tracker beyond the follow-ups above, yet we’re currently just a small team of 2 people. If you’d like to see NEO Tracker improved with better stability, better reliability and more features, and would like to help, there are many ways to get involved, whether or not you code.

Donate

We are not funded by City of Zion or NEO, we show no ads, and we provide a completely free service. We are currently funded entirely by donations. Whether you donate a drop of GAS or a full NEO, every amount counts and will be used to hire more developers and scale out our infrastructure. Donate NEO, GAS or NEP-5 tokens to AKDVzYGLczmykdtRaejgvWeZrvdkVEvQ1X.

Contribute

Are you a developer with a little bit of free time on your hands? Interested in contributing to an open source project? We’re always looking for passionate contributors! NEO Tracker will be open sourced soon and we’ll have plenty of issues for you to tackle. If you’d like to get started now, contributing to NEO•ONE would have an immense impact. NEO Tracker runs NEO•ONE nodes as part of our infrastructure, and our most pressing need is for a large suite of tests to verify all of the node packages. This maintenance period was the direct result of a bug in the NEO•ONE nodes, so more testing will help ensure we don’t run into another issue in the future. Interested in contributing but don’t know where to start? Come chat with us in the NEO•ONE Discord.

We’re Hiring

Are you a full-stack engineer who’s passionate about building applications end to end, from the front-end UI that users experience, to the back-end infrastructure that powers it, to the release process that manages it all? Are you interested in developing blockchain applications? Are you located in or willing to relocate to the Seattle, Washington area? Or perhaps you know someone who is? If working on NEO Tracker sounds interesting to you or someone you know, please shoot us an email with a resume at contact@neotracker.io. Even better if you get to know us by first contributing to NEO Tracker or NEO•ONE.

Closing

We will post frequent status updates on our progress to keep you in the loop, so be sure to follow us on Twitter and like us on Facebook. We are going to work hard to earn your trust back.

P.S. — Check out our new blog. We’ll also cross-post here on Medium.

Alex DiCarlo
NEO Tracker

NEO Tracker founder. NEO•ONE creator and lead developer.