How We Auto-Verify MySQL Backups Nightly at Next Big Sound

Zoh Rothberg
Making Next Big Sound
Mar 7, 2017
http://www.gettyimages.com/license/471112275

A brief note on (unintentional) timeliness: I actually began this short article a few months ago, to provide a high-level overview of the system we use at Next Big Sound for MySQL restore and verification. It’s a coincidence that it’s being published shortly after the GitLab production data loss incident, and I sincerely hope this post does not read as a critical response to their situation. Even in the best of shops and infrastructures, incidents still happen.

I’m a paranoid person. As such, I make a decent systems engineer. By the nature of our work, systems folks not only create new systems, but also maintain those systems for their lifespan. Nothing teaches you the art of anal-retentive planning like knowing you will inevitably have to clean up some sort of catastrophic failure.

True to this anxious personality profile, the first thing I usually do upon coming into a new company is ask about the essential data being stored, then look at how that data is backed up, and how it is recovered. I can’t sleep comfortably at night (literally, I’ll be paged) until I know that the essential data our team manages is safe, verified, and ready to recover if an incident were to happen.

There are many different ways to back up your data, but no matter which way you choose and which way suits your organization best, it is always true that you should test and verify essential backups in an automated way.

Summary of Our System

With these sorts of thoughts in mind, I recently rebuilt our MySQL backup and restore infrastructure to include automated nightly restore and verification for our MySQL data using Percona XtraBackup. We are able to take these nightly backups without much stress on the live system, and without having to bring our database down.

Here is a brief overview of our newly redesigned MySQL backup system:

  • Nightly Backups: every night, a cron job runs a wrapper bash script that uses Percona XtraBackup to back up our production MySQL servers without taking them offline. The backup is written locally, then rsynced to network-attached storage for safekeeping. The script fires off a Nagios alert if the backup fails in any way (or if the process doesn’t check in within an expected period of time).
  • Nightly Restore: a second wrapper bash script runs on a separate, dedicated restore MySQL server, identical to production in everything but size and replication status, and restores the most recent backup. After restoring, the script runs mysqlcheck on all the tables as an extra verification. It fires off a Nagios alert if the restore fails in any way (or if the process doesn’t check in within an expected amount of time).
  • ZFS Backups: as an extra convenience, our MySQL instances reside on a ZFS filesystem, so raw filesystem-level snapshots of the database drive are possible. We take these every four hours and store them locally. These raw snapshots are intended as a “nice to have, but not necessary” backup/restore option, and are not verified like the nightly backups (though we may choose to verify them at a later point, just because we can). Note that if you are using InnoDB as your MySQL storage engine, you don’t need to flush tables or lock the database in any way before taking a snapshot; just make sure to include the MySQL transaction logs in your snapshot as well.
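To make the nightly backup and restore steps a little more concrete, here is a minimal sketch of the wrapper pattern. Our real wrappers are bash scripts; this is an illustrative Python rendering, not our production code. The paths, the host and service names, the `systemctl` service name, and the choice to run `xtrabackup --prepare` on the restore host are all assumptions made for the example. The Nagios piece only formats a passive check result line; how that line reaches your Nagios server (external command file, NSCA, or something else) depends on your setup.

```python
import subprocess
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin return codes


def backup_commands(target_dir, nas_dest):
    """Nightly backup: hot copy with XtraBackup, then rsync to network storage."""
    return [
        ["xtrabackup", "--backup", f"--target-dir={target_dir}"],
        ["rsync", "-a", f"{target_dir}/", nas_dest],
    ]


def restore_commands(backup_dir, datadir="/var/lib/mysql"):
    """Nightly restore on the dedicated restore host, then check every table.

    Assumptions: the backup is prepared here rather than on production, the
    service is named "mysql", and the datadir has already been emptied
    (xtrabackup --copy-back refuses to write into a non-empty datadir).
    """
    return [
        ["xtrabackup", "--prepare", f"--target-dir={backup_dir}"],
        ["systemctl", "stop", "mysql"],
        ["xtrabackup", "--copy-back", f"--target-dir={backup_dir}"],
        ["chown", "-R", "mysql:mysql", datadir],
        ["systemctl", "start", "mysql"],
        ["mysqlcheck", "--all-databases", "--check"],
    ]


def run_steps(commands, run=subprocess.run):
    """Run commands in order and stop at the first failure.

    Returns (ok, message). `run` is injectable so the flow can be exercised
    without touching a real database.
    """
    for cmd in commands:
        try:
            result = run(cmd)
        except OSError as exc:  # for example, the binary is not installed
            return False, f"{cmd[0]} could not start: {exc}"
        if result.returncode != 0:
            return False, f"{cmd[0]} exited with {result.returncode}"
    return True, f"{len(commands)} steps completed"


def nagios_passive_result(host, service, ok, message, now=None):
    """Format a Nagios PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    code = OK if ok else CRITICAL
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{message}"
    )
```

A nightly cron entry would call something like `run_steps(backup_commands("/backups/latest", "nas:/mysql/"))` and submit the resulting line to Nagios whether the run succeeded or failed. Reporting success matters as much as reporting failure: if the service has a freshness threshold configured, a script that never checks in raises an alert by itself, which covers the “process never ran at all” case.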

In designing this system, as with all backups, it was critical to determine an ideal backup frequency. After talking with other members of Next Big Sound, we decided that a nightly backup would be suitable for us to recover from in the event of a catastrophic failure. Because they are more or less painless to set up, we decided to also include ZFS snapshots as a “better than expected” convenience in case we ever needed to take a backup from them, but not to worry for now about automatically backing up, restoring, and verifying actual backups from these snapshots (that can be an incremental improvement).

When designing your particular backup system, you will also need to think about how best to address your infrastructure’s specific needs. You should not do this in a vacuum: you should talk to other people in your organization, not just managers but developers, user experience folks, etc. Nobody likes talking about potential catastrophes, but it’s better to discuss these uncomfortable topics out in the open. Ask them not only what data you are legally responsible for, but what sort of data they want to keep, that would make their jobs easier. You will hopefully be able to eventually create a system that works for everyone, and that everyone is aware of, making difficult misunderstandings at the time of an incident less likely.

Future Improvements / Current Downsides

Our current system could be even better:

  • Could we also automatically create a backup from the four-hour ZFS snapshots, then restore and verify it?
  • Could we introduce improved verification over our current restore-and-mysqlcheck method? Perhaps this would mean standing up a version of the Next Big Sound site against the restored database, and running a battery of automated tests to ensure that the restored site works as expected.

These are some improvements I hope to investigate this year. However, even with a minimal set of features, our basic restore-and-verification system was easy and quick to set up, and (thus far) has run effortlessly in the background. Verifying backups is an easy change that will add value to your infrastructure — I hope this brief missive has helped advocate for verifying at least one layer of MySQL backups in your organization, whenever possible.
