Migrating from on premise MySQL to Amazon Aurora with minimal downtime

Image from http://2ndwatch.com/blog/benchmarking-amazon-aurora/

Zoosk, like most other webapps, relies heavily on databases. More than 40 separate clusters of them, in fact. One of the first tasks we had as part of a migration to AWS (on a tight deadline) was to determine how to migrate to Amazon Aurora with minimal downtime.

Any production database installation should have a proper backup strategy. Luckily, our on premise MySQL installation used Percona’s Xtrabackup, which Amazon supports as a backup format to import into Aurora! We were able to easily tweak our existing backup process to use a supported file format, compressed xbstream. These backups were uploaded into a special S3 bucket along with a text file containing the Xtrabackup log line (see below) that has the replication position at the time of the backup, as our backups are taken from a replication slave.

innobackupex: MySQL slave binlog position: master host '1.2.3.4', filename 'mysql-bin.004270', position 275567830.

We discovered two main caveats with the S3 import process:

  1. Stored procedures
    Stored procedures were not imported into Aurora. We used mysqldump to dump our stored procedures and imported them using the mysql client.
  2. Users
    Some of our database users and passwords were not being imported into Aurora. We used pt-show-grants to dump users to import using the mysql client.

We developed a set of Python tools that we used to manage the process of creating new databases in Aurora. One tool was used to start the import from S3 process and provision the databases with standard settings. Another tool was used to setup replication from our on premise database. We set Aurora up as a slave of our on premise database using the mysql.rds_set_external_master stored procedure available in Aurora. This tool read and parsed the replication coordinates from the Xtrabackup log line mentioned earlier. We also needed to configure our on premise databases to ignore replication of some RDS specific tables.

Next, we created a script to “flip” database masters between our datacenter and Aurora. This script has a few main objectives:

  1. Update DNS records to reflect the new master (in multiple zones)
  2. Ensure no writes can happen on the old master (read-only mode)
  3. Ensure the new master can accept writes
  4. Be able to “flip” back and forth from Aurora/on premise
  5. Most importantly: ensure no data loss 😃

To do this, we carefully studied our on premise MySQL master management solution: MHA, which takes many steps to ensure that data is not lost when transitioning between masters. We were able to emulate much of its functionality, except for a few key caveats:

  1. Read-only mode
    Since Aurora’s master/root database user does not have SUPER privileges, we could not modify global variables such as READ_ONLY . To work around this, we switched out the security group on the Aurora cluster to be a temporary restricted group that only the host running the script could access, which has the same effect on our applications.
  2. “Assassinating” processes
    One of the key steps in MHA is to kill running MySQL connections (in our case, using pt-kill) after a short wait once the former master has gone read-only. This requires SUPER privileges, too. We worked around this by running pt-kill for each user that shows up in the process list while connected as that user.

The process that our replication “flip” script used looks roughly like this, given that A is the on premise master and B is the Aurora master:

  1. Flush tables on A
  2. Set both A and B to read-only
  3. Kill all processes on A
  4. Flush tables on A
  5. Wait for replication to be in sync between A and B
  6. Stop the slave on B
  7. Switch DNS records to reflect B as the master
  8. Unset read-only on B
  9. Reset slave on A
  10. Change master on A to B, start slave
  11. Reset slave on B

Using these tools, we were able to conduct tests with live production traffic and then switch back without data loss. We were also able to isolate these tests to a subset of our users, thanks to our horizontally sharded databases.

This testing enabled us to gain the confidence necessary to move on with our master AWS migration plan, which involved an “all-at-once” traffic shift.