Our Plan for Zero-Downtime Database Transitions with Rails: Part 2

6 min read · Mar 5, 2019

By Nathan Acuff, Jason Gladish, and Joel Meador

In our previous post, we outlined the requirements and an abstract solution for migrating from one database to another. In this post, we’ll start digging into the code and see how we can meet those requirements. We’ll also demonstrate some production instrumentation related to the code included in this post.

As a refresher, here are the requirements we set while implementing our database transition:

  1. Moving a database to a new server should go unnoticed by customers.
  2. Moving a database to a new server should be transparent to developers.
  3. Data in the new server should match data on the old server with little-to-no delay.
  4. All data that is moved should be easy to inspect for accuracy by those making the transition, before it is utilized.

Below, we refer to the database we are migrating from as the “old database” and the database we are migrating to as the “new database.” We describe the flow of data using “source” and “destination” (Spoiler: later, we’ll see the source and destination flip, so we can roll back if we want).

The Happy Path

We mentioned in the previous post how requirements #1, #2, and #3 can be accomplished by duplicating the data immediately after the source record is written. Let’s look at some simplified code:
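The sketch below stands in for the original snippet; aside from ActiveMigrator::ActiveMigratorSource, the names are illustrative rather than the library’s exact API.

```ruby
# Simplified sketch: the source model mixes in ActiveMigrator's source
# concern, and once a write to the old database commits, we kick off the
# copy to the destination database.
class Graph < ApplicationRecord
  include ActiveMigrator::ActiveMigratorSource

  # Illustrative callback name; fires after the source transaction commits.
  after_commit :update_active_migrator_destination
end
```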

This is the initial entry point. Our source record is saved, so we want to propagate changes to our destination database. That filters down to code that looks roughly like this:
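Again, a hedged sketch rather than the exact code: find_active_migrator_destination_model and transformed_attributes_for_active_migrator are the method names discussed below, while the surrounding names are illustrative.

```ruby
# Sketch of the data duplication. If ActiveMigrator is enabled for this
# record, work out whether the callback fired for a save or a destroy,
# then mirror that change on the destination database.
def update_active_migrator_destination!
  return unless active_migrator_enabled?  # illustrative enablement check

  if self.class.exists?(id)
    # The source record still exists, so this was a save: find or
    # initialize the destination copy and write the source data to it.
    destination = find_active_migrator_destination_model
    destination.assign_attributes(transformed_attributes_for_active_migrator)
    destination.save!
  else
    # The source record is gone, so this was a destroy: remove the
    # destination copy if one exists.
    destination = find_active_migrator_destination_model
    destination.destroy! if destination.persisted?
  end
end
```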

Here, we see the meat of the data duplication. If we have ActiveMigrator enabled for this record, we check to see if we just performed a save or a destroy. If the source record exists, it was a save, and we therefore need to propagate changes. Otherwise, it was a destroy, and we need to remove the destination record if it exists.

In the case of a save, we would find_active_migrator_destination_model. This would find_or_initialize a record for us to update on the destination database.

Here, we also see transformed_attributes_for_active_migrator, which allows us to define transformations from the source data to the destination. In our case, we’re just doing a direct copy, so nothing too interesting happens there at this point. In a future post, we’ll show an example of altering data as part of using ActiveMigrator.
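For now, the direct-copy version can be as simple as returning the source record’s attributes untouched; a minimal sketch:

```ruby
# Direct copy: no per-column transformation, just the source attributes.
def transformed_attributes_for_active_migrator
  attributes
end
```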

Once we’ve updated the destination record, we save it. Now, we have a copy of the source data in the destination database.

With the code shown above, our source model is just the plain old ActiveRecord model in our application. Let’s take a more detailed look at the special destination model that is part of ActiveMigrator:
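A hedged sketch of the destination side follows; aside from ActiveMigrator::ActiveMigratorSource itself, the class names and configuration here are assumptions based on the description below.

```ruby
module ActiveMigrator
  # Superclass for every destination ActiveRecord class: a plain container
  # for copied data that points at the new database and turns off behavior
  # (STI, timestamps, optimistic locking) that should come from the copied
  # source values instead.
  class DestinationBase < ActiveRecord::Base
    self.abstract_class = true

    self.inheritance_column  = nil    # no STI on the destination
    self.record_timestamps   = false  # timestamps are copied, not regenerated
    self.lock_optimistically = false  # lock columns are copied as plain data

    # Illustrative: point destination classes at a database.yml entry for
    # the new database.
    establish_connection :new_database
  end

  module ActiveMigratorSource
    # When the concern is included on a source model, define a matching
    # destination class (e.g. Graph::ActiveMigratorDestination) that stores
    # rows under the same table name in the new database.
    def self.included(source_class)
      destination_class = Class.new(DestinationBase) do
        self.table_name = source_class.table_name
      end
      source_class.const_set(:ActiveMigratorDestination, destination_class)
    end
  end
end
```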

Here, we see the superclass of all destination ActiveRecord classes. We also see the code that creates the destination class when ActiveMigrator::ActiveMigratorSource is included on the source model’s class.

The idea here is that the destination contains very little logic. It’s storing data copied from the source class and nothing more. It shouldn’t worry about STI, timestamps, or table-level locking. Those values will be copied from the source ActiveRecord object. These classes are concerned with where the data is being stored. We will therefore configure a database connection and a table name where the destination data can be stored.

The above code glosses over some details, but you can see the basic premise.

In the general case, this roughly satisfies our database transition requirements #1, #2, and #3.

Error Cases

What if something goes wrong while copying data? We wouldn’t want customers to be impacted, so we want to be very careful about how we proceed in the face of errors. Let’s take a look at the above code again, but let’s focus on more of the details:
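Once more, a sketch standing in for the embedded code, with the detail that matters called out in the comment:

```ruby
class Graph < ApplicationRecord
  include ActiveMigrator::ActiveMigratorSource

  # after_commit, not after_save or before_save: the write to the old
  # database has fully committed before we touch the new one. Note also
  # that the callback names a method without a trailing "!".
  after_commit :update_active_migrator_destination
end
```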

Notice that this is an after_commit, rather than an after_save or even a before_save. The database transaction for the source update has therefore already been committed to the old database before we even start to interact with the new database: the data is immediately available for other processes to read, and no issue related to the new database can impact the source update.

You may have noticed the method signatures above didn’t quite match (hint: the “!”). Here’s the missing method:
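Here is a sketch of that wrapper; the error-reporting call is a placeholder for whichever exception-handling service is wired up.

```ruby
# The callback invokes the non-bang method, which wraps the bang version so
# that a failure against the destination database never reaches the user.
def update_active_migrator_destination
  update_active_migrator_destination!
rescue StandardError => e
  # Placeholder reporting call: send the error along with enough context to
  # identify which source record failed to copy.
  ErrorReporter.notify(e, source_class: self.class.name, source_id: id)
end
```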

We want to be very sure that anything that goes wrong when interacting with the destination database doesn’t impact the application. Also, we want to make it easier to investigate issues by sending some additional information to our exception-handling service.

Conflict Resolution

Say two users are using Instrumental while we’re copying the database data related to graphs. What happens if they both edit which metrics are shown on the graph at the same time? Well, user A’s edit could be applied to the source database, and then, user B’s edit could be applied to the source and the destination database, followed by user A’s edit being applied to the destination database. This would result in the source database having user B’s edit, and the destination database having user A’s edit. They wouldn’t match, leading to errors when we switch to using the new database.

To solve this problem, we use a lock to ensure that database updates happen serially rather than in parallel as described above. The order of operations looks like this (a rough sketch follows the list):

  1. Before saving a source record, acquire the lock so no other edits impact this record
  2. If this save fails, release the lock since nothing was updated
  3. The after_commit we talked about above copies the source data to the destination
  4. An after_commit on the destination record releases the lock
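To make that ordering concrete, here is a rough, hypothetical sketch; none of the lock class or key names below come from ActiveMigrator, and the real code is spread across more callbacks than this.

```ruby
class Graph < ApplicationRecord
  include ActiveMigrator::ActiveMigratorSource

  before_save    :acquire_active_migrator_lock        # 1. lock out other edits to this record
  after_rollback :release_active_migrator_lock        # 2. the save failed, so nothing was copied
  after_commit   :update_active_migrator_destination  # 3. copy the committed data to the destination

  private

  def acquire_active_migrator_lock
    # Hypothetical distributed lock, keyed per source record.
    ActiveMigratorLock.acquire("graphs/#{id}")
  end

  def release_active_migrator_lock
    ActiveMigratorLock.release("graphs/#{id}")
  end
end

# 4. The lock is released by an after_commit on the destination record, once
#    its copy of the data has been committed to the new database.
```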

Because of the multiple moving pieces and additional error handling, the lock code ends up being a bit spread out. We’re in the process of pulling ActiveMigrator out of Instrumental and into its own repo. In a future post, we’ll link to that repo with the full source, so you’ll be able to see the nitty gritty details of how the locks prevent race conditions, as well as how we handle the possibility of failures in the distributed locking system.

Testing in Production

Now, let’s see what it looks like to do a quick test in production to make sure things work at a basic level.

We deploy all of this in a disabled state to start. This allows us to make sure we haven’t made any errors unrelated to ActiveMigrator. Then, we slowly enable live data copying for increasing percentages of our database records:

We’re using Instrumental to monitor itself here, and we see the process of ramping up data copying from 0% of records to 100% of records. The gray vertical lines are Instrumental notices marking each change in the percentage of records being copied: 1%, 5%, 25%, 30%, 50%, 75%, and 100%. The left axis is a scaled representation of the number of records, and as expected, it roughly matches the percentage to which we have the lever set.
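Purely as an illustrative sketch (MigrationSettings and the modulo scheme are placeholder assumptions), a percentage gate along these lines could decide which source records participate in live copying; it also plays the role of the active_migrator_enabled? check used in the propagation sketch above.

```ruby
# Hypothetical percentage gate: raising the configured percentage widens
# the band of record ids whose changes are copied to the new database.
def active_migrator_enabled?
  (id % 100) < MigrationSettings.copy_percentage  # MigrationSettings is a placeholder
end
```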

This ramp up serves two purposes. First, we make sure that everything functions as expected. At 1%, enough data is being processed that we’ll be able to see any exceptions if they occur, but there should be no significant impact on the app.

Second, once basic verification is complete, we continue to ramp up while making sure there are no unexpected performance impacts on our application. We increase the percentage of data we’re copying and let things run for a few minutes, verifying that all of our other application metrics are within norms. Once we get to 100% and everything looks good, we’re confident that realtime updates to the system are being propagated to the destination database with no impact.

Coming Soon

We’ll talk more about production verification in the future, along with how we handle old database data, configuration, record conflict resolution, and more. See you next time!
