Our Plan for Zero-Downtime Database Transitions with Rails: Part 1
Database migrations are causing scheduled downtime. You’re working on a Saturday morning. Customers are getting alerts about service disruptions out of their control. These problems are unnecessary and avoidable.
With an upcoming Postgres upgrade on Instrumental, we obviously want to avoid these types of problems. There are plenty of tools that claim to alleviate the pain of database transitions, and some of them even do…mostly.
Fortunately, we already have an in-house tool we’ve dubbed ActiveMigrator. It uses Rails’ ActiveRecord to perform zero-downtime, database-agnostic migrations. We’ll share source code and performance graphs in a future post. First, we want to share our requirements for a migration tool and how ActiveMigrator meets them:
- Moving a database to a new server should be unnoticed by customers.
- Moving a database to a new server should be transparent to developers.
- Data in the new server should match data on the old server with zero or minimal delay.
- All data that is moved should be inspectable and verifiable for correctness before a switchover by the people responsible for the database move.
How does ActiveMigrator fulfill these requirements?
Moving a database to a new server should be unnoticed by customers
When moving a database, we don’t want to alarm customers, and we never want to disrupt their business with our business if we can avoid it. This means we should have minimal performance impact from the move, and we shouldn’t have noticeably reduced functionality during a transition.
We need data to be moved quickly and transparently. We want to deploy new code that uses the new database, and for the impact to be no more than any regular deploy. This means all the data must be in the new server’s database by the time the deploy finishes.
We achieve this by duplicate writing: as soon as the source data has been committed, we write the same data to the new database-in-process.
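The duplicate-write idea can be sketched in plain Ruby. This is a minimal illustration, not ActiveMigrator's code: `FakeStore` and `DuplicateWriter` are hypothetical names, and the in-memory hashes stand in for real database connections.

```ruby
# Hypothetical in-memory stand-in for a database connection.
class FakeStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  def write(id, attrs)
    @rows[id] = attrs.dup
  end
end

# Writes to the source first, then mirrors the same data to the destination.
class DuplicateWriter
  def initialize(source:, destination:)
    @source = source
    @destination = destination
  end

  def write(id, attrs)
    @source.write(id, attrs)       # the write the application depends on
    @destination.write(id, attrs)  # reached only if the source write did not raise
  end
end

old_db = FakeStore.new
new_db = FakeStore.new
writer = DuplicateWriter.new(source: old_db, destination: new_db)
writer.write(1, { name: "cpu.load" })
```

In a Rails application the mirroring step might hang off an ActiveRecord `after_commit` callback; that is a guess at the mechanism, not something stated here.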
Another part of the first goal is that it should be easy to test the new database without impacting customers. We use a small class we wrote called Levers, which allows us to tune various parts of our application in real time. With our Levers implementation, we store the percentage of records we would like to transition at any given time. When we deploy, this is set to 0% so none of the transition code will run. This should have no impact on our application whatsoever. Then we can turn this up to 1% and verify the transition code can record data in the new database.
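A percentage lever of this kind might look roughly like the sketch below. The real Levers class is not shown in this post, so `TransitionLever` and its methods are hypothetical, and choosing records by hashing their id into one of 100 buckets is an assumption made for the example.

```ruby
require "zlib"

# Hypothetical stand-in for Levers: holds a percentage that can be changed at runtime.
class TransitionLever
  attr_accessor :percentage

  def initialize(percentage = 0)
    @percentage = percentage
  end

  # A record is selected when its bucket (0..99) falls below the percentage.
  # Hashing the id keeps the decision stable for a given record.
  def transition?(record_id)
    Zlib.crc32(record_id.to_s) % 100 < percentage
  end
end

lever = TransitionLever.new   # deployed at 0%: no transition code runs
lever.percentage = 1          # later: roughly 1 in 100 records is mirrored
```

Because the bucket for a record never changes, raising the percentage only adds records to the transition set; it never removes ones that were already being mirrored.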
If something has gone wrong with the setup of the new database, there might be exceptions at this point. Since we don’t want to impact customers, and our application doesn’t actually rely on the new database, we catch all exceptions related to the new database and report them to our exception handling service.
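One way to keep new-database failures away from customers is to wrap every mirrored write in a guard that reports and swallows errors. The sketch below assumes a hypothetical `ErrorReporter` in place of a real exception-service client, and `mirror_safely` is an illustrative name.

```ruby
# Hypothetical reporter standing in for an exception-handling service client.
class ErrorReporter
  attr_reader :reported

  def initialize
    @reported = []
  end

  def report(error)
    @reported << error
  end
end

# Runs the block that touches the new database.
# Any StandardError is reported and swallowed, so the caller never sees it.
def mirror_safely(reporter)
  yield
  true
rescue StandardError => e
  reporter.report(e)
  false
end

reporter = ErrorReporter.new
ok = mirror_safely(reporter) { raise IOError, "new database unreachable" }
```

The boolean return value is a convenience for the example; the important property is that the failure is recorded without interrupting the request.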
Now we can turn up the transfer percentage in order to verify the write load on the new database. We ramp it up until we get to 100%, and now we know the new database can handle the current application write load.
On the off-chance something goes wrong with the transition deploy (where we make the new database server the primary server), we want an easy path to roll back the change in order to keep any impact at a minimum.
To solve this problem, the transition deploy will flip the databases we’re using. Instead of writing updates to the old database first and then copying them to the new database, we write to the new database first and copy to the old database. This means that if we need to deploy the old code for any reason, all of the data will exist on the old database and the application should continue to run normally.
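The flip can be pictured as swapping the roles of the two databases while leaving the mirroring in place. This sketch uses hypothetical `MemoryDb` and `MirroredWriter` classes with in-memory storage; it only illustrates the role swap.

```ruby
# Hypothetical in-memory stand-in for a database connection.
class MemoryDb
  attr_reader :name, :rows

  def initialize(name)
    @name = name
    @rows = {}
  end

  def write(id, attrs)
    @rows[id] = attrs.dup
  end
end

# Writes to the primary first, then copies the same data to the mirror.
class MirroredWriter
  attr_reader :primary, :mirror

  def initialize(primary:, mirror:)
    @primary = primary
    @mirror = mirror
  end

  def write(id, attrs)
    primary.write(id, attrs)
    mirror.write(id, attrs)
  end

  # Returns a writer with the two roles reversed.
  def flipped
    self.class.new(primary: mirror, mirror: primary)
  end
end

old_db = MemoryDb.new("old")
new_db = MemoryDb.new("new")
before_switch = MirroredWriter.new(primary: old_db, mirror: new_db)
after_switch = before_switch.flipped   # new database is now written first
after_switch.write(7, { plan: "pro" })
```

After the flip, the old database still receives every write, which is what makes redeploying the old code safe.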
Moving a database to a new server should be transparent to developers
When a developer wants to change the existing database, or deploy code changes that alter data in some way, it shouldn’t cause operational problems for them, or for the developers in charge of a database migration.
Duplicate writing allows developer changes to the data to be propagated to the new database immediately, with no specific handling for those changes necessary.
Data in the new server should match data on the old server with zero or minimal delay
When the server has to be switched over, all the data on the new server should match the data that customers submitted. That means there can be little to no delay between primary data writes and writes to the new server.
We need to worry about all the old data in the system that isn’t being regularly updated, and therefore isn’t getting propagated to the new database. For this, we use a job system to queue up jobs that transfer all the old data.
Since the database may contain a huge number of records, these jobs will run in batches. For example, you might have each job handle 5,000 records, so 1 million records result in only 200 jobs instead of 1 million.
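The batch arithmetic is easy to check with a few lines of Ruby. The helper below is illustrative only: `batch_ranges` is a hypothetical name, and it assumes contiguous integer ids starting at 1, which a real table may not have.

```ruby
# Returns [first_id, last_id] pairs, one per job, covering ids 1..total_records.
def batch_ranges(total_records, batch_size = 5_000)
  (1..total_records).step(batch_size).map do |first_id|
    [first_id, [first_id + batch_size - 1, total_records].min]
  end
end

ranges = batch_ranges(1_000_000)
ranges.size   # => 200
ranges.first  # => [1, 5000]
ranges.last   # => [995001, 1000000]
```

In a Rails application, ActiveRecord's `find_in_batches` offers similar batching over real primary keys without assuming they are contiguous.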
All data that is moved should be inspectable and verifiable for correctness before a switchover by the people responsible for the database move
It should be trivial to find differences between live data on the two servers at runtime from within the Rails console, and to query the new data in the same way as the old. It should also be possible to find out what percentage of the new server transition is done. Any data that is missing or incorrect should be easy to find and fix.
We need a mechanism to run through all the data and verify correctness. Since we already have a job system to ensure older data gets transferred, we add verification code to the job as well.
If a job finds an old record and a new record that have different data, it can add this record to a suspect list so it can be verified and rectified by a developer later, before the transition.
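The verification step might be sketched as a comparison over one batch of rows. The function names and the hash-of-attributes row format below are assumptions for illustration, not ActiveMigrator's actual interface.

```ruby
# Compares rows (id => attributes) and returns the ids whose destination copy
# is missing or different from the source copy.
def suspect_ids(source_rows, destination_rows)
  source_rows.each_with_object([]) do |(id, attrs), suspects|
    suspects << id unless destination_rows[id] == attrs
  end
end

# Percentage of source rows that have an identical copy in the destination.
def percent_verified(source_rows, destination_rows)
  return 100.0 if source_rows.empty?

  matched = source_rows.size - suspect_ids(source_rows, destination_rows).size
  (matched * 100.0 / source_rows.size).round(1)
end

source      = { 1 => { v: 1 }, 2 => { v: 2 }, 3 => { v: 3 }, 4 => { v: 4 } }
destination = { 1 => { v: 1 }, 2 => { v: 99 }, 4 => { v: 4 } }

suspect_ids(source, destination)       # => [2, 3]
percent_verified(source, destination)  # => 50.0
```

Record 2 is flagged because its data differs and record 3 because it has not been copied yet; both would land on the suspect list for a developer to review.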
Addendum: data conflict resolution
Underlying all of the above points, we need a mechanism to prevent data conflicts in a distributed system. If two requests try to update the same record at the same time, the updates could be applied in a different order on the source and destination databases. To stop this from happening, we use a distributed Redis lock around each record update. A record can receive only one update at a time, and that change is propagated to the destination before the next update begins.
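The effect of a per-record lock can be demonstrated inside a single Ruby process. This sketch substitutes a `Mutex` per record id for the Redis lock, so it only illustrates the ordering guarantee; a real deployment needs a lock that works across processes and hosts. `RecordLocks` is a hypothetical name.

```ruby
# In-process stand-in for a distributed lock: one Mutex per record id.
class RecordLocks
  def initialize
    @guard = Mutex.new   # protects the table of per-record locks
    @locks = {}
  end

  def with_lock(record_id)
    lock = @guard.synchronize { @locks[record_id] ||= Mutex.new }
    lock.synchronize { yield }
  end
end

locks = RecordLocks.new
source = { 1 => 0 }
destination = { 1 => 0 }

threads = 10.times.map do |i|
  Thread.new do
    locks.with_lock(1) do
      source[1] = i               # update the source record
      destination[1] = source[1]  # propagate before releasing the lock
    end
  end
end
threads.each(&:join)
```

Because the update and its propagation happen under the same lock, whichever thread runs last leaves the same value in both stores.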
This sort of locking is called pessimistic locking, and it can cause issues if you have heavily contended records. It forces all of the updates for a record to happen serially, each waiting on the one before it. This could cause 1,000 updates to the same record to happen 1,000 times slower than they otherwise would (this is an exaggeration, but you get the idea).
In Part 2 of this series, we’ll show you code samples from the ActiveMigrator project and some graphs of what happened on our servers as we migrated databases. Please follow our blog to be notified when we post the update!