14 Rules for writing a data migration

Guy Segev · Autodesk TLV · Nov 24, 2018

During the lifetime of a software project, we are occasionally required to make major changes, which may force us to modify our database to fit new behavior.

Most of the time our changes involve simple schema migrations, such as adding a new table, column or index, but sometimes we actually need to migrate the data itself.

Data migrations can be required for many reasons. Changes in the business logic or even new tech requirements can lead us to perform data migrations.

How should we plan data migrations? What is important to remember? What are the best practices?

Here are 14 rules for writing data migrations

1. Don’t write a data migration

I guess it is kind of ironic to start the list with this one, but it’s the most important rule.

Usually, if we think we need a data migration, it means we already have a lot of data.

Migrating a large amount of data carries many risks: edge cases we didn’t even think about and plenty of ways to mess up our data, leaving us needing another data migration just to fix the bugs of the current one.

Unlike schema migrations, data migrations can take a lot of time, sometimes even hours if you have a lot of data. Rolling back, if needed, can carry even more risk than running the data migration in the first place.

This is why the first rule is to try to avoid it as much as you can.

2. Deploy your code changes before migrating old data

Sometimes the obvious needs to be said when writing rules.

Let’s say, for example, we have a “firstName” and a “lastName” column, and we are required to create a “fullName” column.

Before migrating old data, we should deploy a new version of our code that supports the new field. Otherwise, while the migration is running, users on the old code will keep creating new rows with an empty “fullName” field.

After we support the new behavior (“fullName” in the example above), we will be left with a fixed number of rows to migrate.

Once we are not creating new “wrong” data anymore, we can run our data migration to complete the transition.
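As a minimal sketch of that new write path (assuming a Python service talking to PostgreSQL through psycopg2; the table, column and connection names are illustrative):

import psycopg2

def create_user(conn, first_name, last_name):
    # New write path: populate fullName on every insert, so rows created
    # after this deploy never need to be touched by the migration script.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO my_table (firstName, lastName, fullName) "
            "VALUES (%s, %s, %s)",
            (first_name, last_name, f"{first_name} {last_name}"),
        )
    conn.commit()

conn = psycopg2.connect("dbname=app")  # assumed connection settings
create_user(conn, "Jane", "Doe")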

The only exception is when we have a tight deadline and a lot of data to migrate, and we don’t want to leave the riskiest part to the end. In that case we can run our migration script twice: an early run handles the bulk of the old data, and a second run very close to the deadline is quick and safe because it only covers a small amount of new data.

3. Don’t try to optimize runtime

A data migration is a one-time script with a lot of risk; we all know that. A few seconds of runtime won’t change a thing. There are many ways to reduce runtime. Parallelizing the work with threads or multiple processes, for example, can accelerate the run, but it also exposes you to synchronization problems.

In such a risky process, don’t think about runtime. Think about how to get the job done safely, with minimal risk.

4. Estimate your script’s runtime

“Knowledge is power,” said Sir Francis Bacon, and we all know he was right.

Try to estimate how much time your script will run. Count the number of rows if you are iterating over a DB table, and/or run it on a development environment with a data sample. It will give you a lot of information and help you make decisions.
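One rough way to do that, sketched here with psycopg2 and the hypothetical my_table from rule #2, is to time a small sample and extrapolate from the total row count:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
SAMPLE_SIZE = 1000

with conn.cursor() as cur:
    # How many rows still need to be migrated?
    cur.execute("SELECT COUNT(*) FROM my_table WHERE fullName IS NULL")
    total_rows = cur.fetchone()[0]

    # Time a small sample to get a per-row cost (reads only here;
    # run the full read/write path on a dev copy for a better estimate).
    start = time.monotonic()
    cur.execute(
        "SELECT id, firstName, lastName FROM my_table "
        "WHERE fullName IS NULL ORDER BY id LIMIT %s",
        (SAMPLE_SIZE,),
    )
    sample = cur.fetchall()
    elapsed = time.monotonic() - start

per_row = elapsed / max(len(sample), 1)
print(f"{total_rows} rows left, estimated runtime: {per_row * total_rows / 60:.1f} minutes")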

You might discover your estimated runtime is too long or really short.

If it is too long, you might consider options such as running the migration on a dedicated server, splitting it across a few servers, each with a different scope, or refactoring some of your code.

If it is really short, you might consider simplifying your script or adding more tasks to it.

5. Write idempotent code

The main risk in data migration is the data itself. When you have millions of data rows to migrate, it is hard to predict all the different cases one can have.

Missing a case can cause our script to fail and exit. Investigating and debugging the failure might discover a new edge case.

What should we do then?

The first option is to add the missing case to our script; the second is to ignore it. Either way, we will need to re-run our migration to complete the job.

Writing idempotent code means that no matter how many times we run the migration, the result will be the same. This is an important property our script should have.

Let’s take, for example, a DB table with a “version” column. As part of our migration we need to increment the version by 1. Let’s say we will update in batches (stay tuned for why we run in batches). Each query will be:

UPDATE my_table SET version = version + 1 WHERE …

That’s an example of non-idempotent code, which can lead to wrong version values in case of multiple executions.

How can we make it idempotent?

Adding a column (which we can drop once we are done) to mark which rows were completed is one possible solution. We can then run the migration as many times as we want.

Before each update, we will check whether the row was already updated, so we won’t increment the version twice.
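Here is a minimal sketch of that idea, assuming PostgreSQL with psycopg2 and a temporary marker column added beforehand (e.g. ALTER TABLE my_table ADD COLUMN migrated BOOLEAN DEFAULT FALSE); all the names are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
BATCH_SIZE = 1000

while True:
    with conn.cursor() as cur:
        # Only rows that were not migrated yet are touched, so re-running
        # the script can never increment the same version twice.
        cur.execute(
            """
            UPDATE my_table
            SET version = version + 1, migrated = TRUE
            WHERE id IN (
                SELECT id FROM my_table
                WHERE migrated = FALSE
                ORDER BY id
                LIMIT %s
            )
            """,
            (BATCH_SIZE,),
        )
        updated = cur.rowcount
    conn.commit()
    if updated == 0:
        break  # nothing left to migrate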

6. Run in batches

Running huge update or insert queries is a very bad idea. If you have set a threshold for maximum query runtime (and I hope you have), the query will most likely hit that limit and fail. If not, it will take a lot of memory and has a good chance of choking your DB server.

Often we will also want to fetch data from other resources. Preparing all of it for one huge query can be a mistake.

Running in batches can help you isolate problems and let your service keep its resources for other purposes such as serving regular requests.

When running in batches, keep in mind the following:

1. Always use ORDER BY in your query. Otherwise you will repeat or miss records.

2. Do not base your query on data you are changing.

For example, if you have a table with two columns, firstName and lastName, and you want to create a third column, fullName. What will happen if you loop over your database with the following query: SELECT * FROM my_table WHERE fullName IS NULL LIMIT 1000? If some of our iterations fail for any reason, we will keep selecting the same rows over and over, which leads to an infinite loop.

Another common mistake is to combine OFFSET and LIMIT with a filter on data we have just changed. For example, SELECT * FROM my_table WHERE fullName IS NULL LIMIT [N] OFFSET [N*Index] will cause us to skip records: since the total count changes after each iteration, we cannot keep bumping the OFFSET value.
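A safer pattern, sketched below with psycopg2 and the hypothetical my_table, is keyset pagination: order by the primary key and advance a cursor value instead of relying on OFFSET or on the column we are changing:

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
BATCH_SIZE = 1000
last_id = 0

while True:
    with conn.cursor() as cur:
        # Paginate on the immutable primary key, never on the column we are
        # filling in, so failures cannot cause repeats or infinite loops.
        cur.execute(
            "SELECT id, firstName, lastName FROM my_table "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        rows = cur.fetchall()
    if not rows:
        break
    for row_id, first, last in rows:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE my_table SET fullName = %s WHERE id = %s",
                (f"{first} {last}", row_id),
            )
        conn.commit()
    last_id = rows[-1][0]  # advance the keyset cursor to the last id seen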

7. Use a SQL transaction for each resource

After retrieving data in batches, we have two more steps to go. The first is to process the data; the second is to save it back to our database.

How should we save it?

When saving, usually we can’t do it in batches. We need to run update queries one by one.

Sometimes updating a resource will require multiple queries, for example if we need to update its dependencies as well.

We want to update the whole resource together or fail. For that, we have SQL transactions.

We want a transaction per resource that also marks it as completed.

It might take more time per resource, but it will save us from incorrect behavior in case of failures and re-runs of our script.
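A sketch of such a per-resource transaction, assuming psycopg2, the “migrated” marker column from rule #5 and a made-up dependent table (my_table_children):

import psycopg2

def migrate_resource(conn, resource_id, full_name):
    try:
        with conn.cursor() as cur:
            # All queries for one resource either commit together or roll
            # back together, including the "migrated" completion marker.
            cur.execute(
                "UPDATE my_table SET fullName = %s WHERE id = %s",
                (full_name, resource_id),
            )
            cur.execute(
                "UPDATE my_table_children SET parentFullName = %s WHERE parent_id = %s",
                (full_name, resource_id),
            )
            cur.execute(
                "UPDATE my_table SET migrated = TRUE WHERE id = %s",
                (resource_id,),
            )
        conn.commit()
    except Exception:
        conn.rollback()  # the resource stays untouched and unmarked
        raise

conn = psycopg2.connect("dbname=app")  # assumed connection settings
migrate_resource(conn, 42, "Jane Doe")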

8. Blue-Green deployment

Blue-green deployment is a method of deploying code with zero downtime. Try to apply the same principles when running data migrations.

For example, if we need to change data in a column, we will create another column and run our migration. Once we are done, we will switch between the columns by renaming them. After testing and validation we can completely drop the old column.
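As a sketch of the final switch, assuming PostgreSQL (which supports transactional DDL) and an illustrative my_column that the migration rewrote into my_column_new:

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
with conn.cursor() as cur:
    # The migration wrote its output into my_column_new; the old data is
    # untouched, so rolling back is just swapping the names back.
    cur.execute("ALTER TABLE my_table RENAME COLUMN my_column TO my_column_old")
    cur.execute("ALTER TABLE my_table RENAME COLUMN my_column_new TO my_column")
conn.commit()
# Only after testing and validation:
#   ALTER TABLE my_table DROP COLUMN my_column_old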

Pros:

  • We kept the old data, so we can easily roll back
  • We can expose all our migrated data together and give a better experience to our users

Cons:

  • It is more work, including deploying code to maintain both columns before starting the migration

Using these principles will give you the tools to run safe migrations with minimal risk of losing data.

Better safe than sorry.

9. Write lots of logs

Once we are running on production environment, weird things can happen. All kinds of unwanted cases and mystery data values can show up.

Unlike regular HTTP requests, which we can try again and again easily, simulating failures in data migration can be a very challenging task.

Our only savior in those cases is our log.

Write as many logs as needed. Don’t worry about space; you can delete them once the script has completed successfully.

Write errors and warnings, report progress, and calculate runtime. Your log is your eyes, and while your script is running, you want to keep them wide open.
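A minimal skeleton, sketched with Python’s standard logging module (the file name and the numbers passed around are arbitrary):

import logging
import time

logging.basicConfig(
    filename="migration.log",  # safe to delete once the migration succeeds
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("migration")

start = time.monotonic()

def report_progress(done, failed, total):
    # Called every batch, so the log always shows where the script stands.
    elapsed = time.monotonic() - start
    log.info("progress: %d/%d rows, %d failures, %.0fs elapsed", done, total, failed, elapsed)

report_progress(0, 0, 100000)
# Inside the loop: log.warning("row %s skipped: %s", row_id, reason)
#                  log.error("stopping on row %s", row_id)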

10. Errors — Skip or Stop?

We should think ahead about unexpected errors, just as we do in any code we write. But in a data migration there is another aspect to consider.

Which errors should cause us to completely stop our script? Which errors are bad, but we can write them as a warning to our log, fix them later and re-run our script again just for those corrupted records?

Usually, if the error indicates that we have a bug in our script which can lead to wrong migration values for the next records, we should stop our script.

Another reason could be a bug that makes the whole script run for nothing, for example a wrong URL for a third party. Keeping the script running once we discover it would be a waste of time, since all the calls will fail anyway.

Otherwise, we can mark those rows, and let our script continue running so we can finish at least the ones it can.

Be mindful of your errors; don’t blindly skip or stop every time.
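One way to encode that decision, as a sketch (the exception class, the migrate_row callable and the (row_id, row) pairs are all illustrative):

class FatalMigrationError(Exception):
    """A bug that would corrupt the remaining rows; the script must stop."""

def migrate_all(rows, migrate_row, log):
    skipped = []
    for row_id, row in rows:
        try:
            migrate_row(row)
        except FatalMigrationError:
            log.error("fatal error on row %s, stopping the script", row_id)
            raise  # stop: continuing would only make things worse
        except Exception as exc:
            log.warning("skipping row %s: %s", row_id, exc)
            skipped.append(row_id)  # fix later, re-run only for these rows
    return skipped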

11. Third party calls

What if we need some data for our migration from a third party? How should we approach that?

The first answer is, please don’t. It will increase your runtime dramatically and add tons of unexpected behavior.

A good way to avoid it is to fetch all the data in advance. If you are the owner of that third party, you can create a temporary dump of the required data in your DB.

What if we have considered all the options and calls to a third party still must be part of our migration?

If so, consider the following:

  1. Consider stopping your script on 4XX responses (besides 429). They may point to a bug in your implementation.
  2. Keep running if you get 5XX responses, but make sure to log or mark those errors.
  3. Check whether the third-party server has rate limiting. If so, make sure not to hit it.
  4. Consider using a retry mechanism for your calls, especially for errors such as 429 (Too Many Requests); see the sketch after this list.
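A sketch of such a call wrapper, assuming the requests library (the timeout, retry count and backoff are arbitrary choices):

import time
import requests

def fetch_with_retry(url, retries=5):
    for attempt in range(retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # back off, then retry on rate limiting
            continue
        if 400 <= resp.status_code < 500:
            # Probably a bug on our side; let the script stop.
            raise RuntimeError(f"client error {resp.status_code} for {url}")
        if resp.status_code >= 500:
            return None  # log/mark this record and keep going
        return resp.json()
    return None  # retries exhausted; treat like a 5XX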

12. Rollback plan

Shit happens and when it does, we should be prepared.

The range of reasons for rollback can vary, from human error to wrong data modification.

If we used a blue-green strategy (rule #8), rollback can be really easy and fast. If not, it can be hell, but it is still doable.

Either way, keep rollback in mind. Plan it ahead and test it on a development environment. Make sure to back up your data beforehand as a second rollback plan.

A good rollback can save your life in case of a disaster.

13. Validate your migration

Once you are done, build a confirmation script to validate your job.

If we take our example from before about merging firstName and lastName into a fullName column, a good confirmation strategy can be the following query:

SELECT COUNT(*) FROM my_table WHERE CONCAT(firstName, ' ', lastName) != fullName

Getting zero indicates we are really done.

It can be part of the script, or we can run it manually. But we definitely need to think ahead about a strategy to make sure we are really done.
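As a sketch, wrapped into a tiny confirmation script (psycopg2 again, with the same illustrative names):

import psycopg2

conn = psycopg2.connect("dbname=app")  # assumed connection settings
with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(*) FROM my_table "
        "WHERE CONCAT(firstName, ' ', lastName) != fullName"
    )
    leftovers = cur.fetchone()[0]

if leftovers:
    raise SystemExit(f"migration incomplete: {leftovers} rows still mismatch")
print("migration validated: every fullName matches")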

14. Take your best mates for this mission

This one is not really a rule, but more of a piece of advice.

A data migration can be an unpleasant journey. It can run for hours and keep you awake at 3 AM at the office.

Take your best friends in the team to help you out and be your company for a pizza and ping-pong while the script is running.

That’s all, folks! I have tried to gather all the knowledge and lessons we have learnt in my team about writing a good data migration. Still, data migrations can be very different from one another. Not all of these rules are required every time, but being aware of them can save you a lot of time and frustration.

Good luck, and remember, data migrations are hell :)

P.S — Developers in Israel? Autodesk TLV is looking for you. Apply here.
