Taking the Pain out of RDB Migrations

Alberto Assmann
Project A Insights
May 22, 2017

A common joke states that there are only two challenging problems in software development: naming and cache invalidation. I would add a third one: data migration in relational databases. But what makes migrating data so hard when we’re dealing with relational databases?

Schemas and Stored Data

To avoid getting mixed up about what exactly we’re talking about here, let’s recall that there are two things that can change in our data storage. First, the schema, meaning the structure in which data is stored, can be altered. Schema changes require quite a bit of effort but are manageable. Second, the stored data itself might need to be migrated, either because of structural changes or because of changes applied to the data. Migrating stored data involves far more potential pitfalls than migrating schemas, so let’s take a closer look at how to minimize the risks.

Migration Requirements

First of all, we should gain some clarity about our main goals besides merely migrating the data. As you have probably experienced already, every migration takes effort to plan, write, test and execute, so the fastest approach will rarely be the one we end up choosing.

Ensure Rollback Functionality

We’ve probably all seen database migrations fail despite extensive preparation and testing. In a modern software development environment, being able to go back in time and revert changes is indispensable. This also matters, for instance, when data was dropped but is still being accessed after the migration. No matter what sort of problem you run into, being able to roll things back can save your life when migrating databases.

Minimize Downtime

Have you ever had to wait for what felt like an eternity until a store opened? Annoying, right? Now imagine a generic online shopper: he or she would probably feel the same, the only difference being that there are plenty of other shops open “next door” when you’re shopping online. So any downtime due to your migration will cost your company money and, maybe even worse, will annoy your customers. But even if it isn’t an online shop that relies on your database, downtime always comes at a price.

Unfortunately, minimizing downtime requires a lot of effort. Sometimes it might even make sense to accept a short, planned downtime and close the shop for a while, for example during the night.

Avoid Losing Data

No data should be removed unless it has been explicitly identified for removal.

There are several other requirements we could consider, but as I would like to keep it simple for now, these are the three most important ones.

Structural Changes

When migrating databases, the first of two steps is migrating your database’s structure, often called the schema.

There are several ways to issue a schema change: either by using a tool like Propel, Doctrine or Liquibase, or by manually executing queries like we did back in the day. While the latter can be dangerous, it gives you full control over your statements, which these tools otherwise generate and execute automatically. The tools split the schema migration into two parts: first they generate a diff between the current database and the expected schema, then they apply the statements needed to transform the current state into the expected one. Even though you could do both steps at once, I highly recommend reviewing and testing the generated statements in between to make sure everything works as expected.
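To make that two-step workflow concrete, here is a minimal sketch of the “review, then apply” half. This is not how Propel, Doctrine or Liquibase work internally; the file, table and column names are made up, and SQLite is only used to keep the example self-contained:

```python
# Apply a migration file that a human has already reviewed and tested.
# The generation/review step happens before this script ever runs.
import sqlite3

def apply_reviewed_migration(db_path: str, sql_path: str) -> None:
    with open(sql_path) as fh:
        statements = fh.read()
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(statements)  # run every statement in the file
        conn.commit()
    finally:
        conn.close()

# The reviewed file might contain a generated statement such as:
#   ALTER TABLE orders ADD COLUMN shipping_provider TEXT;
# apply_reviewed_migration("shop.db", "migrations/001_add_shipping_provider.sql")
```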

Another feature provided by these tools is the generation of automatic rollback statements. While this works for some statements, like an ADD COLUMN, which can easily be reverted with a DROP COLUMN, it’s nearly impossible for destructive statements. What would you do to revert a removed column, including the data that was stored in it?
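To illustrate the difference, here is a small, hypothetical pair of migrations (table and column names invented). The additive one has an obvious inverse a tool can generate; the destructive one does not:

```python
# An additive change has an obvious automatic inverse.
ADD_PROVIDER = {
    "up":   "ALTER TABLE orders ADD COLUMN shipping_provider TEXT;",
    "down": "ALTER TABLE orders DROP COLUMN shipping_provider;",
}

# A destructive change does not: no statement can bring back the dropped
# column together with the data it contained.
DROP_LEGACY = {
    "up":   "ALTER TABLE orders DROP COLUMN legacy_status;",
    "down": None,
}
```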

Revert Destructive Migrations

We can use a little trick to avoid such a situation by simply moving the column and data instead of removing it. Either move it to a new table or rename it — the approach pretty much depends on the use case.

Once you are sure the data isn’t being accessed any longer and you haven’t heard any complaints or run into any issues, you can safely remove it. While this enables you to reverse structural changes quickly, it requires proper planning of your migrations and a clear definition of what should be removed and when. In addition, you should set up monitoring to prove that the column is no longer in use. In our example, you could rename the column and watch the database error log to see whether anything still queries the old name.
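Sticking with the dropped-column example, a hedged sketch of that trick could look like this (again with invented table and column names): rename first, keep a trivial rollback, and only drop the column in a later migration once monitoring confirms the old name is no longer used.

```python
# Step 1: move instead of remove. The rename is cheap and fully reversible.
RENAME = "ALTER TABLE orders RENAME COLUMN legacy_status TO legacy_status_deprecated;"
ROLLBACK = "ALTER TABLE orders RENAME COLUMN legacy_status_deprecated TO legacy_status;"

# Step 2: only after the error log has stayed free of references to
# legacy_status for an agreed period does a later migration drop it for good.
FINAL_DROP = "ALTER TABLE orders DROP COLUMN legacy_status_deprecated;"
```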

Data Migrations

When migrating the actual data in the second step, we can make use of another trick: we split our data into data that really needs to be stored in the database and data that does not. In an e-commerce setup, taxes could be an example of data that doesn’t need to be stored in the database. But how can we distinguish between data we need to put in the database and data we are better off keeping outside? There are several indicators that can provide guidance here.

By Change Frequency

Data which needs to be changed frequently or where we do not know how often it will change should be put into the database, as this is the fastest way to change it.

By Usage

Once our code directly depends on the data itself and not only on its format, we should try to keep that data outside of the database. A good example would be different shipping options that are processed by different providers. Here, we would need to write code to process the shipping for each provider, and that code would depend heavily on the correct data being in the database. As long as there is no other reason to store it in the database, it is fine to store it directly in the code instead and thereby avoid all related migrations.

On the other hand, data that we want to join with other data should be put into the database, as joining them within the code comes with some overhead.

Of course, putting data in the code comes with a drawback: a change to data stored in the code requires a new deployment and is therefore not as fast as changing a row in the database.
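As a quick, hypothetical sketch, keeping the shipping options mentioned above in the code could look like this; the option names and fields are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShippingOption:
    code: str      # stable identifier used throughout the code base
    label: str     # what the customer sees
    handler: str   # name of the code path that processes this provider

# Each option needs its own processing code anyway, so adding or removing
# one is a code change (and a deployment), not a data migration.
SHIPPING_OPTIONS = (
    ShippingOption("dhl_standard", "DHL Standard", "process_dhl"),
    ShippingOption("ups_express",  "UPS Express",  "process_ups"),
)
```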

By Size

A third criterion is the amount of data and the size of the individual items. If we expect a large number of small records, we should store them in the database to take advantage of indices. On the other hand, a small number of large items is better stored outside of the database, as the file system is a better place for them and indexing wouldn’t provide much additional benefit.

As I have argued, it makes sense to distinguish different use cases for your data to determine where to put it. It’s not always the best solution to put data into the database. In some cases, it will be more beneficial to keep it outside, for example directly within the code. Sadly, there is no rule of thumb for when to put data into the database and when to keep it outside, as it depends heavily on the individual requirements.

In real life, we do not always have clear, separate use cases; more often we have to deal with mixed types. In these cases, it makes sense to distinguish between data that belongs to, and will be stored in, the code and data that changes frequently and should be stored in the database. A good example is the already mentioned shipping methods: the data configuring each method, such as the active/inactive flag or API keys, can be stored in the database and therefore be changed easily, while all the data our code depends on stays part of the code, and we combine the two each time we act on a shipment.
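A minimal sketch of that combination, assuming a made-up shipping_methods table with code, active and api_key columns: the provider logic lives in the code, the tunable configuration in the database, and both are combined at runtime.

```python
import sqlite3

# Behaviour lives in the code: one handler per provider (simplified stubs).
PROVIDER_HANDLERS = {
    "dhl_standard": lambda order_id, api_key: f"DHL label for order {order_id}",
    "ups_express":  lambda order_id, api_key: f"UPS label for order {order_id}",
}

def ship(conn: sqlite3.Connection, order_id: int, method_code: str) -> str:
    # Configuration lives in the database and can change without a deployment.
    row = conn.execute(
        "SELECT active, api_key FROM shipping_methods WHERE code = ?",
        (method_code,),
    ).fetchone()
    if row is None or not row[0]:
        raise ValueError(f"shipping method {method_code!r} is not available")
    # Combine both parts each time we act on a shipment.
    return PROVIDER_HANDLERS[method_code](order_id, row[1])
```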
