What Should a PostgreSQL Migrator do?

One of the tricky parts of programming against a database is ensuring the database schema in production is compatible with the code. A delicate dance must occur between modifications to the database and deployment of new versions of an application. Even if we consider the simple case, deploying with downtime, it still requires effort to ensure the application can operate without errors.

Most web applications use a database migration system to manage changes to the database. Migrators provide a few useful services:

  1. They provide a list of both past and future migrations.
  2. They ensure that the same migration is only applied once.
  3. They apply migrations in a deterministic order.

There are two main use-cases for a migration system:

  1. Incrementally migrate a production database forward.
  2. Create a development database from scratch and apply all the past and future migrations in order.

Ideally, a migration system is able to create a development database that matches a production database. However, this is not always the case. Additionally, many migration systems are ill-suited for iterating against a staging environment.

The goal of this post is to explain how these issues show up, and to discuss the design of a migrator that can avoid them. We will walk through the design of increasingly complex migrators along the way.

The Simplest Thing that Works

If you are the sole developer on a project and you do not have a persistent staging environment, migration is a much easier problem.

A simple migrator can use a trick from “Conditional DDL” to create a migrate plpgsql function that is only executed if a migration does not exist. The basic idea is that we will have a list of files in a directory that are named VERSION-MIGRATION_NAME, for example, 001-user-table. A migration would look like:

SELECT migrate(1, $$
CREATE TABLE users (
id SERIAL PRIMARY KEY REFERENCES
email text NOT NULL
);
$$)

and the migrations table could look like:

CREATE TABLE migrations
( version integer PRIMARY KEY
);

We would apply the migrations in our directory with a simple script:

set -eu
FILES=/path/to/migrations/*.sql
for f in $FILES
do
psql --single-transaction --no-psqlrc -v ON_ERROR_STOP=1 -f $f
done

Simple Isn’t Perfect

A method like this works and accomplishes many of our stated goals for a migration system. It has some drawbacks:

  1. It runs each migration in a separate transaction. This is problematic because if some of the migrations succeed and others fail, there might not exist a version of the code that can run with the state in which we have left the database. Ideally, we would like the entire collection of new migrations to be applied together, or not at all.
  2. If multiple engineers use this system, they are almost guaranteed to have merge conflicts.

We can address #1 rather straightforwardly by combining all the migrations into a single file before running:

set -eu
cat /path/to/migrations/*.sql | psql --single-transaction --no-psqlrc -v ON_ERROR_STOP=1

We can address issue #2 by using a lockable global index (think of a shared file or wiki page). Another common solution is to use the date as the version number. Using the date has the advantage that migrations can be named offline; and it is superior for other reasons, as we will see. How ever we solve #2 leads to another class of problems.

When the Order Lies

We would like believe that the ordering of migrations in source control reflects the order in which they were applied to the database in production. This way we can run the migrations in sequence on a development database and end up with the same schema as production. When multiple developers are writing migrations simultaneously, it is easy for this not to be the case.

Let’s say three developers are writing migrations on different branches. The migrations are names 006-blah.sql, 007-foo.sql, and 008-bar.sql. The developers of 007-foo.sql and 008-bar.sql are able to finish their work before the developer of 006-blah.sql and their migrations are merged to master and deployed to production. Later 006-blah.sql is deployed. So the order the migrations were applied to the database in production was 007-foo.sql 008-bar.sql 006-blah.sql but when we run the migrations against a dev database they will be run in sequence, e.g. 006-blah.sql 007-foo.sql 008-bar.sql.

One solution to this inconsistent ordering problem is to require that migrations are renamed before they are checked in. So 006-blah.sql would have to get renamed to 009-blah.sql, assuming 9 was free; this shows a weakness in using an integer version as opposed to dates (for all practical purposes dates are dense). Let’s update our migrations table to use dates as the version.

CREATE TABLE migrations
( version TIMESTAMP PRIMARY KEY
);

Trust But Verify

Renaming can actually be tricky in practice. If you’re employing a merge queue and continuous deployment system, you would have to automate the renaming in that case.

Even if we are very diligent and try our best to prevent inconsistent orderings, it would be nice to ensure we have been successful. An easy way to do that is to extend our migration table with the column to record the time of creation:

CREATE TABLE migrations
( version TIMESTAMP PRIMARY KEY
, create TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);

We can then review the order of migrations in the production database versus the order of migrations in source control.

Staging is a PITA

So far we have discussed the issues with ensuring that a dev database matches production. The assumption is that we are recreating our development database when migrating. If we want the database to persist, which is the case with staging environments where specific data is being used for an ongoing testing process, we will have to overcome different challenges.

Imagine we write a migration, deploy to staging and discover our migration has a bug. We then fix the bug and redeploy. Unfortunately, our migration will not get applied because the version number is the same as the previously applied buggy migration.

We could rename the migrations every time we make a change. Beyond being an onerous development process, this would expose another issue with our migration system: mainly, it has no way to alert us of migrations that exist in the QA database but do not exist in source control. Our migrator would happily apply the new migration while leaving the old one in place. This could lead to future problems when we deploy to production, because staging and production would not match.

To solve the “reapplying an existing migration” problem, we are going to add a hash column to our migrations table:

CREATE TABLE migrations
( version TIMESTAMP PRIMARY KEY
, create TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
, hash text NOT NULL
);

Or migrator will now have to generate a hash based on the contents of migrations and compare it against the existing hash. If the values are different for a migration with the same version the migrator will error, explaining that the staging environment needs to be reset.

Some migrators expect migrations to have a down or backwards method to undo a migration. I would advise against this. If the down method is written incorrectly, you will silently find yourself in a different state than production. It is better to use “point in time” backups to rollback the database to a state before the migration occurred. Alternatively regenerate staging. YMMV.

We should still error if we discover an applied migration is not in source control; but using the hash column allows us to discover that a migration needs to be rolled back without having to change the name with each modification.

Never Change, Migrations, Never Change

It would be nice if once we applied a migration to production, we never had to change the code. Unfortunately this is not the case.

The library that runs migrations can change in backwards incompatible ways. Even if your migrations are pure SQL, you can still run into incompatibility issues after upgrading PostgreSQL to a new version. Essentially, our migrations will themselves need migrations.

If we change an existing migration, the migrator will error when we deploy because the hashes will not match. We could just allow our migrator to include a method to update the hash column in production. There’s nothing wrong with this, but it feels icky. It would be nice if the migrations table was a record of what was actually applied. There is value in this for auditing even if the migration is no longer able to be applied with the current migrator/database version. We could always rollback to a prior version and recreate the exact database schema if we preserve enough information.

To improve the ability to understand how our database ended up in the state it is in, we will save more information. We will include two new columns, code and applied:

CREATE TABLE migrations
( id SERIAL PRIMARY KEY
, version TIMESTAMP NOT NULL
, create TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
, hash text NOT NULL
, code text NOT NULL
, applied bool NOT NULL
);

If we pass a special flag to our migrator, like --upgrade, it will not error if an existing migration has the same version but a different hash. It will instead add a new record but the applied column will be false.

The code column is the actual code used to make the migration. At this point we could get rid of the hash column if we like.

Saving the complete migration in the database helps us prevent bugs where we incorrectly update a migration. We can use the code column to ensure the migrations checked into source control correspond to migrations that were actually used on the database (assuming we don’t manually update that column).

Just to emphasize, this is not how one fixes a broken migration. There is no way to update a broken migration in production. You can only roll forward with new migrations (or in the case of catastrophic failure load up a backup, backup before you deploy people). The --upgrade flag is to cover a unusually edge case, a successful existing migration which becomes incapable of being used in development and staging. We store it in production database so the migrator does not think the changes are accidental (hash mismatch). This also gives us one place to store the migrations, both the ones that were applied in production and the upgraded ones we can use for development and staging.

If your struggling to think of why a pure SQL migration might need to be updated when the version of PostgreSQL changes, imagine you where using the => pre 9.2 or you were using the now deprecated timespan type. More examples can be found on this page: https://wiki.postgresql.org/wiki/Deprecated_Features#Version_9.2

The surface area for deprecation issues is much smaller for pure SQL migrations, which is an argument in their favor.

Conclusion

Thinking through all the issues I have run into with migrations during development has led me to this trust but verify view. The migrations table becomes the most authoritative source of what migrations have actually been applied. It is convenient to assume that source control has identical migrations in the same order, and as long as we verify “enough”, it will.

There are still other requirements. We need a way to take a database that is not yet using our migrator and generate an initial migration. This is what Flyway refers to as “baselining”. The design of the interface and its ability to help one understand and automatically address issues is a whole other conversation.

There are also entirely different issues to solve when scale becomes an issue.

The key points:

  1. Version is not enough: you want to know exactly what the migration is.
  2. Migrations change.
  3. Staging is different because you need to detect when to rollback migrations.

Have I left out major issues or workflows? Let me know before I try to put this ideas into practice and write PostgreSQL migrator once again.

Update: I wrote an alternative design here