Making Alembic migrations more team-friendly

Published in

Alan Product and Technical Blog

8 min readAug 3, 2023

At Alan, our backend is written in Python, with Flask as our web framework, and SQLAlchemy as the ORM to interface with our PostgreSQL database. The SQLAlchemy project comes with a tool for writing database migrations, Alembic. We use Alembic primarily for modifying the database structure as needed, but also for data migrations when they’re not too long or risky.

At the time of this writing, we have more than 932 tables in our PostgreSQL database, modeling all the data needed for operating our health services in 3 countries. 80 engineers are making changes daily on this database, to build new features, add constraints, improve data quality, and so on. And we’re growing, both in terms of services/features, and in terms of employees. Hence more and more migrations every month:

As a result, we found that Alembic’s rigid dependencies system was introducing some friction in our development process, and it would only get worse in the future. This is the story of why and how we got rid of it.

How does Alembic work?

A database migration tool lets you write scripts called migrations (or revisions) that describe how to change your database going forward.

Such tools generally have the following properties:

migrations are ordered: the tool knows which migrations to run, in which order;
migrations are reversible: they have a “up” mode, but also a “down” mode for reverting changes in case something bad happens;
migrations are versioned alongside your application code;
the database has a special table, where it holds the state of the schema, so the migration tool can play only missing migrations the next time it’s invoked.

Here’s the structure of a (real) migration in Alembic, that adds a authentication_id column in the user table:

% cat migrations/versions/556dbdeb7e81_add_authentication_id_to_user.py
"""Add authentication ID to user

Revision ID: 556dbdeb7e81
Revises: 12acc27ebaa7
Create Date: 2021-10-04 17:18:42.481218
"""
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

revision = "556dbdeb7e81"
down_revision = "12acc27ebaa7"
branch_labels = None
depends_on = None

def upgrade():
    op.add_column("user", sa.Column("authentication_id", postgresql.UUID(as_uuid=True), nullable=True))

def downgrade():
    op.drop_column("user", "authentication_id")

If you look carefully:

the migration number is 556dbdeb7e81: Alembic numbers migration with random 13 characters identifiers
the previous migration is 12acc27ebaa7: Alembic maintains a parent-child relationship for all migrations, so it can run them in order
it has an upgrade() function, which is the way forward, and a downgrade() one that inverts the operation and that you can trigger if needed

It’s interesting to note two specific design choices of Alembic:

it is imperative: changes are described as imperative steps (e.g. “add a column”), which is a great idea since it makes them auditable, you know exactly which operations has been performed (some systems use a description of the final state, they’re declarative)
it maintains strict & explicit dependencies between migrations. Some other systems decide to have migrations numbered sequentially (e.g. 001, 002, 003, …) or they adopt a numbering convention that keeps migrations loosely sorted (e.g. 2021110101, 2021110102, 2021110401, …) without strict constraints over which migration follows which one (example: Ruby on Rails)

This last design choice is questionable. It’s a great idea on paper, and allows for alembic to only store the last migration that’s been run. But generally database migration systems that take this road don’t tolerate that there are ambiguous ways of running migrations in order. Let’s take an example.

The problems

Let’s say the current last migration (aka “head” in Alembic parlance) is “aaaaa”. Two developers work on their own branch, and want to introduce database changes. Bob wants to introduce a migration “bbbbb” and Charlene wants to introduce a migration “ccccc”. They both declare a new migration whose parent is “aaaaa”. Their migrations will work well on their own branches separately. But once the two branches are merged on the main line, there’s something ambiguous: the system doesn’t know which one of “bbbbb” or “ccccc” should run before the other.

This is exactly the problem we were facing at Alan. Multiple times a week, branches that were working perfectly fine in PRs would be merged, and break the ability to run migrations on our main branch. At this point, someone would say on Slack “hey, build is broken on the main branch, hold on everyone I’m fixing”.

The fix can be changing the parent of one of the two migrations, or generating a “merge” migration that has two parents (described in details in Alembic documentation).

The point is your production line has to stop when someone is fixing, and there’s no good solution that is distributed. Even if you add tooling around it, you will face collisions, for instance because people will have competing merge revisions in different branches.

How could we make it better?

Faced with this problem, we had many solutions:

do nothing
write some tooling around the problem: we did it, partially: we had a command to automatically “chop” one of the two migrations and change its parent
ensure branches are merged sequentially and freshly rebased: GitHub lets you do that, and/or you can have a “merge queue”, but it only moves the problem to a different place; and you still have constant, manual, low-value work to do
use “upgrade heads” instead of “upgrade head”: with this Alembic won’t bother about ordering, but we still need to reconcile branches at some point or we would loose all ordering guarantees in the end
search for alternative migration systems
simplify the operational part

We strongly felt that we could simplify the whole thing and make this problem ✨disappear✨. So we decided to go for the last option!

Implementing improvements

Our ideal target is a migration system like Ruby on Rails, where:

1/ migrations are numbered with something that makes them roughly sorted naturally (i.e. we never run a migration that’s 2 years old before the ones of last week, but we don’t care much about the order of migrations introduced the same day/week);
2/ we get rid of the strict dependency between each migration;
3/ we store explicitly all the migrations that have been run;
4/ we modify the migration runner so it relies on “3” and runs only missing migrations, in the order we got from “1”

1/ Better numbering of migrations

Alembic provides a hook that allows you to customize the name and content of revisions that it generates. It’s documented at length in Alembic’s documentation, in “Cookbooks”.

We can use it to generate names with an inverted date scheme, instead of random strings:

def run_migrations_online():

    # ...

    def process_revision_directives(context, revision, directives):
        # 20210801211024 for a migration generated on Aug 1st, 2021 at 21:10:24
        rev_id = datetime.now().strftime("%Y%m%d%H%M%S")
        for directive in directives:
            directive.rev_id = rev_id



    # connectable = ...

    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            process_revision_directives=process_revision_directives
        )

        with context.begin_transaction():
            context.run_migrations()

Now our migrations have a naming scheme that’s roughly sorted, e.g. 20211110161024_add_constraint.py.

2/ Loose dependencies between revisions

We want to relax the constraints on the “down_revision” property of a given revision. The main idea here is to introduce a dynamic function to compute the down revision. In all migrations from now on, instead of this:

revision = "556dbdeb7e81"
down_revision = "12acc27ebaa7" # hardcoded, not flexible

… we would have this instead:

revision = "20211110161024"
down_revision = find_previous_revision(revision)

This function would:

list all revisions in the directory, in order (memoized for performance reasons);
keep only the ones in the new format, aka 14 numeric characters;
return the one just before the one passed as argument.

Pretty simple right? Actually I won’t show any code here, because the code itself is not really interesting, and we ended up dropping the down_revision concept entirely because of the next steps.

3/ Storing all migrations and 4/ Running missing ones in order

At that point we had a working migration system that looked like the ideal one, for happy path cases. If migrations were inserted in order, everything was good and more dynamic. But we actually did all this work so we could insert migrations in the middle of the chain without worrying about it, and with no annoying synchronization needed.

It turns out Alembic only stores the last migration it has run in its reference alembic_version table. It doesn’t store all past migrations. So without changing this, we can’t introduce a migration in the middle of the chain.

Let’s imagine our main branch contains:

% ls migrations/revisions/
…
20211001160000_add_constraint.py # migration from October, 1st
20211130160000_add_new_table.py  # migration from November, 30th

And our current database has already applied the last migration from the 30th of November. Then if we merge a PR containing a migration generated on the 29th of November, it will be lexically before the last one, and Alembic has no clue that it has to apply this migration.

It was time to take a step back:

from the beginning we hoped to be able to hook into Alembic and make it work like we wanted;
but it turns out that the way it run migrations is fundamentally different from what we need, and there’s no public hook available for changing this behavior

In the end we do want to retain Alembic core features (transparent operations, integration with sqlalchemy, auto-generation of migrations, running migrations up and down, etc.). But we want to rework how it determines the ordering of scripts and how it stores its state in the database.

The full implementation is not super interesting as it relies a lot on Alembic complex internals, but we did basically 3 things:

1/ we created our own table to track migrations, called “schema_migrations” (we took inspiration from the Rails world, in particular this article by Sam Saffron)
2/ we provided Alembic with our own “revision_map” method, which is the way alembic determines the ordered tree of migrations to run
3/ we added a simpler “MigrationContext” subclass, with its own “run_migrations” method that determines which migrations should be executed by looking into “schema_migrations”, only reporting migrations that are NOT present in this table

Epilogue: transitioning to the new system

One day we asked engineers to avoid committing new migrations and transitioned to our new system. In practice we just had to declare all past migrations as already executed in our new “schema_migrations” table, and voilà!

We got the additional benefit that we now have a table that tracks when migrations were run on a given environment, and how much time they took, which proves to be useful at times!

Of course, the approach we took in this article is not a silver bullet:

it solved a big pain in our context, but you might be perfectly happy with standard Alembic in yours!
we had to patch some Alembic internals, so it makes the Alembic upgrades a bit more difficult and the system might require more investment at some point

Overall we never had a human synchronization issue because of conflicting migrations again. So we made our initial problem completely disappear, which perfectly aligns with how we like to approach engineering problems at Alan! 🚀