Safely Running Database Migrations

Cássio Marques
Published in multiverse-tech
Mar 22, 2023

Many migration operations can be destructive, meaning that they change or remove certain structural database characteristics, which can lead to data inconsistency, application runtime errors and database/application performance degradation. We should always do our best to ensure that database migrations do not cause downtime or errors that might affect end users.

Introduction

This post is an attempt to cover some of the more problematic operations, with techniques that can help limit the impact of running migrations on production databases.

This post assumes you’re using a recent version of PostgreSQL. It also assumes Ecto migrations are used, although most of the information here is programming language/library agnostic.

Changing Data via a Migration

JUST DON’T.

Seriously, don’t change data as part of a migration. In general, migrations should never directly execute SQL statements such as insert, update or delete. Migrations should be limited to changing a database structure, never its data. Any data changes should be performed as an individual step, via a (preferably testable) script.

Renaming Columns

Renaming columns has the potential to cause downtime or an increased error rate in your application. You’ll usually need to consider the following before deciding how to proceed:

  • How frequently is the affected table used?
  • Do you understand how many users could potentially be affected by this change in case of errors?
  • Would it be safer to make this change outside working hours?

When considering the points above you’ll need to decide whether some downtime or an increased error rate is acceptable. If it is, you can create a single pull request containing the migration that renames the column and the code updates and deploy it. Make sure you notify other engineers of the potential for errors when the new code is deployed.

If you decide that downtime or errors are not acceptable, you’ll need to use a more complex approach, performing the renaming in steps.

  1. Add a new column with the desired name, with the same data type and equivalent indexes/constraints as the column you want to replace.
  2. Create a mechanism that will copy the value from the old column to the new one whenever it changes. This can be done in one of the following ways:
    – If there is a small number of places where write operations to the table in question happen and you can reliably track where those are, add code that keeps the two columns in sync (see the sketch after this list), or
    – If there are multiple entry points for write operations to the table in question and you are unsure where they are, create a trigger that will copy the value from the old column to the new one for every record that gets inserted or updated.
  3. Update all the existing rows so that the new column has the same values as the column you want to replace.
  4. Update your code to use the new column name.
  5. Once your code is no longer using the old column, drop the trigger (or remove the syncing code).
  6. Delete the old column.
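
For the first option, a minimal sketch of keeping the two columns in sync from within a changeset might look like the following. The schema module and fields are illustrative, using the same surname → last_name rename as the trigger example below, and it assumes the new column has already been added:

defmodule MyApp.Accounts.User do
  use Ecto.Schema
  import Ecto.Changeset

  schema "users" do
    field :surname, :string
    field :last_name, :string
  end

  def changeset(user, attrs) do
    user
    |> cast(attrs, [:surname, :last_name])
    # Whenever the old column changes, mirror the value into the new one.
    |> copy_surname_to_last_name()
  end

  defp copy_surname_to_last_name(changeset) do
    case fetch_change(changeset, :surname) do
      {:ok, surname} -> put_change(changeset, :last_name, surname)
      :error -> changeset
    end
  end
end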

Trigger example

The problem with triggers is that they are harder to test and can lead to some unexpected results. An approach that gives us a bit more control over the changes executed by triggers is to use two separate triggers (sharing a single function), one for inserts and another for updates.

Given a users table containing a surname column that needs to be renamed to last_name, we would have something like the following:

defmodule MyApp.Repo.Migrations.AddLastNameToUsers do
  use Ecto.Migration

  def up do
    execute """
    create or replace function users_copy_last_name_from_surname() returns trigger language plpgsql as
    $$
    begin
      NEW.last_name = NEW.surname;
      return NEW;
    end;
    $$;
    """

    alter table("users"), do: add(:last_name, :string)

    execute """
    create trigger before_insert_trigger_on_users_last_name
    before insert on users
    for each row execute procedure users_copy_last_name_from_surname();
    """

    # Firing only when `surname` is explicitly updated ensures that updates
    # changing other columns (including new code writing to `last_name`) won't
    # execute the trigger, thus avoiding accidentally overwriting values in `last_name`
    execute """
    create trigger before_update_trigger_on_users_last_name
    before update of surname on users
    for each row execute procedure users_copy_last_name_from_surname();
    """
  end

  def down do
    execute "drop trigger before_insert_trigger_on_users_last_name on users"
    execute "drop trigger before_update_trigger_on_users_last_name on users"
    execute "drop function users_copy_last_name_from_surname"

    alter table("users"), do: remove(:last_name)
  end
end

IMPORTANT: Bear in mind that an alter table statement will acquire an ACCESS EXCLUSIVE lock on the table being altered, and that a single bulk backfill (e.g. one big update users set last_name = surname) will lock every affected row for the duration of the statement. Depending on the size of the table, the indexes on it, etc., we might prefer to lazily backfill the new column instead. This can be achieved via a script that issues UPDATE statements over the existing rows in small batches.
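
As an illustration only, such a backfill script could look something like the sketch below (the module name and batch size are arbitrary, and it assumes users has an increasing integer primary key id). It copies surname into last_name in keyset-paginated batches, so each UPDATE only touches, and locks, a limited number of rows:

alias MyApp.Repo

defmodule BackfillLastName do
  @batch_size 1_000

  def run(last_id \\ 0) do
    # Pick the next batch of ids that still need syncing and update them in one statement.
    %{rows: rows} =
      Repo.query!(
        """
        with batch as (
          select id
          from users
          where id > $1 and last_name is distinct from surname
          order by id
          limit $2
        )
        update users u
        set last_name = u.surname
        from batch
        where u.id = batch.id
        returning u.id
        """,
        [last_id, @batch_size]
      )

    case rows do
      [] -> :done
      rows -> rows |> List.flatten() |> Enum.max() |> run()
    end
  end
end

BackfillLastName.run()

Saved as a one-off .exs file, a script like this could be run with mix run (or wrapped in a release task), and it can be re-run safely since already-synced rows are skipped.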

Once your code has been updated to no longer use the old column and your application has been deployed to reflect those changes, do:

defmodule MyApp.Repo.Migrations.RemoveSurnameFromUsers do
  use Ecto.Migration

  def up do
    execute "drop trigger before_insert_trigger_on_users_last_name on users"
    execute "drop trigger before_update_trigger_on_users_last_name on users"
    execute "drop function users_copy_last_name_from_surname"

    alter table("users"), do: remove(:surname)
  end

  def down do
    execute """
    create or replace function users_copy_last_name_from_surname() returns trigger language plpgsql as
    $$
    begin
      NEW.last_name = NEW.surname;
      return NEW;
    end;
    $$;
    """

    alter table("users"), do: add(:surname, :string)

    execute """
    create trigger before_insert_trigger_on_users_last_name
    before insert on users
    for each row execute procedure users_copy_last_name_from_surname();
    """

    execute """
    create trigger before_update_trigger_on_users_last_name
    before update of surname on users
    for each row execute procedure users_copy_last_name_from_surname();
    """
  end
end

IMPORTANT: Make sure you can REALLY delete that column. Once it’s gone, reverting the migration won’t bring the data back: the down step can recreate the column, but it will be empty, and you’ll have to copy the data back over from the new column!

Removing Columns

Before removing a column, you’ll need to do a few things:

  1. Check if there’s still code using that column by searching across the whole project and updating the code as needed to remove references to that column name. Create and merge a PR with any needed changes.
  2. Now that the obvious places where the column was used have been removed, write a migration that removes the column, run it and then run the tests. Are there any failures? You might have to repeat step 1.
  3. We can’t trust tests to cover all possible scenarios, so it might be a good idea to run the application locally and do a bit of manual testing, focusing on the areas where you know the tables being changed are used.

If you run into problems trying to remove the column because it’s referenced by a foreign key constraint on another table or something similar, you might consider using drop ... cascade. There's no support for it in Ecto migrations, so you'll need to use execute:

execute "alter table foo drop column bar cascade"

When using drop ... cascade, run it locally first and check exactly which constraints (and other dependent objects) get dropped, then think about whether losing them is OK from a data consistency perspective. In some cases, you might find out that you can't actually remove that column.
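
Before reaching for cascade, it can help to see which foreign keys on other tables reference the table you're changing. A query against pg_constraint along these lines (illustrative, using the foo table from the example above) lists them:

select conrelid::regclass as referencing_table, conname
from pg_constraint
where contype = 'f'
  and confrelid = 'foo'::regclass;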

Adding not null constraints

If the column you want to change already exists in the production database, you’ll need to check whether any existing rows currently have a null in that column. If there are any, the migration will fail once deployed to production, so you'll need to:

  1. Change the application to ensure that all new records inserted in that table will have a non-null value for the column in question.
  2. Prepare the existing data before you can apply the change.
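
A quick check for existing nulls (using the same illustrative foo/bar names as the Ecto example further down) could be:

select count(*) from foo where bar is null;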

For the first step, you’ll need to change your application code so that it’s guaranteed that all new records will have a valid value for the column in question. This can be done via one of the following approaches:

  • Using an intermediary migration that sets a default value for the column (in case all new records can share the same default value)
  • Adding logic that correctly defines the value of the column before the record is inserted/updated.

On top of the above options, you might want to add validations to your changesets, ensuring that attempts to insert records containing a null for that column will fail.
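
For example, with validate_required/2 (the schema and field names here are illustrative):

# Inside a schema module that imports Ecto.Changeset
def changeset(product, attrs) do
  product
  |> cast(attrs, [:price])
  |> validate_required([:price])
end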

To prepare the existing data, we need to backfill all records that currently have null as the value for the column we want to change. You should not do that as part of the migration itself, since migrations should only be used to manipulate the structure of the database, not its data. Instead, create a one-off script to fix the data. On top of being decoupled from the migration, this also allows the script's code to be tested.

Once these steps have been performed and the new version of the application code has been deployed, you can proceed and create a new migration to add the not null constraint. In Ecto, the modify migration command can be used:

alter table(:foo) do
  modify :bar, :string, null: false
end

One caveat of the above is that you’ll need to be careful to not override certain properties of the column. For example, consider a products table containing a price column with type numeric(5,2). If you were to use something like modify :price, :decimal, null: false, you would accidentally change the column type to just numeric (without specific values for precision and scale). Make sure that when adding the not null constraint via modify you repeat all column modifiers specified when the column was originally created.
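
For the products example above, a sketch of adding the not null constraint while preserving the original precision and scale could look like:

alter table(:products) do
  # Repeat the existing type modifiers so they aren't lost when adding null: false
  modify :price, :decimal, precision: 5, scale: 2, null: false
end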

Adding foreign key constraints

A foreign key constraint adds a relationship between two tables. For example, consider an orders table and a customers table.

-- Orders
+-------------+--------+------------------------------------------------------+
| Column | Type | Modifiers |
|-------------+--------+------------------------------------------------------|
| id | bigint | not null default nextval('orders_id_seq'::regclass) |
| customer_id | bigint | |
+-------------+--------+------------------------------------------------------+

-- Customers
+----------+------------------------+---------------------------------------------------------+
| Column | Type | Modifiers |
|----------+------------------------+---------------------------------------------------------|
| id | bigint | not null default nextval('customers_id_seq'::regclass) |
| name | character varying(255) | not null |
| email | character varying(255) | not null |
+----------+------------------------+---------------------------------------------------------+

The orders table has a customer_id column that references the id column of the customers table. A foreign key constraint is useful to ensure that no record can be inserted into the orders table with a customer_id that does not exist in the customers table.

If the column you want to change already exists in the production database, you’ll need to check whether there are any existing rows where that column is currently null, and whether there are any existing rows where that column holds a value that doesn't actually exist in the referenced table/column. (Strictly speaking, a foreign key constraint won't reject null values, but unexpected nulls and orphaned references are both worth finding and fixing before adding the constraint.)

Using the orders and customers tables as examples, such queries would look like this:

select count(1) from orders where customer_id is null;

select o.* from orders o left join customers c on o.customer_id = c.id where c.id is null;

If either of the above queries (adapted to your particular scenario, of course!) returns a non-zero count or any rows, you’ll need to:

  1. Find out why records have been created where the column in question is either null or has a value that does not exist in the referenced table and fix the code accordingly.
  2. Fix the existing production data before deploying your migration to production. This can be done via a script. Do not use the migration that adds the foreign key constraint to also fix the data; migrations should only be used to manipulate the structure of the database, not its data.

Once these steps have been performed and the new version of the application has been deployed (in case code changes were needed), it should be safe to deploy the migration that adds the foreign key constraint.

Using NOT VALID to Add Constraints Without Locking the Table

When adding a constraint, an alter table statement is used, which will usually acquire an ACCESS EXCLUSIVE lock on the whole table being changed. That means that all connections trying to access that table will be blocked and put in a queue until the alter table is completed. This is needed because the database engine has to scan the whole table to check that the constraint is satisfied by all existing records. This can cause timeouts or even bring your application down in case all connections in the connection pool end up being blocked at the same time.

One way to avoid locking the table whilst a constraint is applied is to use the not valid option. With not valid, the addition of the new constraint is committed immediately and there is no table scan to check that existing records don't violate the new constraint (new and updated rows are still checked). You can then run any tasks needed to amend the existing data, making sure the constraint is not violated.

Ecto does not support the not valid option directly, so you'll need to use execute:

execute "alter table orders add constraint orders_customer_id_fkey foreign key (customer_id) references customers (id) not valid"

If after amending the data you want to double-check that no invalid records remain, you can re-run your SQL checks. There’s also the option of running a validate constraint statement against the database, which validates all existing records without taking a lock that blocks reads and writes (all records inserted or updated after the constraint was added are already being checked, so only older records need to be validated). To use validate constraint you'll need to create a new migration with something like this:

execute "alter table orders validate constraint orders_customer_id_fkey"

One issue with the above is that, whilst there’s no risk of downtime when validating the constraint, the migration can still fail if invalid records remain, which means your deployment will fail. It’s probably still worth doing, though: by this point you’ve already fixed any data inconsistencies via a script, so this acts as a sanity check that all the data is now correct.

Creating Indexes

Creating indexes is usually an expensive and time-consuming operation. Normal index creation in PostgreSQL will lock the table for any write operations (insert, update, delete) until the index creation completes. Whilst read operations against that table can still happen, any transactions attempting to modify data in that table will be blocked. This can cause timeouts and/or application downtime.

It’s possible to create indexes in PostgreSQL without locking the table, through the use of the concurrently option of the create index statement. Using this option allows other operations to continue while the index is built; however, creating the index will take longer than conventional non-concurrent index creation, since it involves more table scans.

Regardless of whether an index is created concurrently or not, there will be an increase in I/O and CPU usage on the database server, so other operations might be affected and general performance degradation might be observed.

Ecto supports concurrent index creation natively; however, the migration’s DDL transaction (and the migration lock) needs to be disabled.

An example migration (shamelessly stolen from the Ecto docs) would look like:

defmodule MyRepo.Migrations.CreateIndexes do
  use Ecto.Migration
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    create index("posts", [:slug], concurrently: true)
  end
end

Unique Indexes

With unique indexes, the same value (or combination of values, in the case of composite indexes) can't appear in more than one record. This means that a unique index also acts as a type of constraint, so you'll need to verify the state of the existing data in production before trying to apply a migration that creates the unique index.

As an example, consider a users table containing a username column. To verify that existing usernames are unique, the following SQL query could be used:

select username, count(*) from users group by username having count(*) > 1;

The query will return either no results (if there are no duplicates) or a list of the duplicate values along with how many times each one is reused.

If there are duplicated values across multiple records, you’ll first need to amend the data to ensure uniqueness prior to deploying the migration that adds a unique index. You can do that via a script.
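
Once the data is unique, the index itself can be created concurrently, following the same pattern shown earlier (the module name is illustrative):

defmodule MyApp.Repo.Migrations.CreateUsersUsernameUniqueIndex do
  use Ecto.Migration
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    create unique_index("users", [:username], concurrently: true)
  end
end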

Conclusion

Taking the required steps to safely run destructive migrations on a live system usually means doing a bit more work to ensure that everything keeps working correctly whilst the changes are being applied. As is usually the case with any code we write, the techniques proposed in this post require proper testing before being deployed, ideally with data sets that are as similar as possible to the ones in production.

It's an extra cost that is well worth paying as it allows us to safely evolve our database schemas without affecting our end users.
