Stop worrying about PostgreSQL locks in your Rails migrations

Published in

Doctolib

10 min readDec 14, 2020

Doctolib applications are used by more than 135 000 medical practitioners in France and Germany. They receive more than 70 million patient visits per month. At peak time, our Postgres database handles more than 15 000 requests per second. Acting in the healthcare field, reliability is critical for us. A downtime directly impacts the access to healthcare for practitioners and patients.

At Doctolib, we write a lot of migrations in our Rails monolith: during the first half of the year 2020, 13 000 pull-requests were merged, in which more than 500 migrations were created and executed against our production database.

The development team also increased to 100 full stack developers. One of our most important principles to handle this quick scale is “keep the stack simple”: we limit the number of technologies that we use, and we stay as close as possible to the standards.

About 2 years ago, we had our first downtime due to a migration, we added a column without any constraint on a relatively small table (less than 100k rows), which looks pretty safe, and the platform went down for a whole minute. From this moment, we started digging into Postgres locking system to prevent any incident of this kind to happen again.

In this article we will go through the most common pitfalls that could happen when migrating a database, and how, at Doctolib, we mitigate them.

This article was co-written with Romain Choquet.

Making a migration safe can be daunting

Migrations are manipulations of the database structure: they often add or remove columns, modify indexes or create new constraints. Often, due to Postgres concurrency model, these manipulations cannot be executed concurrently with other requests. For example, it is not possible to read on a table while adding a new column: PostgreSQL will wait for one operation to finish before executing the next one.

In the context of a high-load database, not being able to use a specific table for more than a few seconds can be catastrophic: all requests will start queuing and it may put the whole platform down.

Let’s look at a few examples of Rails migrations, the kind of locking they require, and how to mitigate their impact.

Adding an index

Adding an index is done quite frequently: over the past 6 months, 10% of the migrations made at Doctolib were creating an index.

The default way of doing it is pretty straightforward. According to the documentation we would write it as follows:

The add_index method translates to a CREATE INDEX statement in Postgres. If we take a look at the locking documentation we see that this type of statement requires a SHARE lock, which will prevent any write to that table while the index is being created.

Most of the time we cannot afford to prevent writing on a table for more than a few seconds, and an index creation is likely to require far more to complete. Fortunately since PostgreSQL 8.2, indexes can be created concurrently, meaning they will not block write operations.

Let’s look at how we could write this migration in a safer way:

In this new migration, we start by disabling the transaction used by default with Rails (for more details, have a look at the Rails documentation), as PostgreSQL does not allow concurrent index creation in transaction. On top of adding the concurrent algorithm option, we also had to unset the statement timeout as it is highly possible that the index creation will not succeed within the period allocated by the default application timeout. We then restore the statement timeout to avoid impacting any subsequent migrations.

While we sacrificed a bit of its simplicity, we can now execute this migration without impacting our users.

Adding a foreign key

Adding a foreign key can be done in a single line:

If executed on Doctolib’s production database, this migration would cause a downtime. The ADD FOREIGN KEY operation requires a SHARE ROW EXCLUSIVE lock, preventing writing on both appointments and patients tables while the migration is running.

An obvious mitigation would make use of the PostgreSQL timeout settings as we did in the previous example:

This migration will still lock both tables, but only up to five seconds. However, it has a substantial drawback: PostgreSQL will need to validate that all existing rows in the appointments table are valid by scanning the whole table. This operation is likely to take more than 5 seconds to complete, and will eventually timeout.

Yet again, PostgreSQL provides us with interesting options to solve this issue:

add the constraint without validation: it will only consider rows added after the migration as valid, but it will not scan the whole table to validate previously existing rows;
validate the constraint: it will validate previously existing rows in the table, without blocking read or write.

Let’s look at how we could implement this idea:

The statement timeout management gets a bit complex, but the main idea remains the same: we ensure that blocking statements are executed with a short statement timeout, and that non-blocking statements are executed without any.

Let’s look at an example that is a bit more unintuitive.

Adding a column

Once again, the vanilla version of this migration is straightforward:

A similar migration was run against our database and brought the application down for a minute. We were already aware of some pitfalls when adding columns with default value or not null constraint, but none where adding a simple column could block an entire table for such a long time.

As per the documentation, adding a column to a table requires an ACCESS EXCLUSIVE lock: it will prevent any access to the table while the statement is being executed, but this should be fine as this kind of statement usually takes a few milliseconds to execute. But the less obvious part is that this statement will also block any new queries while it is waiting to be executed.

Let’s look at what happened from the PostgreSQL perspective when we executed this statement on our production database. The following graph represents the query flows on the database: at time a we submit the migration, and it finishes at time b.

As we can see, no query will be able to start on the target table between a and b, as the add_column statement is either waiting to be executed or is being executed. It is likely to make all web servers busy, waiting for the database, and to cause a downtime.

Contrary to index or foreign key creation, we cannot avoid this situation: we can only mitigate its impact. Our best option here is to add a lock timeout, which will abort the migration if the lock is not acquired after a certain amount of time:

The execution flow now becomes:

We see now that the add_column statement will still block the table but for a much shorter duration. One obvious drawback is that this migration will fail, and we will have to manually retry it.

Note:
If ACCESS EXCLUSIVE is the strongest lock and prohibits any access to the table, other locks also need to be taken into account. For instance, the SHARE lock (used while adding a foreign key) will prevent any write operation while the migration is pending and executing.
For this reason, it is useful to set a lock timeout for almost any schema change.

Let’s look at one last migration.

Adding a reference

Adding a reference encompasses the principles that we saw previously. The naive version looks like this:

However, it falls into every pitfall that we mentioned earlier. Under the hood, add_reference is only syntactic sugar around add_column, add_index and add_foreign_key. It has the following issues:

Adding the new column is susceptible to exclusively lock the appointments table;
Adding the index is not done concurrently, and thus locks the appointments table;
Adding the reference will lock any write operation on both tables.

Here is a safer version of this migration:

This migration is now safe, but it diverged from the original idea of a migration, as described by the official documentation:

Migrations are a convenient way to alter your database schema over time in a consistent way. They use a Ruby DSL so that you don’t have to write SQL by hand, allowing your schema and changes to be database independent.
Rails guide: Active record migrations

It is not convenient: it is rather complex, error prone, and requires a fairly deep knowledge of PostgreSQL locking mechanisms.
It does not use the ActiveRecord DSL, which makes it more complex to read and database dependent.

Given the number of migrations that we write every day, it is safe to assume that mistakes will eventually happen, even if we managed to train everyone.

That’s how we came up with the safe-pg-migrations gem. It does the heavy lifting when it comes to making a migration safe, and allows developers to focus on delivering value to our end users.

safe-pg-migrations

Our first goal is to reduce the risk of a migration causing downtime from likely to very-unlikely while not complexifying the process of migration for developers.

Some gems addressing the first point already exist, the most popular one being strong migrations. It acts as a safety belt at development time, reminding developers that migration can be dangerous and provides guidance to make them safe.

We have been using this gem for a long time, but on some occasions it failed at protecting us from downtime, especially for some of the more subtle locking scenarios described above. It also requires us to write additional code to make migrations safe, making it error-prone.

To try to achieve a “safe by default” approach, safe-pg-migrations hooks into ActiveRecord migrations and modifies their default behavior.

Just by adding this gem to our repository, we can go back to writing vanilla-looking migrations. Let’s look at this index creation:

As we saw previously, making this migration safe required some boilerplate, but if we run db:migrate and look at the logs, we can notice that the gem automatically did the work for us:

You can see that the algorithm: :concurrently option was automatically added, there is no statement timeout, and the lock timeout is high enough so the index creation can wait a bit to acquire the lock.

A more complex case would be an add_reference, with foreign key and index. We saw that securing this operation requires some effort without proper tooling, with safe-pg-migrations we can go back to a vanilla implementation:

And if we look at the output SQL statement for this migration:

We can see that the resulting SQL is close to what we wrote by ourselves when we made this migration safe earlier in this article.

One of the obvious drawbacks of having low timeouts is that migrations are more likely to fail if the database is too busy. To maximize the success rate of our migrations, the gem also comes with a retry mechanism allowing a failed statement to be automatically replayed if it did not manage to acquire the required lock.

Let’s look at what adding a reference on a busy database could look like, for demonstration purposes we simulate a long running query by keeping a read transaction open (right terminal), preventing the patient_id column from being added on the appointments table

We see that the safe-pg-migrations catches the lock timeout, displays the possible culprit query, and retries a few seconds later. Once we commit the open transaction (releasing the lock held by the SELECT statement), the next retry manages to acquire the exclusive lock on the appointments table and the migration succeeds.

This migration was executed with the following settings:

SafePgMigrations.config.safe_timeout = 2.seconds
SafePgMigrations.config.retry_delay = 5.seconds

If you want to try it by yourselves, you just have to drop gem 'safe-pg-migrations'in your Gemfile, you can then execute rails db:migrate after exporting SAFE_PG_MIGRATIONS_VERBOSE=1to inspect the issued SQL statements.

You can also have a look at the complete feature set here.

So far, safe-pg-migrations allowed us to be more confident about our database migrations, one does not need an experienced reviewer to validate that a migration is safe to be executed on our production database.

However, we did not solve the issue entirely and there is still room for improvement.

Limitations

Our database is serving more and more requests, meaning that locks are more frequent and our migrations get more and more likely to fail because they do not manage to acquire their lock in time.

While we try to keep long-running transactions (and thus: lock) as rare as possible, we still suffer from a fair amount of unpredictability. It is particularly true if we consider Postgres mandatory maintenance operations (VACUUM) in the equation. When such operations run on a table, any migration trying to alter this table will fail. And as this kind of operation can last hours on large tables, we sometimes have to resort to manual operations to skip the migration and retry it later.

Thanks for reading!

Thanks for making it this far, if you are interested in this project, we are open for contributions, so if you happen to try the gem and find improvements, feel free to contribute by proposing issues or pull requests, we will gladly review them!

If you want more technical news, follow our journey through our docto-tech-life newsletter.

And if you want to join us scaling a high-traffic website and transform the healthcare system, we are also hiring talented developers to grow our tech and product team in France and Germany, feel free to have a look at the open positions.

Many thanks to Matthieu Prat who initiated the work in this field and authored this gem.