Zero downtime migrations with ActiveRecord

In this first post on the Klaxit TechBlog, I would like to talk about a topic that matters a great deal to our team: how to keep delivering new features to our users without any interruption of the service. It can be very disappointing for users when their favorite app doesn’t evolve as fast as their needs do, but hitting an ugly error message when they need the app most may be even worse, and for us this is not an acceptable trade-off.

The biggest challenge we found in deploying code with no downtime was how to deal with model changes. Pedro Belo’s blog post helped a lot in understanding the problem but doesn’t detail all cases. Paul Gross went further but didn’t give any code examples. We realised there was still a lack of public sources on the subject.

Our CTO, Cyrille Courtière, gave a talk at Paris.rb’s monthly meetup describing how we implement zero downtime database migrations in our development process. This article is our second contribution to the community on this issue.

Backward compatibility

The basic principle that allows us to avoid downtime while keeping a simple deployment procedure is: any model change in the code should be compatible with the current database schema.

This way every deployment follows the same steps, whether it contains a model change or not:

  1. Configure environment variables
  2. Deploy code
  3. Run migrations

Patterns

Each kind of model change can cause a different downtime issue and calls for a different pattern. New code must not break while running against the old database schema, and database migrations must not cause any downtime themselves.

Overview

  • Adding indexes: create indexes concurrently
  • Removing indexes: safe
  • Adding columns: check if the column has been created
  • Removing columns: ignore column from cache
  • Renaming columns: create new column, migrate data, remove old column
  • Adding tables: check if the table has been created
  • Removing tables: safe
  • Renaming tables: downtime

Adding indexes

By default, adding an index to a column locks the table against writes, so no request that creates or updates records of this model can be processed until the migration finishes. To avoid downtime, indexes must be created concurrently.

Note that the migration must call disable_ddl_transaction!; otherwise it runs within a transaction, and PostgreSQL cannot create an index concurrently inside a transaction block.

Removing indexes

Safe in practice, as removing an index only takes a lock for a brief moment rather than for the duration of a long operation.

Adding columns

Features using a newly created attribute on your model will break if they try to read from or write to the column before it has been created in the database. To make sure the code works even before the migration runs, we override the attribute’s accessors with safety guards.

Note that any other methods related to the new column, such as new_column_changed?, must also be overridden with the same kind of guards.

Adding non-nullable columns

Sometimes we add a required attribute to a model, with a default value for existing records. This must be done in three steps:

  • Add new column
  • Populate all rows with default value

Keep in mind that a single UPDATE over all rows would block writes to the whole table until it completes, which is why the update must be done in batches. The migration must not run within a transaction for the same reason.

  • Add NOT NULL constraint

ActiveRecord model validation on the new attribute must also check if the column is present in the database.

Removing columns

Before the column is dropped, tell ActiveRecord to ignore it, so that the column list it caches no longer includes the removed column.

Renaming columns

To ensure no downtime on deploy, instead of simply renaming the column we do it in three steps:

  • Add new column
  • Migrate values from old column to new column
  • Remove old column

For everything to keep working throughout these steps, we must keep reading from the old column while it’s still there, start writing into the new column as soon as it’s created, and read from the new column only once the data is fully migrated and the old column no longer exists.

Adding tables

Every time we add a new model to the application, we temporarily keep additional checks to ensure that no code breaks before the table is actually created in the database.

Removing tables

Totally safe, as newly deployed code is not using that table.

Renaming tables

Renaming tables with zero downtime is a big challenge. We would have to create the new table, keep writing into both tables, migrate all the data, remove the old table, then start using the new one. This procedure is very heavyweight and still gives no guarantee against database locks, which is why this is the only migration we run with some downtime… but we don’t rename tables very often.

Conclusion

I hope this article helps you avoid ever displaying maintenance pages to your users again. If you find an error in our procedure or have any questions, let us know by sending an e-mail to dev@klaxit.com.

Like what you read? Give Felipe Batista a round of applause.