How to change a column type in your production PostgreSQL database

Gauthier François
Doctolib
Nov 22, 2021

In this article, you will discover how we migrated our Int fields to BigInt on our managed PostgreSQL service using logical replication.

PostgreSQL BigInt migration

The Doctolib application uses PostgreSQL as the main transactional database engine. This architecture is based on a primary server for writing and several replicas for reading. The total size of our database is several terabytes of data (and still growing!).

Many tables used an integer as a primary key, as this was the default type before Rails 6.

The limit of a signed integer (32 bits) in PostgreSQL is +2,147,483,647. The current ID of the notifications table was 1,882,824,827; at the rate we were writing to this table, we had about 90 days left before we reached that limit. Once it was reached, it would no longer be possible to insert new records, and the Doctolib application would no longer function correctly (no more appointments, notifications, etc.).

The notifications table was the closest to the limit, but the appointments table was also problematic; its own deadline was estimated at 200 days. The same was true for all the tables referencing appointments.

Therefore, we decided to migrate these primary keys from the “Integer” (Int) type to the “BigInteger” (BigInt) type. The aim was to carry out this migration with minimal downtime and, above all, without any data loss.

One of the major challenges was to make this migration within the framework of a managed service.

Migration requirements

Our action plan was as follows:

  1. Imagine the different ways of carrying out this migration.
  2. Find a solution allowing us to carry it out with the shortest possible downtime and without data loss.
  3. Validate and test the end-to-end solution on our different environments, to confirm the proper behavior of the solution before the production migration.

At Doctolib, we have several environments that are very close to production. This is a considerable advantage for migration testing. In addition to testing the migration scenario, we were able to practice several times to ensure a smooth process with minimal operational impact.

However, the amount of data and usage led us to rethink our migration strategy. Indeed, the small dataset on our test environment, compared to production, made some of our migration plans impossible when applied within the constraints we had set ourselves. We had to generate a relevant dataset without too much volume, so that the tests were representative but could still be executed in a reasonable time.

The maximum service interruption we could tolerate was 30 minutes.

Migration strategy

All strategies requiring the “ALTER TABLE” command had to be discarded. Indeed, this command takes a “LOCK” on each table for as long as the migration operation is running. This lock is exclusive and blocks any further writes to the relations being modified. Consequently, the Doctolib application would become inaccessible, and for an indeterminately long period of time given the volume of data involved. We therefore had to use non-blocking strategies to get around this problem.

We therefore opted for a backup and restore solution.

Using a managed service is a great advantage in terms of usability, but can sometimes be a constraint. Indeed, some permission restrictions made the task a bit more complex. In our case, we couldn’t disable the “TRIGGERS” on the tables. Here is the error we encountered:
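
On a managed service, disabling triggers without superuser rights fails with an error similar to the following (the table name and the trigger identifier are illustrative):

    -- Attempted without superuser rights on a managed service
    ALTER TABLE notifications DISABLE TRIGGER ALL;
    -- ERROR:  permission denied: "RI_ConstraintTrigger_a_16485" is a system trigger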

After many tests, here is the summary of our operation to modify the primary and foreign keys from Int to BigInt:

  1. Initialize and configure a new managed database
  2. Prepare and back up the database with a replication slot
  3. Restore the “pre-data” section of the backup
  4. Delete the views (because of the trigger permission restriction)
  5. Change the schema (from Int to BigInt)
  6. Restore the views
  7. Restore the “data” section of the backup
  8. Restore the “post-data” section of the backup
  9. Set up logical replication

More information about the pre/post/data sections:

The data section contains actual table data as well as large-object definitions. Post-data items consist of definitions of indexes, triggers, rules and constraints other than validated check constraints. Pre-data items consist of all other data definition items. PostgreSQL.

Why use logical replication?

Logical replication is a method of replicating data objects and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. (PostgreSQL supports both mechanisms concurrently). PostgreSQL

As you can see, the advantage is that you don’t have to worry about the data structure during the replication of the database. So in our case, we had a database with BigInt columns replicating from our production database, which still used Int.

In order to clarify the terminology in the next part, let me define the terms “publisher” and “subscriber”: the publisher was the production cluster, and the subscriber was the new cluster replicating from the publisher. Next is a step-by-step guide to the migration.

The goal was to achieve an architecture in which the production cluster (publisher) replicated logically into the new BigInt cluster (subscriber).

Initialization and configuration of the new managed cluster

No big surprise for this step. Don’t forget that the user who will be used for logical replication must have full rights on the database and also be able to connect to the publisher.
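
As a sketch, on a self-managed cluster this could look as follows (role, password, and database names are illustrative; managed services usually expose an equivalent replication role to grant instead of the REPLICATION attribute):

    -- Role used for the dump, the restore and the logical replication
    CREATE USER migration_user WITH PASSWORD '...' REPLICATION;
    GRANT ALL PRIVILEGES ON DATABASE doctolib TO migration_user;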

Preparation and backup of the database with a replication slot

Prepping the publisher database mostly means we need to set up a publication.

A publication is essentially a group of tables whose data changes are intended to be replicated through logical replication. See Section 31.1 for details about how publications fit into the logical replication setup. PostgreSQL
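
A minimal sketch of this step, run on the publisher (the publication name is illustrative):

    -- Publish the data changes of every table in the database
    CREATE PUBLICATION bigint_migration FOR ALL TABLES;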

At this point, we still have no data in transit, so we will proceed to back up the database using a database dump.

Next, we create a replication slot which will be the reference for catching up on the data for the subscriber while we set the replication up. Please note that you must open a connection in replication mode to create a replication slot.
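
A sketch of this step (host, database, slot name, and the snapshot value shown in the output are illustrative):

    # Open a connection in replication mode (note replication=database)
    psql "host=publisher.example.com user=migration_user dbname=doctolib replication=database"

    -- In this session, create the logical slot and note the snapshot_name column
    CREATE_REPLICATION_SLOT bigint_migration_slot LOGICAL pgoutput;
    --        slot_name       | consistent_point |    snapshot_name    | output_plugin
    -- ----------------------+------------------+---------------------+---------------
    --  bigint_migration_slot | 0/361E4D8        | 00000003-000001EA-1 | pgoutput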

Warning:

  1. The snapshot_name cannot be displayed afterwards; we will need it for the restoration.
  2. You need an active connection (in replication mode) to make a dump with a snapshot name. You must keep the current connection open.
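
With that replication connection still open, the dump can then be taken on the exported snapshot (the snapshot value comes from the previous step; hosts and paths are illustrative). The directory format (-Fd) is what later allows a parallel restore:

    # Consistent dump anchored on the slot's exported snapshot
    pg_dump -Fd --snapshot=00000003-000001EA-1 \
      -h publisher.example.com -U migration_user \
      -f /backup/doctolib doctolib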

Restoring the “pre-data” section

The goal here is to restore as little data as possible so that we can make our changes.
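
A sketch of the restore command (hosts and paths are illustrative):

    # Restore only the schema definitions: tables, types, functions...
    # (no data, and no indexes or constraints yet)
    pg_restore --section=pre-data -j 8 -h subscriber.example.com \
      -U migration_user -d doctolib /backup/doctolib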

The “-j 8” option allows parallelization according to the number of CPU cores. I advise you to set this parameter to the number of cores minus 2 (keep 2 cores free so as not to overload the server).

Deleting views

Deleting views is necessary because modifying them via “ALTER TABLE” requires specific permissions that we don’t have with our managed service.
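
For example (the view names are illustrative; CASCADE also removes views that depend on the dropped one):

    DROP VIEW IF EXISTS appointment_statistics CASCADE;
    DROP MATERIALIZED VIEW IF EXISTS notification_daily_counts;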

Repeat the operation for each view and materialized view.

Schema modification

Here we are at the step that modifies the schema.
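
At this stage the tables are still empty (only the “pre-data” section has been restored), so the type change is instantaneous. A sketch, with illustrative table and column names:

    -- Primary key
    ALTER TABLE notifications ALTER COLUMN id TYPE bigint;
    -- Foreign keys referencing it must be widened too
    ALTER TABLE notification_events ALTER COLUMN notification_id TYPE bigint;
    -- Widen the range of the associated sequence as well (PostgreSQL 10+)
    ALTER SEQUENCE notifications_id_seq AS bigint;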

Repeat the operation for all the schema changes you want to make.

In our case, we took the opportunity to do this modification on many columns whose sequences were close to the Int type limit.

Restoring views

To restore the views, you must first extract them from the publisher. We scripted the export and the restoration of the views for convenience.

Exporting views:
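
A sketch of the export, generating one CREATE statement per view from the system catalogs (the output file name is illustrative):

    -- Run on the publisher with psql
    \pset tuples_only on
    \o views.sql
    SELECT 'CREATE VIEW ' || schemaname || '.' || viewname || ' AS ' || definition
    FROM pg_views
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
    SELECT 'CREATE MATERIALIZED VIEW ' || schemaname || '.' || matviewname || ' AS ' || definition
    FROM pg_matviews
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
    \o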

Restoring views:
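
A matching sketch for the restore (same illustrative names):

    # Replay the extracted definitions on the subscriber
    psql -h subscriber.example.com -U migration_user -d doctolib -f views.sql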

Restoring the “data” and “post-data” sections

Data restoration is the longest step (6 TB+ in our case).
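
A sketch of these two restores (same illustrative names as above):

    # The longest step: load the table data...
    pg_restore --section=data -j 8 -h subscriber.example.com \
      -U migration_user -d doctolib /backup/doctolib
    # ...then recreate indexes, constraints and triggers
    pg_restore --section=post-data -j 8 -h subscriber.example.com \
      -U migration_user -d doctolib /backup/doctolib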

Setting up logical replication

Beware that once logical replication is in place, any schema modification (table creation, addition of a column, etc.) on the publisher must also be applied to the subscriber. Otherwise, the replication pauses.

Then, we can initiate the logical replication.
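
A sketch of the subscription, run on the subscriber (names match the earlier illustrative publication and slot):

    -- copy_data = false: the dump/restore already loaded the data
    -- create_slot = false: the slot already exists on the publisher
    CREATE SUBSCRIPTION bigint_migration_sub
      CONNECTION 'host=publisher.example.com dbname=doctolib user=migration_user password=...'
      PUBLICATION bigint_migration
      WITH (copy_data = false, create_slot = false, slot_name = 'bigint_migration_slot');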

Also, in the context of our Rails application, we must not forget to activate the migration mechanism on our new database as well. The best approach is to block schema migrations between the beginning and the end of the logical replication setup, to avoid conflicts once the replication becomes operational.

Migration

For each database intervention, we use our failover checklist to build our intervention plan; in this case, for the migration of our database.

I’ll skip the pre-maintenance parts (PR preparation, configuration checks, access checks, etc.) and concentrate on the “downtime” part. This was the most important part for us, since we wanted to carry out this migration with a minimum of service interruption.

Below is the service interruption part of our intervention checklist:

1. Put the application into maintenance mode (service interruption begins)

2. Stop writes to the database
Since we couldn’t stop the database itself (in case we had to backtrack), we prevented the production user from connecting to the database and cut the connections already established.
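
A sketch of that lock-out (role and database names are illustrative; this assumes CONNECT is granted to the application role directly rather than through PUBLIC):

    -- Block new connections from the application role...
    REVOKE CONNECT ON DATABASE doctolib FROM app_user;
    -- ...and terminate the sessions already established
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE usename = 'app_user' AND datname = 'doctolib';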

3. Copy the sequences; we realized that copying the sequences was necessary because logical replication after a dump/restore does not do this.
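
A sketch of the sequence copy: generate one setval() call per sequence on the publisher, then execute the generated statements on the subscriber (requires PostgreSQL 10+ for pg_sequences):

    -- Run on the publisher; the output is a script to replay on the subscriber
    SELECT format('SELECT setval(%L, %s);',
                  schemaname || '.' || sequencename, last_value)
    FROM pg_sequences
    WHERE last_value IS NOT NULL;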

4. Configure the application with the new database

5. Verify that the applications operate correctly with the new database

6. Take the application out of maintenance mode (service interruption ends)

7. Clean up

In the end, we successfully completed this migration with less than 5 minutes of downtime.

Did you like this article? Subscribe to our newsletter to learn more!


Gauthier François
Doctolib

Pragmatic SRE, production alerts killer, legacy maintainer, operational excellence advocate.