Pulling off zero-downtime PostgreSQL migrations with Bucardo and Terraform

Shalvah · Getsafe · Jun 17, 2024

At Getsafe, we’ve been running our web services and primary databases in Heroku from the start, but we recently decided to move to AWS + Qovery for cost savings and increased flexibility. We run a service-oriented architecture, so each service needs to be migrated individually.

With the help of the Qovery team, we came up with a way to migrate our services to AWS. However, monitoring of our pilot showed a significant increase in latency (around 100ms) between our services in AWS and their databases still on Heroku. This wasn’t directly due to the different providers, but rather because our AWS services were running in a different cloud region (unavailable on Heroku); every request needed longer network trips. We immediately halted the migration of more services, to prevent a scenario of cascading latency increases leading to an overall degraded UX.

This spurred us to fast-track our plan to migrate our databases from Heroku to Amazon RDS. The case was strengthened by the fact that some of our databases on Heroku were running outdated PostgreSQL versions, and Heroku had informed us we needed to upgrade by a certain date.

Rather than upgrade on Heroku (a process which would involve some downtime and planning) and later migrate to RDS (which would potentially also need downtime), we decided to do it in one shot: migrate our databases from their old versions on Heroku to new versions on RDS.

Finding a technical solution

Migrations are complex projects, but the first step was deciding on a way to actually do this. The most basic way would be good ol’ pg_dump and pg_restore: turn off the app, dump the old database, import the dump into the new database, switch the app to point to the new database, and you’re done. This could work, since you can dump from an older version and restore into a newer one. However, we didn’t like this option because we had some big databases (several hundred gigabytes), and it would likely mean more downtime than was acceptable.

We took a look at first-class Heroku support for migrating databases, but were disappointed. AWS Database Migration Service, probably the easiest and most reliable way to do this, does not work on Heroku, as Heroku does not provide users with superuser access to their database.

This means that Heroku Postgres does not support external logical replication, although they’re apparently working on it.

Aside: What’s Logical Replication?

Logical replication is a form of copying data from one database to another by copying the logical units (i.e. rows, tables, sequences, etc.). Compare this with physical replication, where you copy the actual on-disk data wholesale (the pg_dump/pg_restore approach above is similar in spirit: you take a full snapshot, which means downtime, since the data must not change while you’re copying it). Generally, logical replication will be slower than physical replication, and can also degrade database performance while it runs, but it has the advantage of not requiring any downtime and being more flexible.
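
For context, PostgreSQL (version 10 and newer) ships with built-in logical replication based on publications and subscriptions; a minimal setup looks something like the sketch below (hosts, names, and credentials are placeholders). This is exactly the kind of thing that Heroku’s lack of superuser access rules out.

# On the source database: publish the tables to replicate
# (the source also needs wal_level = logical and a sufficiently privileged user)
psql "host=old-host dbname=my_service user=..." \
  -c "CREATE PUBLICATION my_pub FOR ALL TABLES;"
# On the target database: subscribe to that publication
psql "host=new-host dbname=my_service user=..." \
  -c "CREATE SUBSCRIPTION my_sub CONNECTION 'host=old-host dbname=my_service user=...' PUBLICATION my_pub;"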

When we contacted Heroku support, they offered us an alternative: they would give us access to the write-ahead logs (WAL) of our database, and we could use wal-g to set up a replica somewhere to ingest these logs, which we could then replicate to RDS. However, this approach involved a level of complexity that we were not comfortable with, so we kept considering other options.

And then Anar brought Bucardo to our attention. It’s an open-source tool for live logical replication of one PostgreSQL database to another, with support for two-way replication, replicating specific tables, replicating within the same database, transforming the data as you replicate, and more.

I set up a proof of concept with some services in staging, and, after some fiddling, it worked. It looked like a viable solution, so we decided to go with it.

Four months later, we’ve successfully migrated all our production databases to RDS, with the majority of them not needing any downtime. How did we do it, and what did we learn?

Bucardo

At its core, this is how you use Bucardo:

# Set up Bucardo's internal Postgres database
bucardo install

# Provide the database configurations
bucardo add db my_service_heroku host=... port=... dbname=... user=... conn="sslmode=require"
bucardo add db my_service_rds host=... port=... dbname=... user=... conn="sslmode=require"

# Specify the list of tables and sequences to be replicated
bucardo add all tables db=my_service_heroku --relgroup=my_service
bucardo add all sequences db=my_service_heroku --relgroup=my_service

# Configure the sync in Bucardo. The parameter `onetimecopy=1` tells Bucardo to
# first copy all current data from old to new, before going into live replication mode.
bucardo add sync my_service relgroup=my_service dbs=my_service_heroku,my_service_rds onetimecopy=1

# Start the sync as a background process
bucardo start

With this, Bucardo will copy the current data from old to new, then start live replicating any changes. And when you’re ready, you can then switch your app to point to the new database.

Of course, there’s still a lot of work involved. But first, it’s useful to know how Bucardo works.

There are two key details:

  • When you run bucardo add sync, Bucardo adds triggers to your old database. These triggers are invoked whenever a row is inserted, updated, or deleted; the trigger function notifies the Bucardo process of the change, which then applies it to the new database.
  • For the initial copy (onetimecopy=1), Bucardo uses Postgres’ COPY to copy the initial data from the old database to the new. It runs COPY TO STDOUT to copy from the old database to standard output, then streams that to the standard input of the new database, importing the data using COPY FROM STDIN (sketched below).
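
As a rough illustration (not Bucardo’s actual code; hosts, credentials, and the table name are placeholders), the initial copy of a single table is conceptually equivalent to piping the two COPY commands together:

# Stream a table's rows from the old database straight into the new one
psql "host=old-host dbname=my_service user=..." -c "COPY my_table TO STDOUT" \
| psql "host=new-host dbname=my_service user=..." -c "COPY my_table FROM STDIN"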

Note that Bucardo copies data into an existing schema. It does not create tables, so you’ll need to set up the schema on the new database beforehand. Every table must also have a primary key, which Bucardo uses to identify which rows have changed during live replication.
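
One way to check for tables without a primary key ahead of time is a query along these lines against the old database (assuming everything lives in the public schema):

# List tables in the public schema that have no primary key constraint
psql "host=... port=... dbname=... user=..." -c "
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
  ON tc.table_schema = t.table_schema
 AND tc.table_name = t.table_name
 AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'public'
  AND t.table_type = 'BASE TABLE'
  AND tc.constraint_name IS NULL;"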

Infrastructure setup

The next important detail: how and where to set this up. Bucardo can run anywhere, as long as it can communicate with both databases over the network. In our case, the database on Heroku was accessible over the Internet, while the RDS database was in a private network on AWS. We decided to run Bucardo on an EC2 instance in a public network, because we wanted to be able to access it via SSH. To let it reach the RDS database, we set up a peering connection between the EC2 instance’s network and the RDS database’s private network. Overall, the Bucardo instance talked to the Heroku database over the Internet and to the RDS database over the peering connection.
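
A quick sanity check from the EC2 instance confirms both databases are reachable (connection details elided; the Heroku connection needs sslmode=require, as in the Bucardo config above):

# Should print the (old) Postgres version of the Heroku database
psql "host=... port=5432 dbname=... user=... sslmode=require" -c "select version()"
# Should print the (new) Postgres version of the RDS database, via the peering connection
psql "host=... port=5432 dbname=... user=..." -c "select version()"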

Terraform

Once we had figured out the infra we needed, we focused on automating it with Terraform. Terraform is a tool for managing infrastructure as code (IaC), which we had recently adopted for managing our AWS infrastructure. Even though this seemed like a one-off project, considering that it would take a lot of planning and testing, not to mention repeating it for each service, it was a no-brainer to use Terraform for this as well.

Terraform brought us several advantages here:

Transience: With a single command, we could spin up or destroy a Bucardo (EC2) or RDS instance. This made do-overs easy when experimenting or in the case of mistakes.
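
In practice this was just the standard Terraform workflow:

# Create (or update) the Bucardo EC2 instance and associated resources
terraform apply
# Tear everything down when done, or to start over from scratch
terraform destroy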

Reproducibility: When we spun up a new instance, everything “just worked”, rather than us having to go to the AWS console and configure things by hand.

Automation: On each new instance, we preconfigured Bucardo and RDS as needed, reducing the chance of human mistakes. The job of the developer was primarily to log in to the server, start the sync, and keep an eye on things.

Templates: An underrated feature of Terraform, templates allowed us to generate the commands the developer would need to copy and paste, primarily for debugging or for the places where manual intervention was needed. For instance, any time we set up a Bucardo instance for a service, Terraform generated two files:

  • start_instructions.my_service.sh, which contained the instructions for the developer to run to kick off the sync, plus commands to monitor and debug it (sketched below).
  • cleanup_instructions.my_service.sh, which contained commands to undo Bucardo’s changes if something went wrong and we wanted to restart, and to clean up once the migration was done.
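
For illustration, start_instructions.my_service.sh would contain something along these lines (a simplified sketch, not the exact generated contents; connection details and the table name are placeholders):

# Kick off the sync (Bucardo runs as a background daemon)
bucardo start
# Check the state of the sync
bucardo status my_service
# Spot-check that rows are arriving in the new database
psql "host=... port=5432 dbname=... user=..." -c "select count(*) from some_table"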

Putting it all together, here’s what our Bucardo + Terraform setup ultimately looked like (if you aren’t interested in Terraform, the key Bucardo instructions are in the scripts/ folder):

Technical challenges

Before we figured out the overall migration, we had to deal with several technical challenges:

Bucardo status reporting

Bucardo is really bad at reporting its status. We had a few scares because it looked like something was wrong when nothing actually was, and we lost some time when something really was wrong and we didn’t know it. At some point I had to dig through the source code (it’s Perl, but surprisingly easy to follow). Here’s a guide I made on following Bucardo’s progress.
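
For what it’s worth, two checks that help (the second is just one way to see whether the initial COPY into the new database is still running; see the linked guide for more):

# Overview of the Bucardo daemon and each sync's state
bucardo status
# On the new (RDS) database: any long-running COPY statements still in flight?
psql "host=... port=5432 dbname=... user=..." \
  -c "select pid, state, query_start, left(query, 60) from pg_stat_activity where query ilike 'copy%'"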

Copying large databases

During testing, we found that our largest databases were taking several days to copy. This was unacceptable. After some investigating, I figured out the problem: indexes. Removing the indexes in the new database before copying sped things up by over 100x (most COPYs now finished in under an hour).

This shouldn’t come as a surprise if you’re familiar enough with databases. It made sense to me in retrospect, but getting there was an investigative adventure, which I’ve documented. I also got to learn a bit more about monitoring database performance.

Schema imports

Importing the schema from the old database to the new took some trial and error.

  • pg_dump --schema-only is the way to do this, but you can’t restore that with pg_restore, so you need to use psql -f schema.sql to execute the dumped schema statements.
  • pg_dump dumps all schemas in the database, including some Heroku-specific ones. We only wanted the default schema (public), so that became -n public. However, this meant that extensions weren’t dumped. We needed these extensions (such as pgcrypto for generating UUIDs); the fix for this was specifying --extension=*.
  • Unfortunately, some of our databases were on much older versions where this flag was not supported. So we switched to using psql to list all extensions and generate a script to install them.

Our final dump/import script looked like this:

psql "host=${heroku_host} port=5432 dbname=${heroku_database} user=${heroku_username}" \
-c "select format('CREATE EXTENSION %s;', extname) from pg_extension" \
--tuples-only -o ${service_name}_schema_extensions.sql
pg_dump "host=${heroku_host} port=5432 dbname=${heroku_database} user=${heroku_username}" \
-n public --schema-only > ${service_name}_schema.sql

# We'll drop all indexes temporarily, as they slow down the initial copy significantly. We'll add them later.
grep -oP 'CREATE (UNIQUE )?INDEX \K[^ ]+' ${service_name}_schema.sql \
| awk '{print "DROP INDEX " $1 ";"}' > ${service_name}_temporarily_drop_indexes.sql
# This file will be manually executed later, when the copy is done, to import indexes
grep -E '^CREATE (UNIQUE )?INDEX' ${service_name}_schema.sql > ${service_name}_add_indexes.sql

# Import extensions and schema
psql "host=${rds_host} port=5432 dbname=${rds_database} user=${rds_bucardo_username}" \
-f ${service_name}_schema_extensions.sql
psql "host=${rds_host} port=5432 dbname=${rds_database} user=${rds_bucardo_username}" \
-f ${service_name}_schema.sql
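
Once the initial copy was done, the indexes were added back using the file generated above, along the lines of:

# Re-create the indexes that were skipped during the initial copy
psql "host=${rds_host} port=5432 dbname=${rds_database} user=${rds_bucardo_username}" \
-f ${service_name}_add_indexes.sql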

PostgreSQL permissions hell

We spent a surprising amount of time on Postgres privileges. We wanted to set up proper database user accounts in RDS with the right permissions (one account for the service to read/write/make schema changes, and one account for our data warehousing pipeline to read from the replica). PostgreSQL kept serving up surprises at every turn. For instance:

  • GRANT USAGE ON SCHEMA public TO this_user does not actually give this_user the ability to query that schema. Neither does GRANT ALL PRIVILEGES ON DATABASE the_database.
  • GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public will allow the user to query the schema, but only the tables which currently exist (for future tables you need ALTER DEFAULT PRIVILEGES IN SCHEMA). It will also not allow creating new tables.
  • It’s really difficult to give a user full permissions on a schema. Your best option is to make them the OWNER.
  • Even though your current user account is a superuser, you may still be unable to grant some privileges to others. For example, running ALTER DEFAULT PRIVILEGES IN SCHEMA will not have the intended effect unless you are the owner of the schema, and Postgres does not report an error. In fact, in many of these cases, Postgres gives a misleading error message or none at all.
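
For reference, the kind of setup described above looks roughly like this (a sketch with placeholder role names; depending on ownership, you may need to run parts of it as the schema owner or another sufficiently privileged account):

psql "host=${rds_host} port=5432 dbname=${rds_database} user=..." <<'SQL'
-- Give the service account full control by making it the schema owner
ALTER SCHEMA public OWNER TO my_service_user;
-- Read-only access for the warehouse account: existing tables...
GRANT USAGE ON SCHEMA public TO warehouse_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO warehouse_user;
-- ...and tables the service account creates in the future
ALTER DEFAULT PRIVILEGES FOR ROLE my_service_user IN SCHEMA public
  GRANT SELECT ON TABLES TO warehouse_user;
SQL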

Terraform/AWS

AWS is super complex, and using Terraform makes it more obvious, as you must create and manage each resource yourself. This is in contrast to the AWS UI, where some resources are auto-created for you and defaults are set. It’s made worse by the fact that you must tread carefully to avoid creating a security hole. Additionally, for all of its amazing-ness, there are some things Terraform still does not support, or some things that AWS does not let you manage.

A task like “I want one machine to talk to another” took me several frustrated hours of configuring routes, route tables, network ACLs, ingress and egress rules, peering connections, and gateways. AWS Reachability Analyzer was helpful at one point in figuring out what I was missing.

There were of course other minor challenges, such as pg_dump version mismatches (mostly fixed by making sure to apt install the client version matching the major Postgres version on Heroku), and funny issues with Postgres passwords (tip: don’t use a ‘:’ in your PostgreSQL password).
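
For example, if the Heroku database was on PostgreSQL 12 (the version number here is just an illustration), the matching client tools would be installed on the Bucardo machine with something like:

# Install client tools matching the source database's major version
# (depending on the distribution, this may require the PostgreSQL apt repository)
sudo apt install postgresql-client-12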

Switching databases: planning, downtime, maintenance mode and communication

Once we had figured out how the migration would work, we began to plan the execution. Since Bucardo would copy data and then switch to live replication, we only had to decide when to switch each service from the old database to the new one. In an ideal world, you could do this at any time, and everything would work smoothly with no downtime. However, we needed to account for replication lag.

Replication lag in this case meant that writes made to the old database while we were switching over might take a while to be replicated to the new database. And depending on what relies on that database, this could have wider consequences.

Replication lag is an especially serious concern if you use sequential IDs: if an item with ID 10 is created in the old database and isn’t replicated before the new database creates a different item with ID 10, you have a conflict. Bucardo allows you to configure how to handle conflicts, but you really want to avoid this situation. We didn’t have this concern, though, as we use UUIDs for most of our primary keys.

Still, we got hit by the replication lag in one of our early migrations. After the switch, we had about 15 minutes of “record not found” errors (meaning a record had been created in the old database that was not yet in the new). It all got caught up eventually, but we stepped back to plan better.

We came up with a two-tier strategy:

  • For Tier 1 services (low write traffic to the database), minimal replication lag was expected. We would only turn off background jobs, make the switch, and then turn them back on.
  • For Tier 2 services (heavy write traffic), we would stop the service, wait for replication to catch up, switch the database, and restart it. There would be downtime, but not more than a few minutes.

For the cases where downtime was needed, we had to set up proper communication channels, not just with the other developers, but also with customer support, claims handlers, and management. For each service, we would pick a time with low traffic, estimate the maximum expected duration of downtime and what functionality might be affected, and communicate this ahead of time.

Additionally, we introduced a maintenance mode flag to temporarily prevent access to our app and web flows with a nice explanatory message.

All this was important so that we could minimize surprise and frustration from internal and external users.

This strategy worked. We were able to move most of our services from Heroku to RDS with no downtime.

It’s been several months of planning, testing, and figuring out hiccups, and it was fulfilling to finally pull it off. Thanks to everyone on the team, especially Anar Bayramov and Dan Svetlov, for their support. Now we switch to monitoring our new setup and ensuring it performs well.

Useful resources

Some other sources around the web we found helpful for working with Bucardo:

Migrating Postgres from Heroku to RDS

Leen15/install_bucardo.sh

How we used Bucardo for a zero-downtime PostgreSQL migration · Smartcar blog
