RDS Postgres Migration with Zero Downtime


Have you ever wondered what it would feel like to change planes mid-flight? I certainly would not want to be on that flight, but that is exactly what we had to do at Calm!

As our company grows and expands, we continue to place a strong emphasis on safeguarding customer data, the engine that powers our app, and that is exactly what we did here, all with zero downtime.

Problem Statement

  • Move our existing RDS instance from the original AWS account to the production account in our Organizations setup
  • Do all of this with zero downtime

Investigation

  • Promote a read replica to master
    Results in downtime
  • AWS Database Migration Service
    Didn't support many of our table structures and schema definitions
  • Stitch Data
    Results in downtime

After some research, we landed on using pglogical from 2ndQuadrant, which ticked all the boxes for us.

  • Works with RDS
  • Easy to set up
  • Fully integrated, requires no triggers or external programs to manage
  • Allows for multi-master Postgres replication across AWS accounts
  • Zero downtime!

With the right tool in hand, we set off to get things started.

Preparing the new RDS Instance

  • Ensure users are configured on the new instance.
  • Ensure parameter groups and option groups are consistent.
  • Use an AWS customer-managed key (CMK) for encryption instead of the default KMS key, so that we can manage cross-account access to snapshots if necessary in the future.
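Keeping parameter groups consistent is easy to get wrong by eyeball. A minimal sketch of an automated check (the parameter names below are illustrative, not a full RDS parameter list; in practice you would fetch each snapshot via `describe_db_parameters`):

```python
# Sketch: diff two parameter-group snapshots (name -> value) and report
# any parameters that differ between the source and destination instances.

def diff_parameters(source: dict, destination: dict) -> dict:
    """Return {param: (source_value, destination_value)} for mismatches,
    including parameters present on only one side."""
    mismatches = {}
    for name in sorted(set(source) | set(destination)):
        src, dst = source.get(name), destination.get(name)
        if src != dst:
            mismatches[name] = (src, dst)
    return mismatches

# Illustrative values only.
source_params = {"max_connections": "500", "work_mem": "4MB"}
dest_params = {"max_connections": "500", "work_mem": "8MB"}
print(diff_parameters(source_params, dest_params))
# {'work_mem': ('4MB', '8MB')}
```

Running a diff like this before and after the migration gives a quick sanity check that the two instances behave the same way.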

RDS Configuration

max_replication_slots = 20
max_worker_processes = 20
max_wal_senders = 20
rds.logical_replication = 1
shared_preload_libraries = pg_stat_statements,pglogical

The source DB parameter group must also support the pglogical extension, so we apply the same values there:

max_replication_slots = 20
max_worker_processes = 20
max_wal_senders = 20
rds.logical_replication = 1
shared_preload_libraries = pg_stat_statements,pglogical

These values only take effect after a reboot of the DB. During the reboot, multi-AZ deployments are unavailable for the time it takes the instance to fail over (usually about 60 seconds). This was the only step that required a brief blip in service, and it was unavoidable.
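During that failover window, clients briefly see connection errors. A minimal sketch of a retry-with-backoff wrapper that lets application code ride out the blip (a simplified illustration, not our production client code):

```python
import time

def with_retry(fn, attempts=5, base_delay=1.0):
    """Call fn, retrying with exponential backoff on connection errors.
    Re-raises after the final attempt. Useful for riding out a short
    multi-AZ failover window without surfacing errors to users."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Back off: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * (2 ** attempt))
```

In practice you would catch your driver's specific operational errors rather than the built-in `ConnectionError`.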

Once the RDS configuration is complete we need to make sure there is connectivity between both RDS instances.

We configured access between both RDS instances via VPC peering.

Configuring PGLogical to do the magic

One Way Replication

Update source DB

-- create role and grant access
CREATE ROLE pglogical LOGIN ENCRYPTED PASSWORD 'TEMPPASSWORD';
GRANT rds_superuser TO pglogical;
GRANT rds_replication TO pglogical;
GRANT ALL ON ALL TABLES IN SCHEMA public TO pglogical;
GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO pglogical;

-- create extension
CREATE EXTENSION pglogical;

-- create pglogical node
SELECT pglogical.create_node(
  node_name := 'postgres1',
  dsn := 'host=source.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD'
);

-- make everything available for replication
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);

Update destination DB

-- create role and grant access
CREATE ROLE pglogical LOGIN ENCRYPTED PASSWORD 'TEMPPASSWORD';
GRANT rds_superuser TO pglogical;
GRANT rds_replication TO pglogical;
GRANT ALL ON ALL TABLES IN SCHEMA public TO pglogical;
GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO pglogical;

-- create extension
CREATE EXTENSION pglogical;

-- create node
SELECT pglogical.create_node(
  node_name := 'postgres2',
  dsn := 'host=destination.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD'
);

-- subscribe to source
SELECT pglogical.create_subscription(
  subscription_name := 'subscription_prod1_to_prod2',
  provider_dsn := 'host=source.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD',
  synchronize_structure := false,
  synchronize_data := true,
  forward_origins := '{}'
);

The key here is that we disabled synchronize_structure and enabled synchronize_data, so only data is synchronized between the two DBs; the schema must already exist on the destination, which we ensured when preparing the new instance. You can check the status of the replication using the following SQL command:

-- check the status of replication
SELECT * FROM pglogical.show_subscription_status();
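Beyond the subscription status, it is useful to track how far behind the subscriber is. Postgres LSNs (as returned by `pg_current_wal_lsn()` on the provider) are strings like `16/B374D848`; a sketch of converting them to byte positions so the lag can be graphed (the helper names are ours, not part of pglogical):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte
    position. The part before the slash counts 4 GiB units; the part
    after is a hex offset within that unit."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def replication_lag_bytes(provider_lsn: str, replayed_lsn: str) -> int:
    """Bytes of WAL the subscriber still has to replay."""
    return lsn_to_bytes(provider_lsn) - lsn_to_bytes(replayed_lsn)
```

Polling this every minute and alerting when the lag stops shrinking gives early warning that the initial sync has stalled.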

Based on our instance type and bandwidth, and following our tests in the development environment, we estimated that it would take about 4 days to synchronize all our data from the current production DB to the replication DB.

At this point, we had one-way replication in progress!

One way replication configured

We used pgdatadiff to compare the DBs, in addition to custom comparison metrics, to validate that all the data had been migrated to the destination instance and that live data was also flowing to it.
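One simple custom metric is comparing per-table row counts on both sides. A sketch of the comparison step, assuming you have already queried the counts from each instance (on a live system the counts will never match exactly, hence the tolerance):

```python
def compare_row_counts(source_counts: dict, dest_counts: dict, tolerance: int = 0) -> dict:
    """Return {table: (source_rows, dest_rows)} for tables whose counts
    differ by more than `tolerance`. A small tolerance allows for rows
    still in flight through replication on a live system."""
    drifted = {}
    for table in set(source_counts) | set(dest_counts):
        src = source_counts.get(table, 0)
        dst = dest_counts.get(table, 0)
        if abs(src - dst) > tolerance:
            drifted[table] = (src, dst)
    return drifted
```

Row counts alone cannot prove the contents match, which is why a checksum-based tool like pgdatadiff is still worth running on top.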

When making changes to production, we have to account for any and all issues that might show up if things don't go as planned. In preparation for a rollback, we also wanted to configure destination->source replication, so that in the event we had to roll back, we would not lose any data.

Two Way Replication

Update destination DB

-- make everything available for replication
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);

Update source DB

-- subscribe to destination
SELECT pglogical.create_subscription(
subscription_name := 'subscription_prod2_to_prod1',
provider_dsn := 'host=destination.xxxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD',
synchronize_structure := false,
synchronize_data := false,
forward_origins := '{}'
);
-- check on the subscription
SELECT * FROM pglogical.show_subscription_status();

At this point, two-way replication between source and destination should be running without any errors!

Two way replication configured using pglogical

Updating your services

Making the switch


With all the preparation we had done, flipping the switch in LaunchDarkly to point to the new RDS instance was mostly uneventful, which is exactly what you want. We saw increased latency for about five minutes as the app warmed up against the new DB. There was no impact on any of our KPI metrics, and the infrastructure ran flawlessly.
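We won't reproduce our actual service code here, but the flag-driven cutover can be sketched as resolving the active DSN per request, so a flag flip takes effect without a deploy (`get_flag` stands in for the real feature-flag client; the class and names are illustrative):

```python
class FlaggedDatabase:
    """Pick the active database DSN from a feature flag on every lookup,
    so a LaunchDarkly-style flag flip cuts traffic over, and back, with
    no code change or restart."""

    def __init__(self, get_flag, source_dsn: str, destination_dsn: str):
        self.get_flag = get_flag          # callable returning True once cutover is on
        self.source_dsn = source_dsn
        self.destination_dsn = destination_dsn

    def active_dsn(self) -> str:
        """Called when opening a connection; never cached across requests."""
        return self.destination_dsn if self.get_flag() else self.source_dsn
```

Because both DSNs stay configured and two-way replication keeps both instances current, rolling back is the same one-flag change in the other direction.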

It was a great journey for us here at Calm, and we plan to use a similar mechanism in the future when we upgrade our Postgres instance in production. If there are other methods that you have used to move live workloads, please let us know. We'd love to hear from you!
