RDS Postgres Migration with Zero Downtime

Image from Pixabay — Airplane Engine

Have you ever wondered what it would feel like to change an airplane's engine while it is in flight? I for sure do not want to be on that flight, but that is exactly what we had to do at Calm!

As our company grows and expands, we continue to place a strong emphasis on safeguarding customer data, the engine of our app. That is exactly what we did here, all with zero downtime.

Problem Statement

  • Encrypt our existing RDS instance
  • Move our existing RDS instance from the original AWS account to the production account in our Organizations setup
  • Do all of this with zero downtime

Investigation

We started off by looking at AWS-native services for this migration, but unfortunately none of them met the zero-downtime requirement. We also considered other ETL tools we already had in place:

  • Convert read replica to master
    Results in downtime
  • AWS Data Migration Service
    Didn’t support a lot of our current table structures and schema definitions
  • Stitch Data
    Results in downtime

After some research, we landed on pglogical from 2ndQuadrant, which ticked all the boxes for us:

  • Works with RDS
  • Easy to set up
  • Fully integrated, requires no triggers or external programs to manage
  • Allows for multi-master Postgres replication across AWS accounts
  • Zero downtime!

With the right tool in hand, we set off to get things started.

Preparing the new RDS Instance

  • Export the schema from the source DB using the pg_dump utility (see the sketch after this list).
  • Ensure users are configured on the new instance.
  • Ensure parameter groups and option groups are consistent.
  • Use an AWS customer-managed key (CMK) for encryption instead of the default KMS key, so we can manage access to snapshots across AWS accounts if needed in the future.
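
For the schema step, a minimal sketch with pg_dump and psql (host names follow the DSNs used later; the user and database names are placeholders):

# dump schema only from the source instance
pg_dump --host=source.xxx.region.rds.amazonaws.com --username=master \
    --schema-only --no-owner db > schema.sql

# apply it to the new instance
psql --host=destination.xxx.region.rds.amazonaws.com --username=master \
    --dbname=db --file=schema.sql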

RDS Configuration

Update the parameter group values for the destination production DB.

The source DB parameter group must support the pglogical extension as well, so we apply the same values there.
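
For reference, the pglogical-related parameters look like this; shared_preload_libraries and rds.logical_replication are required for pglogical on RDS, while the worker and slot counts shown here are illustrative and workload-dependent:

shared_preload_libraries = pglogical
rds.logical_replication = 1
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10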

These values only take effect after a reboot of the DB. Multi-AZ deployments are unavailable for the time it takes the instance to fail over (usually about 60 seconds); this was the only step that required a brief blip in service, and it was unavoidable.

Once the RDS configuration is complete, we need to make sure there is connectivity between the two RDS instances.

We configured access between both RDS instances via VPC peering.
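
At a high level, the peering setup looks like this from the AWS CLI (all resource IDs, account IDs, and CIDRs are placeholders); the security group on each RDS instance also needs to allow port 5432 from the peer VPC's CIDR:

# request a peering connection from the source VPC to the destination VPC
aws ec2 create-vpc-peering-connection \
    --vpc-id vpc-aaaa1111 --peer-vpc-id vpc-bbbb2222 --peer-owner-id 222222222222

# accept it from the destination account
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-cccc3333

# add a route in each VPC's route table pointing at the peer's CIDR
aws ec2 create-route --route-table-id rtb-dddd4444 \
    --destination-cidr-block 10.1.0.0/16 --vpc-peering-connection-id pcx-cccc3333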

Configuring PGLogical to do the magic

One Way Replication

Update source DB

-- create extension
CREATE EXTENSION pglogical;
-- create pglogical node
SELECT pglogical.create_node(
    node_name := 'postgres1',
    dsn := 'host=source.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD'
);
-- make everything available for replication
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);

Update destination DB

-- create extension
CREATE EXTENSION pglogical;
-- create node
SELECT pglogical.create_node(
    node_name := 'postgres2',
    dsn := 'host=destination.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD'
);
-- subscribe to source
SELECT pglogical.create_subscription(
    subscription_name := 'subscription_prod1_to_prod2',
    provider_dsn := 'host=source.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD',
    synchronize_structure := false,
    synchronize_data := true,
    forward_origins := '{}'
);

The key here is that we have disabled synchronize_structure and enabled synchronize_data so that we are only synchronizing data between the two DBs. You can check the status of the replication using the following SQL command:
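
-- check on the subscription
SELECT * FROM pglogical.show_subscription_status();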

Based on our instance type and bandwidth, and following our tests in the development environment, we estimated that it would take about 4 days to synchronize all our data from the current production DB to the replication DB.

At this point, we had one-way replication in progress!

One way replication configured

We used pgdatadiff to compare the DBs, in addition to custom comparison metrics, to validate that all the data had been migrated to the destination instance and that live data was also flowing to it.
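
If you haven't used pgdatadiff before, a typical invocation looks something like this (connection strings are placeholders):

pgdatadiff --firstdb=postgres://user:PASSWORD@source.xxx.region.rds.amazonaws.com/db \
    --seconddb=postgres://user:PASSWORD@destination.xxx.region.rds.amazonaws.com/db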

When making changes to production, we have to account for any issues that might show up if things don't go as planned. We therefore also configured destination->source replication, so that if we ever had to roll back, we would not lose any data.

Two Way Replication

Update destination DB

-- make everything available for replication
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);
SELECT pglogical.replication_set_add_all_sequences('default', ARRAY['public']);

Update source DB
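
The subscription in the reverse direction mirrors the one-way setup. A minimal sketch: the subscription name here is illustrative, and synchronize_data is assumed false since the two DBs are already in sync (forward_origins := '{}' prevents changes from looping back).

-- subscribe to destination
SELECT pglogical.create_subscription(
    subscription_name := 'subscription_prod2_to_prod1',
    provider_dsn := 'host=destination.xxx.region.rds.amazonaws.com port=5432 dbname=db user=pglogical password=TEMPPASSWORD',
    synchronize_structure := false,
    synchronize_data := false,
    forward_origins := '{}'
);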

-- check on the subscription
SELECT * FROM pglogical.show_subscription_status();

At this point, two-way replication between source and destination should be running without any errors!

Two way replication configured using pglogical

Updating your services

We manage our configuration using SSM Parameter Store, and our deployments are done through Jenkins, Helmfile, and Helm. We first updated SSM Parameter Store to add a reference to the new RDS instance. Once the SSM parameters were in place, we updated our services' codebase to use LaunchDarkly to switch the DB on the fly without restarting pods. This was the magic sauce that allowed us to switch from the source DB to the destination DB in a split second with no downtime.
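
To illustrate the switch, here is a minimal TypeScript sketch assuming the launchdarkly-node-server-sdk and pg packages; the flag key, context key, and environment variable names are hypothetical:

import * as LaunchDarkly from 'launchdarkly-node-server-sdk';
import { Pool } from 'pg';

// connection strings injected from SSM Parameter Store at deploy time (names hypothetical)
const pools = {
  source: new Pool({ connectionString: process.env.SOURCE_DATABASE_URL }),
  destination: new Pool({ connectionString: process.env.DESTINATION_DATABASE_URL }),
};

const ld = LaunchDarkly.init(process.env.LD_SDK_KEY!);

// evaluate the flag on every call so a flip in LaunchDarkly takes effect
// immediately, without restarting any pods
export async function getDb(): Promise<Pool> {
  await ld.waitForInitialization();
  const useNewDb = await ld.variation(
    'use-new-rds-instance',   // hypothetical flag key
    { key: 'db-router' },     // service-level user/context for flag evaluation
    false                     // default: keep using the source DB
  );
  return useNewDb ? pools.destination : pools.source;
}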

Making the switch

Photo by James Basa from Pexels

With all the preparation we had done, flipping the switch in LaunchDarkly to point to the new RDS instance was mostly uneventful, which is good news. We saw an increase in latency for about 5 minutes as the app started using the new DB. There was no impact to any of our KPI metrics, and the infrastructure ran flawlessly.

It was a great journey for us here at Calm, and we plan to use a similar mechanism in the future when we upgrade our Postgres instance in production. If there are other methods you have used to move live workloads, please let us know! We'd love to hear from you!
