Migrating PostgreSQL standalone instance to Aurora (Amazon RDS)

Working at startup require fast changes. You start a new feature with proof of concept, and then on the success you want it to stay and have to make it stable. That’s not the case for standalone PostgreSQL instance we’ve had at Storyline for about a year now. We just grew out and had to change our infrastructure. We want our database to be easy to scale in a cloud, replicable, fault tolerant, and all that without any administrative hassle. In other words, we need to migrate to Amazon Relational Database Service (RDS), furthermore, we want it to be Aurora: PostgreSQL compatible relational database built for the cloud, that combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. The problem to be solved is how to migrate all of ours 15 Gb of data to Aurora with zero downtime.

AWS Database Migration Service (DMS) to rescue! Replication configuration of our standalone PostgreSQL server is the only thing needs to be done…

It turned out to be a bit more complicated…

The idea behind migration is simple: create Aurora cluster in RDS, setup database source and target endpoints at DMS, and launch database migration task. The data migration task will migrate all data from the source database, and replicate ongoing changes to the target endpoint. All you have to do is to point your web server to the new Aurora cluster, and hot reload web server once the source and target endpoints completely in sync. However, the devil is in the detail.

It took me about a day to figure out the right path for that migration. Our staging server had a couple of up and down cycles during endless retries of me figuring out what was wrong. Let me share my experience, while DMS migrates another million rows to Aurora.

Here the most important parts for the successful migration:

  1. The source database server has to be configured properly for replication
  2. DMS will not migrate your schema one to one, leaving out all the indices, column defaults, stored procedures, etc
  3. Auto incremental indices have to be fixed after migration

Another thing that bothered me, is AWS VPC security groups. For simplicity, I’ve set up a security group with the inbound and outbound rule of 0.0.0.0/0 Which was updated to more secure mask right after migration was completed.

VPC security group have gotten very loose

#1 Let’s start with source database configuration

First things first: source database has to accept remote connections and be configured for upcoming replication task. I’ve created a new DB user named repuser with superuser role and replication privilege:

sudo -u postgres createuser -U postgres repuser -P --replication --superuser

Once that is done, you can check new user roles via \du in psql console.

To allow remote connections and replication for that user, add these 2 lines to the very end of pg_hba.conf sudo vi /etc/postgresql/9.5/main/pg_hba.conf:

# Amazon DMS to Aurora
host mydb_prod repuser 0.0.0.0/0 md5 # mydb_prod - source database
host replication repuser 0.0.0.0/0 md5

For replication configuration set the following parameters and values in postgresql.conf sudo vi /etc/postgresql/9.5/main/posgresql.conf:

  • wal_level = logical
  • max_replication_slots = 5
  • max_wal_senders = 5
  • wal_sender_timeout = 0

…now to AWS console for Aurora cluster

OK. My source database (old production database) is set up and ready for DMS to take its’ data and load to Aurora cloud. Let’s jump to AWS straight to RDS section and create an instance of Aurora or cluster if you wish. Nothing fancy here, just follow the instructions. Putting a few screenshots for completeness:

Amazon Aurora Engine
DB instance details with Multi A-Z deployment

A note here — I did not create a database at this point, since I will load my schema manually later. But it’s up to you. You can create or delete DB anytime.

Store RDS user name and password for later reference, and once your RDS instance or cluster is ready, grab endpoint url of the instance or cluster. Should be something like “rds-aurora-prod.kdfoe7d9ww9do.us-east-2.rds.amazonaws.com”. We will need it for DMS target endpoint.

…replication instance, source and target endpoints in DMS

Alright! Now to DMS section in AWS. That’s where the job will be done. Four items to be created here:

  1. Replication instance — cloud computing power to handle the data migration, and ongoing replication.
  2. Source endpoint — connection settings for source database (that’s my production database connection)
  3. Target endpoint — connection settings for target database (that’s my newly created Aurora cluster)
  4. Task — the process that orchestrates endpoints and replication instance to get job done

Replication instance. My experience is 5 hours for 15Gbs of data migrated via dms.t2.medium instance. Choose wisely…

dms.t2.medium took about 5 hours to migrate 15Gbs of data

Source endpoint utilizes the configuration we did in part #1 for “repuser”. Later you can test your connections, to make sure no mistake sneaked in.

Source endpoint reply on configuration of repuser in #1

Target endpoint is as simple as checking “Select RDS instance” checkbox and choosing newly created aurora RDS instance from drop down.

Target endpoint is simple: check Select RDS instance and choose RDS instance from drop down

DMS Task, is where all the work happens. Its’ responsibility to migrate data from source endpoint to target endpoint, cache all changes that may happen during the initial data transfer and apply cached and ongoing changes to the target database.

DMS task is the heart of data migration

#2 DMS will not transfer your schema 1 to 1, neither indices!

A bit later after another “successful” migration, I’ve noticed column defaults in Aurora did not match the ones I’ve had in my original database. More than that, my sequences and indices were gone! Here my solution for that — in the Task settings:

  1. Uncheck “Start task on create”
  2. Set “Target table preparation mode” to Truncate
  3. Set “Stop task after full load completeness” to Stop before applying cached changes.

(I will load schema manually before I launch DMS task)

Target table preparation mode = Truncate AND Stop task after full load completeness = Stop Before

Here you will need to go on your own, and create schema in Aurora using credentials and endpoint you’ve got earlier. At Storyline we run Ruby on Rails at the back-end, if you do too, you might want to follow along next 3 steps:

  1. Change your config/database.yml with Aurora endpoint and credentials
  2. Run bin/rails db:create — to create database on Aurora
  3. Rub bin/rails db:schema:load — to load schema with indices to Aurora
IMHO: Most framework our days have some sort of wrapper for persistence layer, specifically for databases. You should be able to use tools they provide to easily switch database endpoint and load schema or rum migrations.

Since we’ve setup DMS Task “Target preparation mode” to Truncate, the schema will remain unchanged after the DMS migration, and all indices will be there. However the auto incremental sequences will be broken, and I will fix that at “After full load complete” momentum. Remember “Stop task after full load completeness” = Stop before applying cached changes?

#3 Fix broken auto-incremental sequences

Great! About 5 hours later since I’ve started my DMS Task for production database migration to Aurora the “Full load” step has completed, and now the whole process is paused and waiting for me to fix the auto-incremental indices. That’s a common problem when copying or recreating a database.

Create a file reset-seq.sql with content of:

SELECT 'SELECT SETVAL(' ||
quote_literal(quote_ident(PGT.schemaname) || '.' || quote_ident(S.relname)) ||
', COALESCE(MAX(' ||quote_ident(C.attname)|| '), 1) ) FROM ' ||
quote_ident(PGT.schemaname)|| '.'||quote_ident(T.relname)|| ';'
FROM pg_class AS S,
pg_depend AS D,
pg_class AS T,
pg_attribute AS C,
pg_tables AS PGT
WHERE S.relkind = 'S'
AND S.oid = D.objid
AND D.refobjid = T.oid
AND D.refobjid = C.attrelid
AND D.refobjsubid = C.attnum
AND T.relname = PGT.tablename
ORDER BY S.relname;

Execute the query in reset-seq.sql with this command:

psql -hAURORA_ENPOINT -UAURORA_USER -Atq -f reset-seq.sql -o temp.sql -dDATABASE_NAME

As a result temp.sql file will contain an indices reset query. Execute it with:

psql -f temp.sql -hAURORA_ENPOINT -UAURORA_USER -dDATABASE_NAME

Safely remove temp file rm temp.sql

At that point my database has all indices in place and working, and I can safely resume DMS migration process, which will apply cached changes an run ongoing changes replication.

FINAL: Update web server end-point and hot-restart web server

Wait for DMS Task to by fully synced. Once it is synced, update you config or ENV variables of web server to point to Aurora RDS, and hot-restart your web server.

How to know when Aurora is in sync with old production database? Aurora does not support pg_last_xact_reply_timestamp() and other functions related to replication. AFAIK the only way to figure out the lag between master and slave is Cloud Watch Metrics. You will be looking for CDCLatencyTarget:

CDCLatencyTarget

That’s it. Now you can safely shutdown your old production PostgreSQL standalone instance and enjoy the luxury of database in the cloud.


Those commands were useful during my journey to Aurora. Might be good for you too.

SELECT * FROM pg_replication_slots; # List replication slots
SELECT pg_drop_replication_slot('slot name'); # Close replication slot (they are not cleaned up automatically
pg_dump -hHOST_URL -UDATABASE_USER -dDATABASE_SCHEMA -F c -b -v -f production.dump
pg_restore -hHOST_URL --clean --no-acl --no-owner -dLOCAL_DATABASE_SCHEMA -v production.dump