Migrating and Converting Databases to Amazon Aurora RDS Postgres

Peter Tran
Simply Wall St
May 9, 2022

The Motivation

As our customer base and engineering team continue to grow at SimplyWallSt (SWS), our database architecture has had to keep evolving with them.

Back in SWS’s infancy in 2014, we built our MVP on a single-instance database. Over time, the database server instance size continued to grow as the product’s features and popularity grew. In early 2019, we moved to a master/replica 3-tier set-up to accommodate the high demand in user traffic and data processing. Fast-forward from 2019 to the present day in 2022, and SWS has seen a surge of growth in platform activity, customer base and the size of our internal engineering team. The database has again become a bottleneck, not because of high activity or user traffic, but because it limits future growth and auto-scaling.

As detailed in earlier blog posts, we have migrated our infrastructure from IBM Cloud to AWS. The last piece of our migration from IBM Cloud (and admittedly the most complex) was to move our main database from IBM bare-metal (dedicated) servers to managed AWS RDS instances.

To add to the migration complexity, we also wanted to convert the existing databases that weren’t running on Postgres over to the Postgres engine at the same time.

The motivation behind this is explained below.

The End Goal

The final design of our database looks like this:

  • Amazon Aurora RDS — PostgreSQL 13.3
  • Running on db.r6g.8xlarge instances
  • High-availability with auto-scaling Reader/Writer instances

While it would have been simpler to bring the platform completely offline to conduct the migration, we opted to set the database to read-only mode and keep most of our applications running so company reports are still accessible to public users. This reduces the financial impact on SWS and allows us to carry out the migration behind the scenes. The read-only state remains intact until our applications cut over to use the new database on Amazon RDS. See our other blog for details on how we orchestrated this.

Why The Move?

At SWS, we like to take a long-term outlook because our mission is to make investing accessible to everyone, regardless of demographic constraints. The engineering team unanimously agreed that, to position ourselves for success, we needed a database built for high scalability and growth.

To do this, we leverage managed services.

Some key issues that would be solved in our migration include:

  • Database Engine Fragmentation — We have fragmentation of database engines across our platform, which is more prevalent among our older/legacy services. The newer services provisioned in the last 12–18 months use Postgres, and our financial data provider has more recently also released support for Postgres. For the sake of developer experience and maintenance, we’re unifying on PostgreSQL as our relational database platform of choice.
  • Server Scalability — Our existing database servers were configured in a master/replica 3-tier server arrangement built for fault tolerance. We had 3x bare metal (dedicated) servers in a Quorum-based replication configuration.
    - As these servers are built on hardware with physical moving parts, performing maintenance or upgrading server resources requires downtime, with planned maintenance windows in a minimum 3-hour block and failover procedures carried out beforehand. We have also seen first-hand the danger of physical hardware failures (e.g corruption to RAID controllers, S.M.A.R.T. drive test failures). The longer the downtime, the riskier the partial outage becomes: our backups momentarily stop, synchronous commits stop, transaction logs begin to grow and it generally takes longer for the servers to re-synchronise transactions. The server specs were as follows:
Server: Dual Intel Xeon Gold 5120 (28 Cores, 2.20 GHz)
Operating System: Ubuntu Linux 18.04 LTS Bionic Beaver Minimal Install (64 bit)
RAM: 64 GB RAM
Hard Drive: 3.8TB SSD
Secondary HDD: 4TB SATA (for backups)
  • Cost — When we compared the computational power of our existing bare-metal dedicated servers to db.r6g.8xlarge instances on AWS, the cost of operating Amazon RDS Postgres in an equivalent configuration was comparable to our existing set-up, so there was no negative cost impact. We did give up some raw computational power; the trade-off is a slightly higher-utilised RDS instance at normal traffic levels, but the service is managed and auto-scaling instances kick in for any unexpected surges of activity.
  • Maintenance and Expertise — The majority of our current engineers have experience with Postgres, and the landscape has changed since our early MVP days, so it makes sense to adopt Postgres as our primary platform. Postgres also works well with our container images, given there’s proper PDO driver support, and it makes complex cross-server SQL queries possible.

The Strategy

In this article, we’ll focus solely on the database migration process itself.

We scheduled a 1-hour downtime period and stopped all write transactions while the migration took place.

Overall, the process looks like this:

  • Pre-Preparation Work
  • Utilising A Migration Script
  • Setting Up A Safe, Testable Environment
  • Schedule Maintenance Downtime
  • Amazon DMS — Full Data Migration Load
  • Debugging and Troubleshooting Tips

On-going Replication Strategy Challenges

Initially, we did consider a highly-available migration with ongoing data replication, where writes to the database remain supported, but this was met with the following challenges:

  • Transaction logs can grow infinitely — We use a backup model at SWS that relies on transaction logs, which record every transaction on the database so we can restore to any given point in time. Transaction logs can grow infinitely if backups are not performed on a regular basis. They normally shrink only when a backup is performed, and if replication is turned on, all subscribers need to send acknowledgements before the transaction logs can shrink.
  • On-going replication is complex to set up — On the source database, this required us to set up a distribution database, publishers and subscribers, which generally rely on transaction logs to replicate transactions to the subscribing (or target) servers. This presents a challenge given how our database architecture is currently configured:
    - Master/slave replication interferes with ongoing data replication — Our current configuration performs synchronous commits to maintain a master/replica quorum, so that if a failover occurs another server can take over as the master. When a failover happens due to unexpected downtime on our master server, transaction logs begin to grow because our backups are interrupted, and they keep growing until the main server is restored to its master status and all servers are synchronised and back to a healthy state.
  • Disk space is finite — Our existing set-up usually runs with ~40% free disk space on the master database server. Given that transaction logs can grow infinitely, and given the amount of data processing incurred to analyse companies daily, long migration runs risk filling the server’s disk to 100%. Beyond this point it is extremely difficult to recover, as we would be hard-pressed to find enough free disk space to run an ad-hoc database backup and shrink the transaction logs.

With the above complexities in mind, we opted to introduce a small window of downtime that stops all writes to the source and target databases until the migration is complete.

Pre-Preparation Work

To speed up the process and promote concurrent data migration tasks, we’ve grouped related database tables into ‘domains’, otherwise known as migration tasks. This also makes testing and validating each subset of data easier. These domains include:

SimplyWallSt internal business domains

We also took this opportunity to do some maintenance tidy up by leaving behind old tables that are no longer used or referenced in code.

Utilising a migration script

We opted to build a migration script to facilitate the migration process. The script lives in a git version-controlled project. Building a script to perform the migration allowed us to:

  • Produce repeatable and predictable migration test runs that can be replayed over and over again — this allowed us to fine-tune the script itself and improve data load speeds over time;
  • Make it easy to perform repetitive tasks, such as wiping the slate clean, which cascade-drops existing tables ready for a new migration run;
  • Enable multiple users to contribute to the script, giving each user their own isolated environment to test the migration in; and
  • Maintain a history of minor changes we’re making to the migration script, easily reverting any problematic changes.
Our database migration script

Software Requirements

The migration instance requires the following to run:

# Install base dependencies
sudo apt-get update
sudo apt-get install -y unzip postgresql-client sbcl libsqlite3-dev make curl gawk freetds-dev libzip-dev

# Install Postgres Client
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt update
sudo apt install postgresql-client-13
pg_dump --version

# Install PgLoader
cd /tmp
wget https://github.com/dimitri/pgloader/releases/download/v3.6.2/pgloader-bundle-3.6.2.tgz
tar xvzf pgloader-bundle-3.6.2.tgz
cd pgloader-bundle-3.6.2
sudo apt install make build-essential sbcl unzip
make
./bin/pgloader --version
sudo cp ./bin/pgloader /usr/local/bin/pgloader
cd /tmp
rm -rf /tmp/pgloader-bundle-3.6.2
rm /tmp/pgloader-bundle-3.6.2.tgz

# Install AWS CLI v2
cd /tmp
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --bin-dir /usr/local/bin --install-dir /usr/local/aws-cli
/usr/local/bin/aws --version
rm -rf /tmp/aws
rm /tmp/awscliv2.zip

Set up DMS resources and populate environment settings

The Amazon Resource Names (ARNs) for the resources created below were used to populate a .env file consumed by the migration script.

Create DMS Resources

  1. Create a DMS replication instance
  2. Create a DMS source database endpoint
  3. Create a DMS target database endpoint
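
These resources can be created in the AWS console or with the AWS CLI. As a rough sketch only — the identifiers, instance class, storage size and connection details below are placeholders, and the source engine name should match whatever engine your source database runs on:

# 1. Replication instance (identifier, class and storage are placeholders)
aws dms create-replication-instance \
    --replication-instance-identifier <replication-instance-name> \
    --replication-instance-class dms.r5.2xlarge \
    --allocated-storage 200

# 2. Source database endpoint
aws dms create-endpoint \
    --endpoint-identifier <source-endpoint-name> \
    --endpoint-type source \
    --engine-name <source-engine-name> \
    --server-name <source-db-host> \
    --port <source-db-port> \
    --username <source-db-username> \
    --password <source-db-password> \
    --database-name <source-db-name>

# 3. Target database endpoint (Aurora PostgreSQL)
aws dms create-endpoint \
    --endpoint-identifier <target-endpoint-name> \
    --endpoint-type target \
    --engine-name aurora-postgresql \
    --server-name xxx.cluster-xxx.ap-southeast-2.rds.amazonaws.com \
    --port 5432 \
    --username <target-db-username> \
    --password <postgres-password> \
    --database-name <target-db-name>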

Populate .env file

We have a .env file that holds the AWS ARNs and database connection details needed to perform the load tasks:

REPLICATION_INSTANCE=arn:aws:dms:ap-southeast-2:xxx:rep:xxx
SOURCE_ENDPOINT_ARN=arn:aws:dms:ap-southeast-2:xxx:endpoint:xxx
SOURCE_ENDPOINT_IDENTIFIER=<source-endpoint-name>
TARGET_ENDPOINT_ARN=arn:aws:dms:ap-southeast-2:xxx:endpoint:xxx
TARGET_ENDPOINT_IDENTIFIER=<target-endpoint-name>
SOURCE_DB_USERNAME=<source-db-username>
SOURCE_DB_HOST=<source-db-host>
SOURCE_DB_NAME=<source-db-name>
SOURCE_DB_PASSWORD=<source-db-password>
TARGET_DB_USERNAME=<target-db-username>
TARGET_DB_HOST=xxx.cluster-xxx.ap-southeast-2.rds.amazonaws.com
TARGET_DB_NAME=<target-db-name>
PGPASSWORD=<postgres-password>
SCHEMA=public

These allow us to connect to our existing resources to run load tasks.

Each individual contributor can have their own namespace to work with by changing the SCHEMA environment variable to their own name (e.g petert).

Migration Script

We have a bash script that we run with ./run-dms-task.sh:

#!/bin/bash
set -e
TASK=$(echo $1 | tr '[:upper:]' '[:lower:]')
START=false
RESUME=false
STOP=false
SLATE=false
ENV_FILE="./.env"
prepare_premigration () {
    JSONB_CONVERSION_FILE="./tasks/$TASK/jsonb-conversion.sql"
    if test -f "$JSONB_CONVERSION_FILE"; then
        echo "[SQL] Convert existing text type JSON columns to JSONB..."
        psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$(envsubst < $JSONB_CONVERSION_FILE)"
    fi

    DROP_INDEX_KEYS_FILE="./tasks/$TASK/drop-keys.sql"
    if test -f "$DROP_INDEX_KEYS_FILE"; then
        echo "[SQL] Dropping all index and unique keys..."
        DROP_INDEX_KEYS_SQL=$(psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -tA -c "$(envsubst < $DROP_INDEX_KEYS_FILE)")
        psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$DROP_INDEX_KEYS_SQL"
    fi

    DROP_FOREIGN_KEYS_FILE="./tasks/$TASK/drop-foreign-keys.sql"
    if test -f "$DROP_FOREIGN_KEYS_FILE"; then
        DROP_FOREIGN_KEYS_SQL=$(psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -tA -c "$(envsubst < $DROP_FOREIGN_KEYS_FILE)")
        if [ ! -z "$DROP_FOREIGN_KEYS_SQL" ]
        then
            echo "[SQL] Dropping foreign keys..."
            psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$DROP_FOREIGN_KEYS_SQL"
        fi
    fi
}
if [ -f "$ENV_FILE" ]
then
    export $(cat .env | xargs)
else
    echo "Missing '.env' file. Run 'cp .env.dist .env' and populate with environment variables"
    exit 1
fi
# PgLoader will pick up any load options we specify in the PGOPTIONS environment variable
export PGOPTIONS="--search_path=$SCHEMA"
args=( "$@" )
for arg; do
    case $arg in
        "--start")
            START=true
            ;;
        "--resume")
            RESUME=true
            ;;
        "--stop")
            STOP=true
            ;;
        "--slate")
            SLATE=true
            ;;
    esac
done
if [ "$START" = true ] && [ "$RESUME" = true ]; then
echo "Cannot start and resume at the same time. Choose one option only"
exit 1;
fi
SCHEMA_EXISTS=$(psql -h "$TARGET_DB_HOST" -U "$TARGET_DB_USERNAME" "$TARGET_DB_NAME" -tA -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_schema = '$SCHEMA')")
if [ "$SLATE" = true ]; then
PGLOADER_FILE="./tasks/$TASK/pgloader.cmd"
if test -f "$PGLOADER_FILE"; then
echo "[DELETE] TARGET schema. Start with a CLEAN SLATE"
SLATE_SQL=$(envsubst < "./tasks/$TASK/slate.sql")
psql -h "$TARGET_DB_HOST" -U "$TARGET_DB_USERNAME" "$TARGET_DB_NAME" -tA -c "$SLATE_SQL"
pgloader "./tasks/$TASK/pgloader.cmd"
fi
prepare_premigration
elif [ "$SCHEMA_EXISTS" = "f" ]; then
echo "[CREATE] TARGET schema"
pgloader "./tasks/$TASK/pgloader.cmd"
prepare_premigration
fi
# Check if replication task exists
echo "[LOOKUP] Check if existing replication tasks exists"
DMS_REPLICATION_TASK_ARN=$(aws dms describe-replication-tasks --output text --query "ReplicationTasks[?(ReplicationTaskIdentifier=='$SCHEMA-$TASK')].["ReplicationTaskArn"]");
CREATE_REPLICATION_TASK=false
if [ -z "$DMS_REPLICATION_TASK_ARN" ]
then
    CREATE_REPLICATION_TASK=true
elif [ "$SLATE" = true ]; then
    aws dms delete-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN > /dev/null
    SLEEP_SECONDS=20
    echo "[DELETE] Replication Task. Waiting $SLEEP_SECONDS seconds for resource to delete..."
    sleep $SLEEP_SECONDS
    CREATE_REPLICATION_TASK=true
fi
if [ "$CREATE_REPLICATION_TASK" = true ]; then
TARGET_ENDPOINT_ARN=$(aws dms describe-endpoints --output text --query "Endpoints[?(EndpointIdentifier=='$TARGET_ENDPOINT_IDENTIFIER')].[EndpointArn]")
SOURCE_ENDPOINT_ARN=$(aws dms describe-endpoints --output text --query "Endpoints[?(EndpointIdentifier=='$SOURCE_ENDPOINT_IDENTIFIER')].[EndpointArn]")
mapping=$(envsubst < "./tasks/$TASK/dms-table-mapping.json")
replicationtasksettings=$(envsubst < "./tasks/default-dms-task-settings.json")
aws dms create-replication-task \
--replication-task-identifier $SCHEMA-$TASK \
--source-endpoint-arn $SOURCE_ENDPOINT_ARN \
--target-endpoint-arn $TARGET_ENDPOINT_ARN \
--replication-instance-arn $REPLICATION_INSTANCE \
--migration-type full-load \
--table-mappings "$mapping" \
--replication-task-settings "$replicationtasksettings" > /dev/null
echo "[CREATED] Replication Task '$SCHEMA-$TASK'"
fi
DMS_REPLICATION_TASK_ARN=$(aws dms describe-replication-tasks --output text --query "ReplicationTasks[?(ReplicationTaskIdentifier=='$SCHEMA-$TASK')].["ReplicationTaskArn"]")

if [ "$START" = true ]; then
    echo "[START] Replication Task"
    aws dms start-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN \
        --start-replication-task-type start-replication > /dev/null
elif [ "$RESUME" = true ]; then
    echo "[RESUME] Replication Task"
    aws dms start-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN \
        --start-replication-task-type resume-processing > /dev/null
elif [ "$STOP" = true ]; then
    echo "[STOP] Replication Task"
    aws dms stop-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN > /dev/null
fi
echo "[DONE]"

Migration script sequence

The migration script performs the following processes in order:

  1. Convert Schema and Create an AWS DMS Task
    - Drop all primary indexes, unique indexes and foreign key constraints on the target database
    - Convert existing json_array type columns from varchar to JSONB on the target database
  2. Start the full load AWS DMS replication task
  3. Restore foreign constraints
  4. Re-sync incrementing sequence numbers
  5. Restore indexes

1. Convert Schema and Create an AWS DMS Task

Setting up a migration task consists of:

  • Creating the schemas on the target database; and
  • Creating an AWS DMS replication task

These two steps are performed by running

./run-dms-task.sh <task-folder-name>
# e.g ./run-dms-task.sh events
Task folder with migration commands
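
Each task folder contains the files the script looks for: pgloader.cmd, slate.sql, drop-keys.sql, drop-foreign-keys.sql, jsonb-conversion.sql and dms-table-mapping.json. As a rough sketch, a dms-table-mapping.json for a domain whose tables share a common prefix could look like the following — the source schema name, table prefix and rule names are placeholders, and ${SCHEMA} is substituted by envsubst in the script:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-domain-tables",
      "object-locator": {
        "schema-name": "<source-schema-name>",
        "table-name": "<table-prefix>%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "rename-schema",
      "rule-target": "schema",
      "object-locator": {
        "schema-name": "<source-schema-name>"
      },
      "rule-action": "rename",
      "value": "${SCHEMA}"
    }
  ]
}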

During this run, the PgLoader command will be executed. An example of a PgLoader config we used during the migration looks like this:

LOAD DATABASE
    FROM drivername://{{SOURCE_DB_USERNAME}}:{{SOURCE_DB_PASSWORD}}@{{SOURCE_DB_HOST}}/{{SOURCE_DB_NAME}}
    INTO postgresql://{{TARGET_DB_USERNAME}}:{{PGPASSWORD}}@{{TARGET_DB_HOST}}/{{TARGET_DB_NAME}}

WITH include drop, create tables, create indexes, reset no sequences,
     foreign keys, preserve index names, schema only

INCLUDING ONLY TABLE NAMES LIKE 'companyanalysis%' in schema 'default'

CAST type numeric when (= precision 28) to numeric keep typemod,
     type datetime2 to timestamp drop default drop not null using zero-dates-to-null

ALTER SCHEMA 'default' RENAME to '{{SCHEMA}}'

AFTER LOAD DO
    $$ DROP EXTENSION IF EXISTS "uuid-ossp"; $$,
    $$ CREATE extension "uuid-ossp"; $$;

* Note that we intentionally drop all indexes and key constraints after the schema is created on the target database to maximise data loading throughput, as per AWS DMS best practices.

The SQL to drop the indexes is generated by running:

select format('DROP INDEX IF EXISTS %I.%I CASCADE;', schemaname, indexname) as drop_statement
from pg_indexes
where schemaname = '${SCHEMA}'
AND (tablename LIKE '<table-prefix>%')
AND indexname NOT IN (
SELECT con.conname
FROM pg_catalog.pg_class c
INNER JOIN pg_catalog.pg_constraint con ON c.oid = con.conrelid
INNER join pg_catalog.pg_namespace n on n.oid = c.relnamespace)
ORDER BY indexname;

The SQL to drop foreign constraints is generated by running:

select format('ALTER TABLE %I.%I DROP CONSTRAINT %I;', table_schema, table_name, constraint_name)
from information_schema.table_constraints
where table_schema = '${SCHEMA}'
AND constraint_type = 'FOREIGN KEY'
AND table_name LIKE '<table-prefix>%';

* We additionally want to take advantage of JSONB when we migrate to PostgreSQL for efficient parsing and querying of JSON data:

ALTER TABLE ${SCHEMA}.<table>
ALTER COLUMN <column> TYPE JSONB USING <column>::jsonb;
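
Once a column is JSONB, it can be indexed and queried natively. A purely illustrative example — the table, column and JSON key below are hypothetical:

-- Hypothetical: index the JSONB column and run a containment query against it
CREATE INDEX IF NOT EXISTS <table>_<column>_gin_idx
    ON ${SCHEMA}.<table> USING gin (<column>);

SELECT *
FROM ${SCHEMA}.<table>
WHERE <column> @> '{"currency": "USD"}';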

2. Start the full load AWS DMS replication task

Starting the migration is as simple as running:

./run-dms-task.sh <task-folder-name> --start
# e.g ./run-dms-task.sh events --start

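While the full load runs, its progress can be polled with the AWS CLI. A minimal sketch, using the same $SCHEMA-$TASK identifier convention as the script:

# Check the status and full-load progress of the replication task
aws dms describe-replication-tasks \
    --filters Name=replication-task-id,Values=$SCHEMA-$TASK \
    --query "ReplicationTasks[0].{Status:Status,Progress:ReplicationTaskStats.FullLoadProgressPercent,TablesLoaded:ReplicationTaskStats.TablesLoaded,TablesErrored:ReplicationTaskStats.TablesErrored}"
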
3. Restore Foreign Constraints

This is a series of straightforward SQL commands we run to restore the existing foreign constraints, e.g.

ALTER TABLE ${SCHEMA}.<table> ADD CONSTRAINT <foreign-key-id> FOREIGN KEY (<column>)
REFERENCES ${SCHEMA}.<table> (<column>) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION;
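
As an alternative to writing these statements by hand, the definitions can be captured from the PostgreSQL catalogs before the constraints are dropped and replayed after the load completes. A sketch:

-- Generate ALTER TABLE ... ADD CONSTRAINT statements for every foreign key in the schema
SELECT format('ALTER TABLE %I.%I ADD CONSTRAINT %I %s;',
              n.nspname, c.relname, con.conname, pg_get_constraintdef(con.oid))
FROM pg_catalog.pg_constraint con
INNER JOIN pg_catalog.pg_class c ON c.oid = con.conrelid
INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE con.contype = 'f'
  AND n.nspname = '${SCHEMA}';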

4. Re-Sync Incrementing Sequence Numbers

On PostgreSQL, we can re-sync the sequence numbers to match the maximum ID in each table:

SELECT setval('${SCHEMA}.<table>_id_seq', max(id)) FROM ${SCHEMA}.<table>;
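
To avoid writing one statement per table, the same statements can be generated from the catalog, assuming the auto-incrementing column is named id (adjust to suit your tables):

-- Generate a setval statement for every table whose id column is backed by a sequence
SELECT format('SELECT setval(%L, COALESCE(MAX(id), 1)) FROM %I.%I;',
              pg_get_serial_sequence(format('%I.%I', schemaname, tablename), 'id'),
              schemaname, tablename)
FROM pg_tables
WHERE schemaname = '${SCHEMA}'
  AND pg_get_serial_sequence(format('%I.%I', schemaname, tablename), 'id') IS NOT NULL;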

5. Restore Indexes

This is a series of straightforward SQL commands we run to restore the existing indexes, e.g.

CREATE INDEX IF NOT EXISTS <index-id> ON ${SCHEMA}.<table> USING btree (<column>);
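
As with the foreign keys, the CREATE INDEX statements can be captured from pg_indexes before the indexes are dropped and replayed once the load completes, e.g.:

-- Capture the full definition of every index on the tables being migrated
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname = '${SCHEMA}'
  AND tablename LIKE '<table-prefix>%';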

Conclusion

What gave us the most confidence to conduct the database migration was being able to perform the migration tasks over and over again in a controlled environment, wiping the slate clean every time.

The more practice runs we did, the more we were able to tweak the migration script, improving the migration speed and our overall confidence.

While AWS DMS is able to perform schema conversions as part of the DMS replication task, it was not flexible enough for the amount of post-conversion follow-up changes we needed to make.

We decided to perform the schema conversions using PgLoader because it gave us greater control over the conversion and the ability to run AFTER LOAD SQL statements, allowing us to alter columns on the target database on the fly, create custom triggers, functions and views, and load in extensions. PgLoader didn’t give us enough flexibility in excluding certain indexes and constraints on the fly, so we migrated these by hand as a follow-up action in our migration script.

As per AWS DMS best practices, Amazon recommends dropping indexes and foreign constraints during the full load, so we baked this into our script and re-built the indexes and constraints afterwards. This significantly reduced the migration time from what would have been 7+ hours to under 1 hour for tables that contained more than 190m records and tables with several large JSON BLOB columns.
