Migrating and Converting Databases to Amazon Aurora RDS Postgres

Peter Tran
Simply Wall St
May 9, 2022

The Motivation

As our customer base and engineering team continue to grow at SimplyWallSt (SWS), our database architecture has had to keep evolving with them.

Back in SWS’s infancy in 2014, we built our MVP on a single-instance database. Over time, the database server instance size continued to grow as the product’s features and popularity grew. In early 2019, we moved to a master/replica 3-tier set-up to accommodate the high demand in user traffic and data processing. Fast-forward from 2019 to the present day in 2022, and SWS has seen a surge of growth in platform activity, customer base and the size of our internal engineering team. The database has again become a bottleneck, not because of high activity or user traffic, but because it limits future growth and auto-scaling.

As detailed in earlier blog posts, we have migrated our infrastructure from IBM Cloud to AWS. The last piece of our migration from IBM Cloud (and admittedly the most complex) was to move our main database from IBM bare-metal (dedicated) servers to managed AWS RDS instances.

To add to the migration complexity, we also wanted to convert the existing databases that weren’t running on Postgres over to the Postgres engine at the same time.

The motivation behind this is explained below.

The End Goal

The final design of our database looks like this:

  • Amazon Aurora RDS — PostgreSQL 13.3
  • Running on db.r6g.8xlarge instances
  • High-availability with auto-scaling Reader/Writer instances

While it would have been simpler to bring the platform completely offline to conduct the migration, we opted to set the database to read-only mode and keep most of our applications running so company reports are still accessible to public users. This reduces the financial impact on SWS and allows us to carry out the migration behind the scenes. The read-only state remains intact until our applications cut over to use the new database on Amazon RDS. See our other blog for details on how we orchestrated this.

Why The Move?

At SWS, we like to take a long-term outlook because our mission is to make investing accessible to everyone, regardless of demographic constraints. The engineering team unanimously agreed that, to position ourselves for success, we needed a database built for high scalability and growth.

To do this, we leverage managed services.

Some key issues that would be solved in our migration include:

  • Database Engine Fragmentation — We have fragmentation of database engines across our platform, which is more prevalent among our older/legacy services. The newer services provisioned in the last 12–18 months use Postgres, and our financial data provider has more recently also released support for Postgres. For the sake of developer experience and maintenance, we’re unifying on PostgreSQL as our relational database platform of choice.
  • Server Scalability — Our existing database servers were configured in a master/replica 3-tier server arrangement built for fault tolerance. We had 3x bare metal (dedicated) servers in a Quorum-based replication configuration.
    - As these servers are built on hardware with physical moving parts, performing maintenance or upgrading server resources requires downtime, with planned maintenance windows in a minimum 3-hour block and failover procedures carried out beforehand. We have also seen first-hand the danger of physical hardware failures (e.g corruption to RAID controllers, S.M.A.R.T. drive test failures). The longer the downtime, the riskier the partial outage becomes: our backups momentarily stop, synchronous commits stop, transaction logs begin to grow and it generally takes longer for the servers to re-synchronise transactions. The server specs were as follows:
Server: Dual Intel Xeon Gold 5120 (28 Cores, 2.20 GHz)
Operating System: Ubuntu Linux 18.04 LTS Bionic Beaver Minimal Install (64 bit)
RAM: 64 GB RAM
Hard Drive: 3.8TB SSD
Secondary HDD: 4TB SATA (for backups)
  • Cost — When we compared the computational power of our existing bare-metal dedicated servers to db.r6g.8xlarge instances on AWS, the cost of operating Amazon RDS Postgres in an equivalent configuration was comparable to our existing set-up, so there was no negative cost impact. We did give up some raw computational power; the trade-off is a slightly higher-utilised RDS instance at normal traffic levels, but the service is managed and auto-scaling instances kick in for any unexpected surges of activity.
  • Maintenance and Expertise — The majority of our current engineers have experience with Postgres, and the landscape has changed since our early MVP days, so it makes sense to adopt Postgres as our primary platform. Postgres also works well with our container images, given there’s proper PDO driver support, and it makes complex cross-server SQL queries possible.

The Strategy

In this article, we’ll focus solely on the database migration process itself.

We scheduled a 1-hour downtime period and stopped all write transactions while the migration took place.

Overall, the process looks like this:

  • Pre-Preparation Work
  • Utilising A Migration Script
  • Setting Up A Safe, Testable Environment
  • Schedule Maintenance Downtime
  • Amazon DMS — Full Data Migration Load
  • Debugging and Troubleshooting Tips

On-going Replication Strategy Challenges

Initially, we did consider a highly-available migration with ongoing data replication, where writes to the database remain supported, but this was met with the following challenges:

  • Transaction logs can grow infinitely — We use a backup model at SWS that relies on transaction logs, which record every transaction on the database so we can restore to any given point in time. Transaction logs can grow infinitely if backups are not performed on a regular basis. They normally shrink only when a backup is performed, and if replication is turned on, all subscribers need to send acknowledgements before the transaction logs can shrink.
  • On-going replication is complex to set up — On the source database, this required us to set up a distribution database, publishers and subscribers, which generally rely on transaction logs to replicate transactions to the subscribing (or target) servers. This presents a challenge given how our database architecture is currently configured:
    - Master/slave replication interferes with ongoing data replication — Our current configuration performs synchronous commits to maintain a master/replica quorum, so that if a failover occurs another server can take over as the master. When a failover happens due to unexpected downtime on our master server, transaction logs begin to grow because our backups are interrupted, and they keep growing until the main server is restored to its master status and all servers are synchronised and back to a healthy state.
  • Disk space is finite — Our existing set-up usually runs with ~40% free disk space on the master database server. Given that transaction logs can grow infinitely, and given the amount of data processing incurred to analyse companies daily, long migration runs risk filling the server’s disk to 100%. Beyond this point it is extremely difficult to recover, as we would be hard-pressed to find enough free disk space to run an ad-hoc database backup and shrink the transaction logs.

With the above complexities in mind, we opted to introduce a small window of downtime that stops all writes to the source and target databases until the migration is complete.

Pre-Preparation Work

To speed up the process and promote concurrent data migration tasks, we’ve grouped related database tables into ‘domains’, otherwise known as migration tasks. This also makes testing and validating each subset of data easier. These domains include:

SimplyWallSt internal business domains

We also took this opportunity to do some maintenance tidy up by leaving behind old tables that are no longer used or referenced in code.

Utilising a migration script

We opted to build a migration script to facilitate the migration process. The script lives in a git version-controlled project. Building a script to perform the migration allowed us to:

  • Produce repeatable and predictable migration test runs that can be replayed over and over again — this allowed us to fine-tune the script itself and improve data load speeds over time;
  • Make it easy to perform repetitive tasks, such as wiping the slate clean, which cascade-drops existing tables ready for a new migration run;
  • Enable multiple users to contribute to the script, giving each user their own isolated environment to test the migration in; and
  • Maintain a history of minor changes we’re making to the migration script, easily reverting any problematic changes.
Our database migration script

Software Requirements

The migration instance requires the following to run:

# Install base dependencies
sudo apt-get update
sudo apt-get install -y unzip postgresql-client sbcl libsqlite3-dev make curl gawk freetds-dev libzip-dev

# Install Postgres Client
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt update
sudo apt install postgresql-client-13
pg_dump --version

# Install PgLoader
cd /tmp
wget https://github.com/dimitri/pgloader/releases/download/v3.6.2/pgloader-bundle-3.6.2.tgz
tar xvzf pgloader-bundle-3.6.2.tgz
cd pgloader-bundle-3.6.2
sudo apt install make build-essential sbcl unzip
make
./bin/pgloader --version
sudo cp ./bin/pgloader /usr/local/bin/pgloader
cd /tmp
rm -rf /tmp/pgloader-bundle-3.6.2
rm /tmp/pgloader-bundle-3.6.2.tgz

# Install AWS CLI v2
cd /tmp
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --bin-dir /usr/local/bin --install-dir /usr/local/aws-cli
/usr/local/bin/aws --version
rm -rf /tmp/aws
rm /tmp/awscliv2.zip

Set up DMS resources and populate environment settings

The Amazon Resource Names (ARNs) for the resources created below were used to populate a .env file consumed by the migration script.

Create DMS Resources

  1. Create a DMS replication instance
  2. Create a DMS source database endpoint
  3. Create a DMS target database endpoint
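
These resources can be created in the AWS console or with the AWS CLI. As a rough sketch only — the identifiers, instance class, storage size and connection details below are placeholders, and the source engine name should match whatever engine your source database runs on:

# 1. Replication instance (identifier, class and storage are placeholders)
aws dms create-replication-instance \
    --replication-instance-identifier <replication-instance-name> \
    --replication-instance-class dms.r5.2xlarge \
    --allocated-storage 200

# 2. Source database endpoint
aws dms create-endpoint \
    --endpoint-identifier <source-endpoint-name> \
    --endpoint-type source \
    --engine-name <source-engine-name> \
    --server-name <source-db-host> \
    --port <source-db-port> \
    --username <source-db-username> \
    --password <source-db-password> \
    --database-name <source-db-name>

# 3. Target database endpoint (Aurora PostgreSQL)
aws dms create-endpoint \
    --endpoint-identifier <target-endpoint-name> \
    --endpoint-type target \
    --engine-name aurora-postgresql \
    --server-name xxx.cluster-xxx.ap-southeast-2.rds.amazonaws.com \
    --port 5432 \
    --username <target-db-username> \
    --password <postgres-password> \
    --database-name <target-db-name>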

Populate .env file

We have a .env file that holds the AWS ARNs and database connection details needed to perform the load tasks:

REPLICATION_INSTANCE=arn:aws:dms:ap-southeast-2:xxx:rep:xxx
SOURCE_ENDPOINT_ARN=arn:aws:dms:ap-southeast-2:xxx:endpoint:xxx
SOURCE_ENDPOINT_IDENTIFIER=<source-endpoint-name>
TARGET_ENDPOINT_ARN=arn:aws:dms:ap-southeast-2:xxx:endpoint:xxx
TARGET_ENDPOINT_IDENTIFIER=<target-endpoint-name>
SOURCE_DB_USERNAME=<source-db-username>
SOURCE_DB_HOST=<source-db-host>
SOURCE_DB_NAME=<source-db-name>
SOURCE_DB_PASSWORD=<source-db-password>
TARGET_DB_USERNAME=<target-db-username>
TARGET_DB_HOST=xxx.cluster-xxx.ap-southeast-2.rds.amazonaws.com
TARGET_DB_NAME=<target-db-name>
PGPASSWORD=<postgres-password>
SCHEMA=public

These allow us to connect to our existing resources to run load tasks.

Each individual contributor can have their own namespace to work with by changing the SCHEMA environment variable to their own name (e.g petert).

Migration Script

We have a bash script that we run with ./run-dms-task.sh:

#!/bin/bash
set -e
TASK=$(echo $1 | tr '[:upper:]' '[:lower:]')
START=false
RESUME=false
STOP=false
SLATE=false
ENV_FILE="./.env"
prepare_premigration () {
    JSONB_CONVERSION_FILE="./tasks/$TASK/jsonb-conversion.sql"
    if test -f "$JSONB_CONVERSION_FILE"; then
        echo "[SQL] Convert existing text type JSON columns to JSONB..."
        psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$(envsubst < $JSONB_CONVERSION_FILE)"
    fi

    DROP_INDEX_KEYS_FILE="./tasks/$TASK/drop-keys.sql"
    if test -f "$DROP_INDEX_KEYS_FILE"; then
        echo "[SQL] Dropping all index and unique keys..."
        DROP_INDEX_KEYS_SQL=$(psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -tA -c "$(envsubst < $DROP_INDEX_KEYS_FILE)")
        psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$DROP_INDEX_KEYS_SQL"
    fi

    DROP_FOREIGN_KEYS_FILE="./tasks/$TASK/drop-foreign-keys.sql"
    if test -f "$DROP_FOREIGN_KEYS_FILE"; then
        DROP_FOREIGN_KEYS_SQL=$(psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -tA -c "$(envsubst < $DROP_FOREIGN_KEYS_FILE)")
        if [ ! -z "$DROP_FOREIGN_KEYS_SQL" ]
        then
            echo "[SQL] Dropping foreign keys..."
            psql -h ${TARGET_DB_HOST} -U ${TARGET_DB_USERNAME} ${TARGET_DB_NAME} -a -c "$DROP_FOREIGN_KEYS_SQL"
        fi
    fi
}
if [ -f "$ENV_FILE" ]
then
    export $(cat .env | xargs)
else
    echo "Missing '.env' file. Run 'cp .env.dist .env' and populate with environment variables"
    exit 1
fi
# PgLoader will pick up any load options we specify in the PGOPTIONS environment variable
export PGOPTIONS="--search_path=$SCHEMA"
args=( "$@" )
for arg; do
    case $arg in
        "--start")
            START=true
            ;;
        "--resume")
            RESUME=true
            ;;
        "--stop")
            STOP=true
            ;;
        "--slate")
            SLATE=true
            ;;
    esac
done
if [ "$START" = true ] && [ "$RESUME" = true ]; then
echo "Cannot start and resume at the same time. Choose one option only"
exit 1;
fi
SCHEMA_EXISTS=$(psql -h "$TARGET_DB_HOST" -U "$TARGET_DB_USERNAME" "$TARGET_DB_NAME" -tA -c "SELECT EXISTS (SELECT FROM information_schema.tables WHERE table_schema = '$SCHEMA')")
if [ "$SLATE" = true ]; then
PGLOADER_FILE="./tasks/$TASK/pgloader.cmd"
if test -f "$PGLOADER_FILE"; then
echo "[DELETE] TARGET schema. Start with a CLEAN SLATE"
SLATE_SQL=$(envsubst < "./tasks/$TASK/slate.sql")
psql -h "$TARGET_DB_HOST" -U "$TARGET_DB_USERNAME" "$TARGET_DB_NAME" -tA -c "$SLATE_SQL"
pgloader "./tasks/$TASK/pgloader.cmd"
fi
prepare_premigration
elif [ "$SCHEMA_EXISTS" = "f" ]; then
echo "[CREATE] TARGET schema"
pgloader "./tasks/$TASK/pgloader.cmd"
prepare_premigration
fi
# Check if replication task exists
echo "[LOOKUP] Check if existing replication tasks exists"
DMS_REPLICATION_TASK_ARN=$(aws dms describe-replication-tasks --output text --query "ReplicationTasks[?(ReplicationTaskIdentifier=='$SCHEMA-$TASK')].["ReplicationTaskArn"]");
CREATE_REPLICATION_TASK=false
if [ -z "$DMS_REPLICATION_TASK_ARN" ]
then
    CREATE_REPLICATION_TASK=true
elif [ "$SLATE" = true ]; then
    aws dms delete-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN > /dev/null
    SLEEP_SECONDS=20
    echo "[DELETE] Replication Task. Waiting $SLEEP_SECONDS seconds for resource to delete..."
    sleep $SLEEP_SECONDS
    CREATE_REPLICATION_TASK=true
fi
if [ "$CREATE_REPLICATION_TASK" = true ]; then
TARGET_ENDPOINT_ARN=$(aws dms describe-endpoints --output text --query "Endpoints[?(EndpointIdentifier=='$TARGET_ENDPOINT_IDENTIFIER')].[EndpointArn]")
SOURCE_ENDPOINT_ARN=$(aws dms describe-endpoints --output text --query "Endpoints[?(EndpointIdentifier=='$SOURCE_ENDPOINT_IDENTIFIER')].[EndpointArn]")
mapping=$(envsubst < "./tasks/$TASK/dms-table-mapping.json")
replicationtasksettings=$(envsubst < "./tasks/default-dms-task-settings.json")
aws dms create-replication-task \
--replication-task-identifier $SCHEMA-$TASK \
--source-endpoint-arn $SOURCE_ENDPOINT_ARN \
--target-endpoint-arn $TARGET_ENDPOINT_ARN \
--replication-instance-arn $REPLICATION_INSTANCE \
--migration-type full-load \
--table-mappings "$mapping" \
--replication-task-settings "$replicationtasksettings" > /dev/null
echo "[CREATED] Replication Task '$SCHEMA-$TASK'"
fi
DMS_REPLICATION_TASK_ARN=$(aws dms describe-replication-tasks --output text --query "ReplicationTasks[?(ReplicationTaskIdentifier=='$SCHEMA-$TASK')].["ReplicationTaskArn"]")

if [ "$START" = true ]; then
    echo "[START] Replication Task"
    aws dms start-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN \
        --start-replication-task-type start-replication > /dev/null
elif [ "$RESUME" = true ]; then
    echo "[RESUME] Replication Task"
    aws dms start-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN \
        --start-replication-task-type resume-processing > /dev/null
elif [ "$STOP" = true ]; then
    echo "[STOP] Replication Task"
    aws dms stop-replication-task \
        --replication-task-arn $DMS_REPLICATION_TASK_ARN > /dev/null
fi
echo "[DONE]"

Migration script sequence

The migration script performs the following processes in order:

  1. Convert Schema and Create an AWS DMS Task
    - Drop all primary indexes, unique indexes and foreign key constraints on the target database
    - Convert existing json_array type columns from varchar to JSONB on the target database
  2. Start the full load AWS DMS replication task
  3. Restore foreign constraints
  4. Re-sync incrementing sequence numbers
  5. Restore indexes

1. Convert Schema and Create an AWS DMS Task

Setting up a migration task consists of:

  • Creating the schemas on the target database; and
  • Creating an AWS DMS replication task

These two steps are performed by running

./run-dms-task.sh <task-folder-name>
# e.g ./run-dms-task.sh events
Task folder with migration commands
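
Each task folder contains the files the script looks for: pgloader.cmd, slate.sql, drop-keys.sql, drop-foreign-keys.sql, jsonb-conversion.sql and dms-table-mapping.json. As a rough sketch, a dms-table-mapping.json for a domain whose tables share a common prefix could look like the following — the source schema name, table prefix and rule names are placeholders, and ${SCHEMA} is substituted by envsubst in the script:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-domain-tables",
      "object-locator": {
        "schema-name": "<source-schema-name>",
        "table-name": "<table-prefix>%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "rename-schema",
      "rule-target": "schema",
      "object-locator": {
        "schema-name": "<source-schema-name>"
      },
      "rule-action": "rename",
      "value": "${SCHEMA}"
    }
  ]
}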

During this run, the PgLoader command will be executed. An example of a PgLoader config we used during the migration looks like this:

LOAD DATABASE
    FROM drivername://{{SOURCE_DB_USERNAME}}:{{SOURCE_DB_PASSWORD}}@{{SOURCE_DB_HOST}}/{{SOURCE_DB_NAME}}
    INTO postgresql://{{TARGET_DB_USERNAME}}:{{PGPASSWORD}}@{{TARGET_DB_HOST}}/{{TARGET_DB_NAME}}

WITH include drop, create tables, create indexes, reset no sequences,
     foreign keys, preserve index names, schema only

INCLUDING ONLY TABLE NAMES LIKE 'companyanalysis%' in schema 'default'

CAST type numeric when (= precision 28) to numeric keep typemod,
     type datetime2 to timestamp drop default drop not null using zero-dates-to-null

ALTER SCHEMA 'default' RENAME to '{{SCHEMA}}'

AFTER LOAD DO
    $$ DROP EXTENSION IF EXISTS "uuid-ossp"; $$,
    $$ CREATE extension "uuid-ossp"; $$;

* Note that we intentionally drop all indexes and key constraints after the schema is created on the target database to maximise data loading throughput, as per AWS DMS best practices.

The SQL to drop the indexes is generated by running:

select format('DROP INDEX IF EXISTS %I.%I CASCADE;', schemaname, indexname) as drop_statement
from pg_indexes
where schemaname = '${SCHEMA}'
AND (tablename LIKE '<table-prefix>%')
AND indexname NOT IN (
SELECT con.conname
FROM pg_catalog.pg_class c
INNER JOIN pg_catalog.pg_constraint con ON c.oid = con.conrelid
INNER join pg_catalog.pg_namespace n on n.oid = c.relnamespace)
ORDER BY indexname;

The SQL to drop foreign constraints is generated by running:

select format('ALTER TABLE %I.%I DROP CONSTRAINT %I;', table_schema, table_name, constraint_name)
from information_schema.table_constraints
where table_schema = '${SCHEMA}'
AND constraint_type = 'FOREIGN KEY'
AND table_name LIKE '<table-prefix>%';

* We additionally want to take advantage of JSONB when we migrate to PostgreSQL for efficient parsing and querying of JSON data:

ALTER TABLE ${SCHEMA}.<table>
ALTER COLUMN <column> TYPE JSONB USING <column>::jsonb;
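
Once a column is JSONB, it can be indexed and queried natively. A purely illustrative example — the table, column and JSON key below are hypothetical:

-- Hypothetical: index the JSONB column and run a containment query against it
CREATE INDEX IF NOT EXISTS <table>_<column>_gin_idx
    ON ${SCHEMA}.<table> USING gin (<column>);

SELECT *
FROM ${SCHEMA}.<table>
WHERE <column> @> '{"currency": "USD"}';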

2. Start the full load AWS DMS replication task

Starting the migration is as simple as running:

./run-dms-task.sh <task-folder-name> --start
# e.g ./run-dms-task.sh events --start

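While the full load runs, its progress can be polled with the AWS CLI. A minimal sketch, using the same $SCHEMA-$TASK identifier convention as the script:

# Check the status and full-load progress of the replication task
aws dms describe-replication-tasks \
    --filters Name=replication-task-id,Values=$SCHEMA-$TASK \
    --query "ReplicationTasks[0].{Status:Status,Progress:ReplicationTaskStats.FullLoadProgressPercent,TablesLoaded:ReplicationTaskStats.TablesLoaded,TablesErrored:ReplicationTaskStats.TablesErrored}"
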
3. Restore Foreign Constraints

This is a series of straightforward SQL commands we run to restore the existing foreign constraints, e.g.

ALTER TABLE ${SCHEMA}.<table> ADD CONSTRAINT <foreign-key-id> FOREIGN KEY (<column>)
REFERENCES ${SCHEMA}.<table> (<column>) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION;
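
As an alternative to writing these statements by hand, the definitions can be captured from the PostgreSQL catalogs before the constraints are dropped and replayed after the load completes. A sketch:

-- Generate ALTER TABLE ... ADD CONSTRAINT statements for every foreign key in the schema
SELECT format('ALTER TABLE %I.%I ADD CONSTRAINT %I %s;',
              n.nspname, c.relname, con.conname, pg_get_constraintdef(con.oid))
FROM pg_catalog.pg_constraint con
INNER JOIN pg_catalog.pg_class c ON c.oid = con.conrelid
INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE con.contype = 'f'
  AND n.nspname = '${SCHEMA}';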

4. Re-Sync Incrementing Sequence Numbers

On PostgreSQL, we can re-sync the sequence numbers to match the maximum ID in each table:

SELECT setval('${SCHEMA}.<table>_id_seq', max(id)) FROM ${SCHEMA}.<table>;
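
To avoid writing one statement per table, the same statements can be generated from the catalog, assuming the auto-incrementing column is named id (adjust to suit your tables):

-- Generate a setval statement for every table whose id column is backed by a sequence
SELECT format('SELECT setval(%L, COALESCE(MAX(id), 1)) FROM %I.%I;',
              pg_get_serial_sequence(format('%I.%I', schemaname, tablename), 'id'),
              schemaname, tablename)
FROM pg_tables
WHERE schemaname = '${SCHEMA}'
  AND pg_get_serial_sequence(format('%I.%I', schemaname, tablename), 'id') IS NOT NULL;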

5. Restore Indexes

This is a series of straightforward SQL commands we run to restore the existing indexes, e.g.

CREATE INDEX IF NOT EXISTS <index-id> ON ${SCHEMA}.<table> USING btree (<column>);
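
As with the foreign keys, the CREATE INDEX statements can be captured from pg_indexes before the indexes are dropped and replayed once the load completes, e.g.:

-- Capture the full definition of every index on the tables being migrated
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname = '${SCHEMA}'
  AND tablename LIKE '<table-prefix>%';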

Conclusion

What gave us the most confidence to conduct the database migration was being able to perform the migration tasks over and over again in a controlled environment, wiping the slate clean every time.

The more practice runs we did, the more we were able to tweak the migration script, improving the migration speed and our overall confidence.

While AWS DMS is able to perform schema conversions as part of the DMS replication task, it was not flexible enough for the amount of post-conversion follow-up changes we needed to make.

We decided to perform the schema conversions using PgLoader because it gave us greater control over the conversion and the ability to run AFTER LOAD SQL statements, allowing us to alter columns on the target database on the fly, create custom triggers, functions and views, and load in extensions. PgLoader didn’t give us enough flexibility in excluding certain indexes and constraints on the fly, so we migrated these by hand as a follow-up action in our migration script.

As per AWS DMS best practices, Amazon recommends dropping indexes and foreign constraints during the full load, so we baked this into our script and re-built the indexes and constraints afterwards. This significantly reduced the migration time from what would have been 7+ hours to under 1 hour for tables that contained more than 190m records and tables with several large JSON BLOB columns.
