Time-reduction battle while migrating databases from self-hosted PostgreSQL to Google Cloud SQL

This post is about the problems I encountered while migrating a PostgreSQL database as a whole, and how I dealt with them. It can also shed light on:
- Which different scenarios can I follow while migrating?
- How long does each scenario take?

In my previous post I explained why and how to migrate PostgreSQL databases to Google Cloud SQL, but in some cases the migration may take longer than expected. If you run into such a situation, this post lists the scenarios you can follow to shorten the migration time, based on my own case. In my case the time had to be shortened because write requests to the database would be blocked during the transfer, which would cause downtime.


Scenario 01: Just pg_dump and import

On the instance hosting the PostgreSQL database to be migrated (instance name: source-instance), we measured the duration of a straightforward export and import with the following commands:

Export:
pg_dump -h source-instance-IP -U user --no-owner --no-acl -d database > prod.sql
time: 1h9m

Import:
psql -h cloud-sql-instance-IP -U user -d database < prod.sql
time: 1h44m

source-instance’s machine type:

n1-standard-8 (8 vCPUs, 30 GB memory)
Boot disk: 40 GB Standard Persistent Disk
Additional disks: 1100 GB Standard Persistent Disk (the PostgreSQL data lives there.)

cloud-sql-instance’s machine type:

Custom (8 vCPUs, 40 GB memory)
Storage type: SSD
Storage capacity: 90 GB

In this first scenario, the elapsed time was far too long, for two reasons:
 1 - The database was large: the size of the dump file (prod.sql) was 54 GB.
 2 - The source-instance had a Standard Persistent Disk, so the export operation took a long time.
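
As a side note, the database's on-disk size can be checked up front to set expectations for the dump; a minimal sketch using standard catalog functions (same connection parameters as above):

# report the size of the connected database in human-readable form
psql -h source-instance-IP -U user -d database -c "SELECT pg_size_pretty(pg_database_size(current_database()));"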

Scenario 02: Table by table migration

We contacted the developers of the project the database belonged to and discussed whether a table-by-table migration was possible. After concluding that such a path could be followed, we started:
1 - The priority tables that would cause downtime would be determined.
2 - All tables except the priority tables would be migrated to the cloud-sql-instance in advance.
3 - The export and import durations of the priority tables would be measured with the commands below.
First, we exported the schema and imported it into cloud-sql-instance; we performed these operations on the source-instance.

Export:
pg_dump -h source-instance-IP -U user --schema-only --no-owner --no-acl -d database > prod.sql

Import:
psql -h cloud-sql-instance-IP -U user -d database < prod.sql
---
Export:
pg_dump -U user --data-only --table=.. --table=.. --table=.. --no-owner --no-acl -d database > parted-prod.sql

Import:
psql -h cloud-sql-instance-IP -U user -d database < parted-prod.sql

During the export, the partial migration did not go as expected; we encountered the following error:

pg_dump: NOTICE: there are circular foreign-key constraints on this table:
pg_dump:   some_table
pg_dump: You might not be able to restore the dump without using --disable-triggers or temporarily dropping the constraints.
pg_dump: Consider using a full dump instead of a --data-only dump to avoid this problem.

We had to give up on this scenario because some tables had circular foreign-key constraints.
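
For completeness, the workaround the error message points at would look roughly like the sketch below. Note that the commands --disable-triggers emits require superuser rights on the target, which Cloud SQL does not grant, so we did not pursue it:

# hypothetical variant of the data-only export; table names elided as above
pg_dump -U user --data-only --disable-triggers --table=.. --table=.. --no-owner --no-acl -d database > parted-prod.sql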

Scenario 03: Dump and restore in parallel

In this scenario, we performed the measurements using the “-j” flag.

Export:
time pg_dump -h source-instance-IP -Fd -j 8 -U user --no-owner --no-acl -f ./database -d database
real 38m40.278s
user 27m34.348s
sys 0m54.162s

Import:
pg_restore -h cloud-sql-instance-IP -U user -j 8 --format=d -C -d database ./database
real 89m54.693s
user 4m25.757s
sys 0m38.483s

You can use pg_restore with the -j flag to restore a dump in parallel, and pg_dump supports the same flag for directory-format (-Fd) dumps. The value we passed to -j was the number of cores of the source-instance, so 8.
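
As a small convenience, the parallelism can be derived from the machine instead of being hard-coded; a sketch, assuming GNU coreutils' nproc is available:

# use as many dump jobs as there are cores on this machine
time pg_dump -h source-instance-IP -Fd -j "$(nproc)" -U user --no-owner --no-acl -f ./database -d database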

In conclusion, there were reasonable differences in duration, but the duration was still not short enough.

Scenario 04: Bigger instance

We created a new instance with an SSD Persistent Disk in Google Cloud Compute Engine and tried the export and import operations again with the following commands.

The export operation, performed in two different ways:

Exports:
time pg_dump -h source-instance-IP -Fd -j 16 -U user --no-owner --no-acl -f ./database -d database
real 38m28.932s
user 23m30.278s
sys 1m17.165s

time pg_dump -h source-instance-IP -U user -Fc -Z0 -c -d database > prod.sql
real 63m13.771s
user 1m25.059s
sys 3m24.400s

The import of dumps with directory format:

Import:
time pg_restore -h cloud-sql-instance -U user -j 16 --format=d -C -d database ./database
real 92m9.999s
user 4m6.753s
sys 0m53.415s

The export and import operations at the same time:

Export and import:
time pg_dump -h source-instance-IP -U user -Fc -Z0 -c -d database | pg_restore -Fc -h cloud-sql-instance -U user -d database
real 116m0.021s
user 2m40.365s
sys 3m44.818s
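
One caveat with this piped form: by default the shell reports only the exit status of pg_restore, so a failed pg_dump can go unnoticed. A minimal sketch, assuming bash:

# fail the whole pipeline if either side fails
set -o pipefail
pg_dump -h source-instance-IP -U user -Fc -Z0 -c -d database | pg_restore -Fc -h cloud-sql-instance -U user -d database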

The machine type of the instance created for this scenario:

n1-standard-16 (16 vCPUs, 60GB memory)
Boot disk: 100 GB SSD Persistent Disk

Although the results improved, this duration was still not acceptable.

Scenario 05: Replica with Local SSD Scratch Disk

As mentioned earlier, the source-instance had a Standard Persistent Disk, so the export was inefficient and took too long. Therefore, we decided to create a replica instance in Google Cloud Compute Engine and perform the export and import operations on that replica.
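
For reference, a minimal sketch of how such a replica can be seeded with pg_basebackup; the replication user and data directory path here are hypothetical, and on PostgreSQL 12+ the -R flag writes the recovery settings into postgresql.auto.conf instead of recovery.conf:

# seed the replica's data directory from the master over the replication protocol;
# -X stream ships WAL during the copy, -R writes the standby recovery settings
pg_basebackup -h source-instance-IP -U replicator -D /mnt/local-ssd/pgdata -X stream -R -P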

We created the pgsql-slave-for-migration-instance using a Local SSD disk in Google Cloud Compute Engine, connected it to the master, and set up streaming replication. Once the replica was synchronized, we stopped replication and started the export.
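
A sketch of how the catch-up check and the stop can be done from psql on the replica; the function names are the PostgreSQL 10+ spellings (on 9.x the pause function is pg_xlog_replay_pause):

# how far behind the master is this replica?
psql -h localhost -U user -d database -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# freeze WAL replay at a consistent point before dumping
psql -h localhost -U user -d database -c "SELECT pg_wal_replay_pause();"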

Export and import:
time pg_dump -h localhost -Fc -Z0 -c -U user -d database | pg_restore -Fc -h cloud-sql-instance -d database -U user
real 89m30.740s
user 1m56.161s
sys 1m57.578s

We also tried the “-j” flag again and obtained the following results:

Export:
time pg_dump -h localhost -Fd -j 8 -U user --no-owner --no-acl -f ./database -d database
real 6m16.830s
user 28m52.922s
sys 0m42.274s

Import:
time pg_restore -h cloud-sql-instance -U user -j 8 --format=d -C -d database ./database
real 86m40.298s
user 4m1.859s
sys 0m31.487s

pgsql-slave-for-migration-instance’s machine type:

n1-standard-8 (8 vCPUs, 30 GB memory)
Boot disk: 40 GB SSD Persistent Disk
Additional disks: 375 GB Local SSD Scratch Disk (the PostgreSQL data directory is mounted there.)
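
Setting up that scratch disk is a one-time step; a rough sketch, assuming the Local SSD is attached over NVMe and appears as /dev/nvme0n1 (device names vary by interface and image), with the mount point chosen here as an assumption:

# format and mount the Local SSD before placing the replica's data directory on it
sudo mkfs.ext4 -F /dev/nvme0n1
sudo mkdir -p /mnt/local-ssd
sudo mount /dev/nvme0n1 /mnt/local-ssd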

Although the results improved, we were still far away from what we wanted in terms of duration.

Scenario 06: Reduction of database size

We contacted the project's developers again and reported that the database was too large and that unnecessary data should be cleaned up. Fortunately, there was some data that could be deleted. We truncated the related tables by connecting to PostgreSQL on the source-instance and running the following command:

TRUNCATE table_name;

This reduced the database size to 47 GB.
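
For reference, a sketch of the kind of query that helps spot the largest tables when deciding what to clean, using standard catalog functions:

# list the ten largest tables, including their indexes and TOAST data
psql -h source-instance-IP -U user -d database -c "
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;"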

Then, when we retried the shortest scenario so far, we obtained the following results:

time pg_dump -h localhost -Fc -Z0 -c database -U user | pg_restore -Fc -h cloud-sql-instance -d database -U user
real 70m37.053s
user 1m41.392s
sys 1m48.266s

The results were still not good enough because the duration was not less than 1 hour.

Scenario 07: The way of Google Cloud

We tested the export and import durations by following the path suggested in Google Cloud's own documentation, as I described in my previous post.

When we ran the commands on the pgsql-slave-for-migration-instance, we got the following results (the sed step comments out the EXTENSION statements, which the restricted Cloud SQL user cannot execute on import):

Export:
time pg_dump -h localhost -U user --format=plain --no-owner --no-acl database     | sed -E 's/(DROP|CREATE|COMMENT ON) EXTENSION/-- \1 EXTENSION/g' > prod.sql
real 9m46.771s
user 3m52.181s
sys 3m27.450s
(prod.sql size is 47GB.)
Sending to Google Cloud Storage:
time gsutil cp -r prod.sql gs://pgsql-master-dump/
real 7m29.318s
user 3m21.988s
sys 1m6.483s

Importing the database into cloud-sql-instance on Cloud SQL then took ~56m. The expected reduction in duration (<60m) did not occur in this scenario: together with the export (~10m) and the upload (~7m), the end-to-end process still exceeded an hour.
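
For completeness, the Cloud SQL import step above can also be triggered from the command line; a sketch, assuming the gcloud SDK is configured for the project and the Cloud SQL service account can read the bucket:

# ask Cloud SQL to pull the dump from Cloud Storage and import it into the target database
gcloud sql import sql cloud-sql-instance gs://pgsql-master-dump/prod.sql --database=database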

Scenario 08: Bigger storage capacity

We explained the situation to Google Cloud Support in a ticket and received the following response from a Cloud SQL specialist:

The import speed is being impacted by the Cloud SQL instance disk size as we could observe in our graphs for the persistent disk a high write throttling. To speed up you will actually need to have higher storage volume.

We then increased the storage capacity of cloud-sql-instance from 90 GB to 200 GB. This time, importing the database into cloud-sql-instance on Cloud SQL took ~37m.
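
The storage increase itself is a one-line operation; a sketch, assuming the gcloud SDK (note that Cloud SQL storage can be increased but never decreased):

# grow the instance's storage to 200 GB
gcloud sql instances patch cloud-sql-instance --storage-size=200GB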

When we then retried the shortest scenario so far (the 70m one), we got the following results:

time pg_dump -h localhost -Fc -Z0 -c database -U user | pg_restore -Fc -h cloud-sql-instance-IP -d database -U user
real    46m30.380s
user 1m59.097s
sys 1m40.243s

This is the result that we expected.

We identified and evaluated most of the available methods during the migration process, and after testing all of these scenarios we ended up well under the 1-hour target. Based on the monitoring system, we scheduled the migration for the day and time window when network traffic was lowest, and carried out the plan.
