Cloud SQL for PostgreSQL Optimization during Migration using Database Migration Service.

Somdyuti
Google Cloud - Community
6 min read · Apr 10, 2024
DMS now performs Parallel Full Load and Parallel CDC from PostgreSQL

Database Migration Service (DMS) makes it easier for you to migrate your data to Google Cloud. This service helps you lift and shift your MySQL and PostgreSQL workloads into Cloud SQL and AlloyDB for PostgreSQL. In addition, you can lift and modernize your Oracle workloads into Cloud SQL for PostgreSQL or AlloyDB for PostgreSQL.

In this document we will discuss how you can optimize the DMS Initial Load and CDC phases when migrating to a Cloud SQL for PostgreSQL instance. The source can be either Oracle or PostgreSQL.

The Cloud SQL for PostgreSQL Parameters

The suggested parameters need to be properly tested and verified against the chosen target Cloud SQL instance type. They are intended for the migration window and should not be left set in your actual production workload.

1. max_wal_size = 20GB

This ensures that database checkpoints happen only after 20GB worth of WAL data has been generated. If generating 20GB of WAL takes longer than 5 minutes, checkpoints will instead happen every 5 minutes, as governed by the checkpoint_timeout setting.

During a DMS load with the default max_wal_size, which is 1.5GB for the Cloud SQL PostgreSQL Enterprise edition and 5GB for Enterprise Plus, checkpoints happen every few seconds, which increases I/O and CPU usage. A higher value reduces the checkpoint frequency, which in turn reduces the I/O footprint.

Also monitor the “WALWrite” wait event in the Cloud SQL Console System Insights. Be aware that WALWrite can appear with both an LWLock and an IO event_type. With frequent checkpoints it manifests as the LWLock event_type, because the checkpointer (CKPT) process waits for a lock on the WAL segments to write the checkpoint record (the Checkpoint_change#, akin to Oracle's redo latches). With a high commit rate, WALWrite manifests as an IO event_type, where the WAL writer is busy writing changes from the WAL buffers to the WAL files.

There can also be waits on the “DataFileWrite” and “DataFileFlush” events during very frequent and aggressive checkpoints.
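As a quick sanity check, you can compare timed checkpoints against those forced by WAL volume in pg_stat_bgwriter (valid through PostgreSQL 16; in PostgreSQL 17 these counters moved to pg_stat_checkpointer). A minimal sketch:

```sql
-- Checkpoints triggered by checkpoint_timeout vs. by WAL volume.
-- A high share of requested checkpoints suggests max_wal_size is too
-- small for the write rate of the load.
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req
             / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;
```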

2. commit_delay = 1000 (start with this and go up to 50000)

commit_delay sets a delay, in microseconds, between a transaction commit and the flushing of WAL to disk. It helps improve transaction throughput by effectively batching commits during bulk inserts: the WAL flush, which by default happens at every transaction commit, is delayed by 1,000 microseconds so that other concurrent commits can share the same flush, provided the load is high enough to accumulate more transactions within that window (which will be the case during the DMS initial load).

Monitor the “WALSync” and “WALWrite” wait events in System Insights, which appear as IO event_types for waits related to a high commit rate, and also the “Transaction count” metric in System Insights.
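To verify that the delay is actually batching flushes, one approach is to sample pg_stat_wal (available from PostgreSQL 14) before and after changing commit_delay; if batching is working, wal_sync should grow more slowly relative to the transaction count. A sketch, assuming track_wal_io_timing is on for the timing columns:

```sql
-- WAL write/sync activity since the last stats reset (PostgreSQL 14+).
SELECT wal_records,
       wal_bytes,
       wal_write,       -- number of times WAL buffers were written to disk
       wal_sync,        -- number of times WAL files were fsync'ed
       wal_write_time,  -- ms spent writing (requires track_wal_io_timing = on)
       wal_sync_time    -- ms spent syncing (requires track_wal_io_timing = on)
FROM pg_stat_wal;
```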

3. wal_buffers = 32–64 MB on 4 vCPU machines and 64–128 MB on 8–16 vCPU machines. It can be set as high as 256MB for targets with more vCPUs.

Smaller wal_buffers force more frequent WAL flushes (the buffer is written out whenever it fills), so increasing the value helps during the initial load.

Again, monitor the wait events mentioned in (2) above.
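pg_stat_wal also exposes a direct signal for this particular knob: wal_buffers_full counts how many times WAL had to be written out because the buffer ran out of space. A minimal sketch:

```sql
-- How often a backend wrote WAL because wal_buffers was full (PostgreSQL 14+).
-- Sample this twice during the load; a fast-growing delta suggests
-- wal_buffers is undersized for the write rate.
SELECT wal_buffers_full FROM pg_stat_wal;
```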

4. Parallelism: since PostgreSQL does not support parallel DML, bulk inserts will not benefit from query-level parallelism.

5. autovacuum: turn it off

Note: after the Initial Load is complete, make sure autovacuum is turned back on, but only after running a manual vacuum.

Run that manual vacuum before releasing the database for actual production usage, and set the following to make it fast, as it will have a lot of work to do the first time (see the sketch after these settings).

max_parallel_maintenance_workers = 4 (set it to the number of vCPUs of the Cloud SQL instance)

maintenance_work_mem = 10GB

Note that manual vacuum will take memory from maintenance_work_mem.

Subsequently, to make autovacuum faster, set autovacuum_work_mem to 1GB; otherwise autovacuum workers will consume memory from maintenance_work_mem.
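A minimal sketch of that post-load sequence, assuming a 4 vCPU instance (the SETs below are session-level; instance-wide values on Cloud SQL are configured as database flags, which is also where autovacuum is switched back on):

```sql
-- One-time manual vacuum after the initial load, before cut-over.
SET max_parallel_maintenance_workers = 4;  -- roughly the vCPU count
SET maintenance_work_mem = '10GB';         -- manual VACUUM draws from this

-- PostgreSQL 13+ vacuums indexes in parallel, up to the limit set above.
VACUUM (ANALYZE, VERBOSE);

-- Then re-enable autovacuum (on Cloud SQL: reset the autovacuum flag)
-- and set autovacuum_work_mem = '1GB' so its workers do not draw from
-- maintenance_work_mem.
```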

From a Cloud SQL PostgreSQL database parameter perspective, the main things to tune during the DMS Initial Load (and in general for any bulk load operation) are checkpoints and commits, as they significantly affect I/O and, to an extent, CPU.

6. The recommendation below applies specifically when the source is PostgreSQL, as DMS now supports Parallel Full Load and Parallel CDC when migrating from PostgreSQL to Cloud SQL PostgreSQL or AlloyDB (faster PostgreSQL migrations).

The following parameter settings help optimize the Initial Data Copy and CDC when using PGMS (PostgreSQL Multiple Subscriptions).

In the source PostgreSQL database:

max_replication_slots: Set it to at least 20. It must be at least the number of subscriptions expected to connect, which is at most 10 for the cluster when DMS parallelism is configured to Maximum (up to 4 subscriptions per database), plus some reserve for table synchronization.

max_wal_senders: Set it to a higher value, preferably 20, and at least the same as max_replication_slots. This controls the maximum number of concurrent connections from the target Cloud SQL PostgreSQL instance. With DMS parallelism configured to Maximum, there can be 4 subscriptions created per database, with a maximum of 10 subscriptions for the PostgreSQL cluster.
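On a self-managed source, a minimal sketch of these changes (both parameters require a restart to take effect; on a managed source, use its flag mechanism instead):

```sql
-- On the source PostgreSQL (self-managed); both changes need a restart.
ALTER SYSTEM SET max_replication_slots = 20;
ALTER SYSTEM SET max_wal_senders = 20;

-- Once the DMS job is running, verify slot usage and sender connections:
SELECT slot_name, active FROM pg_replication_slots;
SELECT application_name, state FROM pg_stat_replication;
```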

In the target Cloud SQL PostgreSQL instance, assuming it has enough vCPUs and memory available:

max_worker_processes: Set it to the number of vCPUs in the target.

max_replication_slots: Set it to 20. It must be at least the number of subscriptions that will be added to the subscriber, which can be up to 10, plus some reserve for table synchronization.

Even with PGMS (PostgreSQL Multiple Subscriptions), when a subscription is initialized there can be only one synchronization worker per table, which means a single table cannot be copied in parallel. Tables are copied in parallel across the subscriptions/replication sets.

max_logical_replication_workers and max_sync_workers_per_subscription will not affect DMS parallelism, as these parameters influence native logical replication, while DMS uses pglogical.
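To watch how the parallel subscriptions keep up during the full load and CDC, a hedged check on the source using core catalog views:

```sql
-- Per-slot replication lag in bytes, as seen from the source.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replication_lag
FROM pg_replication_slots;
```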

7. This recommendation is very specific to migrating from an Oracle source that has many large LOB segments, where the target Cloud SQL PostgreSQL or AlloyDB instance is on version 14 or above. To make the initial load up to 3x faster, change default_toast_compression on the target Cloud SQL PostgreSQL or AlloyDB instance to LZ4.

CLOBs and BLOBs in Oracle are converted to TEXT and BYTEA respectively in PostgreSQL. If the LOBs are large, it is extremely likely that the tuple/row size will exceed 2KB, and the values will be spilled to TOAST segments (stored out-of-line) in PostgreSQL, in the pg_toast schema as pg_toast_<OIDoftable>. TOAST data is compressed and decompressed as it is inserted and queried. The default compression method in PostgreSQL is PGLZ, which is CPU-intensive and not as performant as LZ4, available from PostgreSQL 14 onwards. With LZ4, SELECT speed is close to that of uncompressed data, and data insertion is up to 80% faster compared to PGLZ.
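A minimal sketch, assuming PostgreSQL 14+ on the target (on Cloud SQL the instance-wide default is set through the default_toast_compression database flag; the table and column names below are hypothetical):

```sql
-- Use LZ4 for TOAST compression of newly inserted values (session level;
-- set the database flag for an instance-wide default).
SET default_toast_compression = 'lz4';

-- Or pin it per column (hypothetical table/column):
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;

-- Verify which method stored values actually use (PostgreSQL 14+):
SELECT pg_column_compression(body) FROM documents LIMIT 5;
```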

Target Cloud SQL PostgreSQL Instance Sizing and Storage

The more resources you give to the target Cloud SQL instance, the better DMS will perform.

Three factors matter here: network throughput, disk throughput, and disk IOPS. Network throughput is limited by the number of vCPUs: we get 250MBps of network throughput per vCPU, and disk throughput (0.48MBps per GB) is in turn capped by network throughput. For disk IOPS we get 30 IOPS per GB.

So the correct instance size, along with the right storage size, will help improve DMS Initial Load performance. In general, DMS needs high IOPS and decent disk throughput, so configure your disk size such that as much of the network throughput bandwidth as possible can be used for disk throughput (most of the network bandwidth consumed is between the database VM and the underlying storage).

Take for example a 4 vCPU Cloud SQL Enterprise instance: you get 1000MBps of network throughput. If you allocate a 600GB disk, you get disk throughput of about 288MBps (600 × 0.48) and 18,000 IOPS (600 × 30). (This does not take your database size into account; of course you need to allocate more storage than your database size.)

So do not size the initial storage based only on the source database size; take into account the throughput and IOPS requirements of the workload.

You can always reduce storage later, either through a request to the Google Support team or via self-service storage shrink, which is in Preview at the time of writing. The target Cloud SQL instance can be scaled down before the application cut-over.

A Few More Tips

Do not create a regional Cloud SQL instance during the migration. Enable High Availability, if you need it, after the migration is done and before the application cut-over.

Do not enable Automated Backups during the migration.

DMS does not create secondary indexes and constraints during the Initial Load; it creates them after the Initial Load completes and before CDC starts.

Install the pg_wait_sampling extension, which helps diagnose wait events related to slow PostgreSQL performance during the migration and even after the production cut-over. Query pg_stat_bgwriter and pg_stat_wal for information on checkpoints and commits that can be used for further diagnosis. Enable log-based alerts and log-based metrics related to frequent checkpoints.
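A short sketch of that diagnostic loop, assuming the pg_wait_sampling extension is available on your Cloud SQL version (it exposes a pg_wait_sampling_profile view of sampled wait events):

```sql
-- Top wait events sampled since the collector started.
CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;

SELECT p.event_type, p.event, sum(p.count) AS samples
FROM pg_wait_sampling_profile AS p
GROUP BY p.event_type, p.event
ORDER BY samples DESC
LIMIT 10;
```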


Somdyuti
Google Cloud - Community

A database learner for 21 years. DM Specialist at Google, helping customers migrate and modernize their databases to GCP databases: Cloud SQL, AlloyDB, Spanner.