Reduce DB upgrade downtime to less than 10 minutes using DMS on Google Cloud
Disclaimer: Views, thoughts, and opinions expressed in the blog belong solely to the author, and not necessarily to the author’s employer, organisation, committee or other group or individual.
This blog post is for you, if you are running Google Cloud SQL for critical applications and intend to reduce the downtime of upgrading the database. Database (DB) upgrades, most of the time are the requirements to fix bugs, adopt new features, compliance, most importantly, keep your application data secured and bring flexibility for applications to leverage the new features. At the times, if DB upgrades are not done well in-time, then it might lead to high impact issues. Often operation teams spend significant times to plan and test the upgrades, to ensure least downtimes and minimal impact on dependent apps and services. Scope of this blog shows how Google Cloud Data Migration Service (DMS) can reduce the downtimes of DB upgrades.
Google cloud DMS is serverless logical replication service which simplifies replication between a source and target database. Users can override complexity behind setting up replications such as Server, configuration management etc.
Considering the critical production environment, I’ll walk-through all the steps required to first create source env and then perform DB upgrades using DMS. Feel free to ignore the steps to commission the source env, while you can test and derive your strategy using the rest of steps. Here is the high level of summary for this blog post on -how DMS is used to upgrade a Cloud SQL PostgreSQL, however same strategy can also be used to upgrade the Google Cloud SQL MySQL DB instances:
- Create Source Database environment — Google Cloud SQL PostgreSQL
- Emulate application environment using CloudSQL proxy on GCE (Google Compute Engine) — This section also shows how to secure the network communication between Cloud SQL instance and Application tier.
- Configure DMS replication which implicitly creates Target Cloud SQL PostgreSQL Database Instance with intended DB version and kick-start the replications
- Validate DMS replication — Full and CDC (Change data capture)
- Cutover — transition to higher version Target DB and move app endpoint
- Cost consideration
- Best practices, known issues & limitations and additional references
1 — Create Source Database environment
1.1 — Assumptions and prerequisite
This section is optional when it comes to production/dev environments because you must be having source databases which you intend to upgrade. This section is helpful for those who want to test this procedure from scratch and understand all nuts and bolts for the entire process of upgrading the database using the replication methodology, DMS. Following are assumptions for this blog:
- You should already have Google cloud environment configured and operational, if not then follow-on https://cloud.google.com/gcp .
- While we are limiting the scope of this blog to PostgreSQL, however methodology recommended to upgrade the database is also applicable to Cloud SQL MySQL databases.
- Databases and applications are both run in a private network leveraging VPC private subnet.
- Since Databases need secure communications, all network communications are done using private IPs. for example: the communication between Database and Application tiers are done using private IPs.
- Database engine used is PostgreSQL, Source version is PostgreSQL 10 and intent is to upgrade the database to target PostgreSQL engine 13.
- Sample Database, Schema is derived from open source PostgreSQL community recommended sample database — “Airlines demo database”.
- Application tier prep work must have been accomplished such as any DB level procedures and compatibility test with respect to target database version.
- The GCP region considered in this blog is Mumbai — asia-south1 and not using default VPC which is not recommended to use in production and secured environments. Following screenshot shows that there is a VPC created with the name “sh-vpc” and a subnet in the Mumbai region as “subnet-mumbai”. Take a look at Subnet IP range configuration. Please remember the same network configuration is used in this blog.
1.2 — Create Source Cloud SQL PG (PostgreSQL) Database
In this section, we will create a Cloud SQL PostgreSQL database in Mumbai region, with minimal configurations which can imitate the source production database environment. In Cloud Console, go to the Cloud SQL page: Go to Cloud SQL and create Cloud SQL postgreSQL 10 database with configuration shown in following screenshots. Please record the password as this will be required later on steps.
Choose Machine type per your requirements. You might want to test and store more data than what is shown below, so choose per your requirements.
Please choose wisely your network configuration. With our assumptions and VPC configuration, we are limiting network configuration to private and un-checking Public option.
Once you are done with all the options for source database, create the instance:
After a while, the DB instance will be listed on the console as ready status. Record the private IP address allocated to the instance. This private IP address will be required in subsequent steps for DB connectivity. Remember, this has been allocated under the limits of the IP range configuration setup under VPC subnet configuration.
2 — Emulate Application environment with Cloud SQL proxy agent
2.1 — Setup GCE VM
In this section, we will set up the GCE VM in the same subnet Cloud SQL DB instance, this limits the scope of communication between client and database to single subnet as shown in following figure.
Compute engine is used as an application tier to connect to the database and this can also be assumed as a jump server to connect to database instances. In the cloud console, go to VM instance page and set up the VM as shown in the following screenshot. Check to select Mumbai as region and match Zone with database to prevent any latencies.
Next, choose the network as the private, similar to the configuration taken up for the database and disable the public network along with the ephemeral connections. This will ensure all the connection between this VM and database instances will only route through the private subnet and make communication more secure.
Wait a while after you choose to create the instance, it will appear as below on the VM instance console page.
For the ssh option, you can choose the “Open in browser window” option to login to VM instance.
After you login to ssh window, patch VM instance with latest version of Cloud SDK and install PostgreSQL client — psql — as shown below in screenshot:
sudo apt-get update && sudo apt-get install google-cloud-sdk
sudo apt -y install postgresql-client
2.2 Connect VM and database:
In this section, we will deploy Google Cloud SQL proxy agent as database endpoint and emulate database as local to application tier running on same VM. For this blog post, we are assuming psql application tier.
If you have not recorded database private IP, then go back to Cloud SQL cloud console and copy database instance private IP address as given as 10.45.193.4, use username and database provided during creation of database:
Use private IP on VM ssh to connect to DB as shown in screenshot.
psql -h 10.45.193.4 -U postgres postgres
We will emulate app environment database connectivity using cloud SQL proxy which provides secure connections, automatically encrypts traffic to and from the database using TLS with a 128-bit AES cipher, independent of database protocols and you won’t need to manage SSL certificates. It also uses IAM permissions to control who and what can connect to your Cloud SQL instances. The Cloud SQL Auth proxy runs as a local client and our client tool psql communicates with the Cloud SQL Auth proxy with the standard database protocol, which inturn uses a secure tunnel to communicate with its companion process running on the database server. Each connection established through the Cloud SQL Auth proxy creates one connection to the Cloud SQL instance. SQL proxy acts as DB endpoint for psql client application and this emulates our application environment. Next, let’s download and install the Cloud SQL proxy on GCE VM.
Download the Cloud SQL Auth proxy:
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
Make the Cloud SQL Auth proxy executable:
chmod +x cloud_sql_proxy
Connect database using Cloud SQL proxy:
go back to Cloud SQL cloud console and select your PG instance to check connection name and use same in cloud SQL proxy as shown in screenshot:
./cloud_sql_proxy -instances=<<connection name>>=tcp:5432 &
./cloud_sql_proxy -instances=shailesh-1:asia-south1:pg-db1=tcp:5432 &
Cloud SQL proxy is recommended to run as background service, however for simplicity it is shown here to run as background command. Now, connect to the database through Cloud SQL proxy using localhost IP in psql since cloud proxy is running and forwarding connection to database on default port 5432 on localhost. This method hides the network complexity from the application tier and imitates the database as local.
psql -h 127.0.0.1 -U postgres postgres
Once you make a connection, see below entry of connection in Cloud SQL proxy log.
Now, we will populate a sample database “demo”, schema “bookings” and a set of tables. It is downloaded from the open-source postgresql community page and sample DB is “Airline demo database”. We are deploying a bigger variant of a sample database which is more than 2.5 GB of data with schema structure consisting of complex data types such as “jsonb”, primary and forien key relationships as typical represent an OLTP database.
Run the extracted file as sql file connecting database using psql and cloudSQL proxy:
Once SQL script is completed, database “demo”, schema “bookings” and its set of tables are created. Let’s validate.
Count the number of rows of each tables created.
2–3 Configure DMS prerequisite on source DB
Go back to the Cloud SQL console page and select source DB “pg-db-1” and select Edit to change mandatory flags, this will enable logical replication and decoding for the instance.
Switch ON the logical_decoding and enable_pglogical as shown in the following screenshot on edit page.
Enable pglogical extension on all DBs to make sure nothing left behind to replicate, enabling on demo and postgres DB.
Make sure below parameters are set of at least these default values, otherwise review the appropriate value per DMS doc.
Grant the replication role to connecting user POSTGRES in DMS tasks. Please review the above mentioned DMS doc page for additional configuration.
Lets review the user permission:
Restart the DB instance.
DMS does not support certain features like tables without a primary key. You have to mitigate this using surrogate keys as prescribed in documentation. Review if you have any more complexities listed here, please resolve before you proceed.
3 — Create DMS replication task and kick-start replication
In this section, we create DMS instance to kick-start the replication:
3.1 — Create DMS connection profiles
In order to create a DMS replication task, it is mandatory to have source database configuration. Go to the DMS Cloud Console page and in the left panel choose “Connection Profiles”. Choose Create Profile at top followed by selecting “Cloud SQL for PostgreSQL” from drop down since our source database is the one we created in previous section:
Next, provide all the configuration relevant to the source database. In our case, it is shown in the below screenshot and choose to create a source database connection profile which can be used in one or more DMS migration tasks.
3.2 — Create DMS job and kick-start replication
In this section, we will create a DMS replication job using source database connection profile and kick-start the database replication. This step is comprehensive because it validates database and network configurations required to establish a replication.
Go to DMS Console page and choose “Migration Job” from left panel and click on “Create Migration Job”
In “Get Started” Step — specify a unique job name and choose the source DB profile created in the previous section. Make sure you choose the right region, as in our case the region is “Mumbai” and “Continuous” as a Migration Job type. This will enable Change Data Capture between source and target database until the job is marked to be completed. Continuous (sometimes referred to as ongoing or online) migration is a continuous flow of changes from your source to your destination that starts with an initial full dump and load. Bottom part is a prerequisite which is a voluntary section consisting of an “Open” link. This link guides you to check all prerequisites to configure the replication. Once all requirements are met for source DB and connectivity, move forward to the next step.
On the next step “Define a Source”, select the connection profile created in the previous section and move forward to “Create a destination”.
In the “Create a destination” step, make sure to specify the target PG version you intend to upgrade and other configuration such as credentials and instance name. Once you continue to move forward, DMS will attempt to create the DB instance.
You can check progress of new DB instance creation on Cloud SQL console page:
On DMS job console page and on step “Define connectivity method”, specify connectivity method as below in screenshot because we are using private IP between DB and App host.
Once a New/Target DB instance is created, you can continue to the next step to validate all configurations with “test migration” step.
If the above step is not successful, then fulfill the prerequisite configuration as prescribed in DMS doc .
Finally, go with “CREATE & START JOB” to kickstart migration right away because at this moment in our case we don’t have any reason to wait but in your case if you have any reason to wait then feel to free to go with “CREATE JOB” and you can start DMS job later on when you are ready.
Once the DMS job is started, monitor it until it changes its status transition from “starting” through “Full Dump in progress” to “CDC in progress”.
Keep monitoring DMS and DB operations tabs to see if there are any bottlenecks.
Once migration job status reaches “CDC in progress”, you are good to validate the replication and do a cutover to target DB, however you can continue to run in CDC status as long as you want and are not ready to cutover the DB. Let’s validate if all rows copied from source to target.
3.3 Validate DMS replication
Find out target DB private IP, connect using psql and change the prompts name to simplify the validations.
Validate DB and schema creations
Next, get the count of rows in target.
Connect to source DB and change prompt name to simplify the validation:
Get and Validate count of rows in source and target DB. All good !!
Let’s validate the CDC (Change Data Capture) by inserting one row in one of the source table and again match the number of rows with target tables.
All good as we can see after inserting one row the total row count remains as expected, hence the CDC is working fine and we are good to do cutover to target DB.
3.4 Cutover to Target DB and measure downtime for app tier.
Lets measure the downtime on app tier while performing the cutover. Note downtime the start time as below:
Before you start, please check there should not be any CDC lag and this can be checked as one of the operation metrics of our DMS job.
As we emulated app tier using cloud SQL proxy, we will shut it down so that there is no more connection from app to DB as follows. Since it is running as background process, we will kill Cloud SQL proxy which is still taking connection to old/source DB
Next, we will stop replication and promote target DB as standalone, out of DMS replication. Go to the DMS console page and use the PROMOTE option to gracefully stop the CDC and promote Target DB. Status of DMS will change as below
Wait until below the state of “Completed….”.
If you check Cloud SQL console page, you will see target DB as standalone.
Now bring up Cloud SQL proxy and start receiving connections to target DB. For this check the connection name of target DB from Cloud SQL console page, click “pg-db-13”, on overview tab, note Connection name:
Use the same connection in bringing up Cloud SQL proxy as shown in below screenshot.
Using Cloud SQL proxy, connect to target DB and validate new upgraded version and set of table data.
Next you can declare the end of downtime and we will record the time again. This shows less than 10 min of downtime.
Below was start time
You can still reduce the downtime further by automating steps instead of doing it manually because target DB promotion is less than 1 min as observed in operation log.
Make sure old is unavailable and best is to lock the users and shut it down.
Database Migration Service is offered at no additional charge for native migrations to Cloud SQL. As long as you are bearing the cost of the existing database, there is no cost except if any egress charge is applicable which likely to be not the case unless you plan to move the DB out of the current region, Zone or VPC subnet. There’s no additional charge for network ingress from external source databases. However, costs can be incurred outside of Google such as platform egress charges.
Best practices, common traps and additional references:
Please review the limitations of DMS and seek out any more complexities. However there are few known issue as following:
- Review and match your source database configuration on how to prepare the source DB for DMS replication is here.
- DDL Changes between source and destination needs special attention, please refer here.
- If source DB has a complex scenario such as table without primary key, materialised view, large objects, then refer here.
- While creating a DMS job, make sure the target instance must have all the required configuration such storage, CPU and network configuration.
- Since networking scope of this blog post is limited to private networks, follow-on this reference for complex networking scenario for DMS configuration.
- Since this blog post is a superset use-case of homogeneous database migration, please review its best practices.
Call to action
This is another use-case where Replication can help to reduce the DB upgrade downtime. In this blog post, we observed downtime is reduced to less than 10 min using DMS and still there is potential to reduce the downtime by orchestrating and automating all the possible steps. I will encourage you to try out this strategy because DMS is free to use and you will have only to bear the cost of DB and data ingress or outgres charges if applicable.
Special thanks to my team mate — Sourabh Gupta who reviewed this blog.
Love to hear your feedbacks, Cheers !!