Minimal Downtime Database Migration from MariaDB Galera Cluster to CloudSQL for MySQL at Production — An Implementation Guide
A. Overview
Previously, I wrote an article with a similar title but at a more high-level view. Now, I want to explain the implementation and configuration of the MySQL chain replica. You can find that article linked in the References section at the end of this post.
After you finish with that article, you can come back here and move on to the next section. Hopefully it gives you the context for this one.
Happy reading!
B. Configure chain replication
Let's say the MariaDB node that will act as the master node for chain replication is called mariadb-dqs2, and the MySQL chain replica VM is called mysql-chain-replica-1.
Let’s go to the next section.
B.1. Configure one of the MariaDB nodes as the master node
Add these lines to /etc/mysql/my.cnf to enable the master configuration on mariadb-dqs2. You can use a text editor such as Vim or Nano.
# master node configuration
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = ejabberd
binlog_format = row
bind-address = 10.0.3.47
log_slave_updates=ON
Restart the mysql service on mariadb-dqs2 from the console.
$ sudo systemctl restart mysql
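If you want a quick sanity check that binary logging is active after the restart, you can optionally run the following through the MySQL console on mariadb-dqs2; log_bin should report ON and SHOW MASTER STATUS should list the current binlog file and position.
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'log_bin';
MariaDB [(none)]> SHOW MASTER STATUS;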
Create a new MariaDB user for replication on mariadb-dqs2 through the MySQL console.
MariaDB [(none)]> SET SESSION SQL_LOG_BIN=0;
MariaDB [(none)]> CREATE USER 'replica'@'%' IDENTIFIED BY '<< password placeholder >>';
MariaDB [(none)]> GRANT REPLICATION SLAVE ON *.* TO 'replica'@'%';
MariaDB [(none)]> GRANT EXECUTE ON *.* TO 'replica'@'%';
MariaDB [(none)]> GRANT SELECT ON *.* TO 'replica'@'%';
MariaDB [(none)]> GRANT SHOW VIEW ON *.* TO 'replica'@'%';
MariaDB [(none)]> GRANT REPLICATION CLIENT ON *.* TO 'replica'@'%';
MariaDB [(none)]> GRANT RELOAD ON *.* TO 'replica'@'%';
MariaDB [(none)]> SET SESSION SQL_LOG_BIN=1;
MariaDB [(none)]> SHOW GRANTS FOR replica@'%'\G;
*************************** 1. row ***************************
Grants for replica@%: GRANT SELECT, RELOAD, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, SHOW VIEW ON *.* TO 'replica'@'%' IDENTIFIED BY PASSWORD '<< password placeholder >>'
1 row in set (0.00 sec)
B.2. Create database backup for slave node
Desync mariadb-dqs2 from the Galera Cluster through the MySQL console.
MariaDB [(none)]> SET GLOBAL wsrep_desync = ON;
Check that desync is enabled for mariadb-dqs2 through the MySQL console.
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%desync%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | ON    |
+---------------+-------+
1 row in set (0.00 sec)
MariaDB [(none)]> SHOW GLOBAL STATUS LIKE '%desync%';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_desync_count | 1     |
+--------------------+-------+
1 row in set (0.01 sec)
Dump the chat database from mariadb-dqs2 from the console. The --master-data=2 option records the master's binary log file and position as a comment in the dump, which we will need later when configuring the slave.
$ time mysqldump -h 10.0.3.47 -u ejabberd -p --master-data=2 --single-transaction --quick --opt ejabberd > latest.sql
Ask your GCP administrator to create a new service account. The service account should have read and write access to the bucket that will store your initial backup file. I took this approach because I couldn't transfer the backup file directly from the mariadb-dqs2 VM to the mysql-chain-replica-1 VM.
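For reference, if you have the permissions yourself, creating the service account and granting it access to the bucket could look roughly like the sketch below. The account name, key file and bucket are the ones used later in this article, so adjust them to your environment.
$ gcloud iam service-accounts create cloudsql-migration --project=sense-health
$ gcloud iam service-accounts keys create cloudsql-migration.json \
    --iam-account=cloudsql-migration@sense-health.iam.gserviceaccount.com
$ gsutil iam ch \
    serviceAccount:cloudsql-migration@sense-health.iam.gserviceaccount.com:roles/storage.objectAdmin \
    gs://bucket-production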
It would have been better to find a more direct way to transfer the file, but due to limited time I didn't manage to. One option would have been to adjust the firewall so the file could be copied to mysql-chain-replica-1 with scp. Sorry about that.
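For completeness, that scp route could have looked roughly like the sketch below: open the firewall for SSH from mariadb-dqs2 to mysql-chain-replica-1, then copy the dump directly. The rule name is made up, the network name is an assumption based on the "Production" network used later, and the paths match the ones used in this article.
$ gcloud compute firewall-rules create allow-migration-scp \
    --network=production \
    --direction=INGRESS \
    --allow=tcp:22 \
    --source-ranges=10.0.3.47/32
$ scp latest.sql ridwan@10.0.3.49:/home/ridwan/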
Temporarily upload the backup file from mariadb-dqs2 to Google Cloud Storage from the console.
$ GCP_ACCOUNT_NAME=cloudsql-migration@sense-health.iam.gserviceaccount.com
$ GCP_KEY_FILE=/home/ridwan/cloudsql-migration.json
$ GCP_PROJECT_ID=sense-health
$ gcloud auth activate-service-account $GCP_ACCOUNT_NAME --key-file=$GCP_KEY_FILE --project=$GCP_PROJECT_ID
$ gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp latest.sql "gs://bucket-production/mariadb-dqs2/"
B.3. Restore database backup at slave node
Once the upload is done, download the latest backup file from Google Cloud Storage to mysql-chain-replica-1.
$ GCP_ACCOUNT_NAME=cloudsql-migration@sense-health.iam.gserviceaccount.com
$ GCP_KEY_FILE=/home/ridwan/cloudsql-migration.json
$ GCP_PROJECT_ID=sense-health
$ gcloud auth activate-service-account $GCP_ACCOUNT_NAME --key-file=$GCP_KEY_FILE --project=$GCP_PROJECT_ID
$ gsutil cp gs://bucket-production/mariadb-dqs2/latest.sql .
Restore the chat database to mysql-chain-replica-1.
$ time mysql -u ejabberd ejabberd -p < latest.sql
It will take some time for the restore to finish.
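Since the restore runs for a while, one option is to run it in the background so a dropped SSH session does not interrupt it. This is just a minimal sketch; the password has to be passed inline here because the prompt is not interactive.
$ nohup mysql -u ejabberd -p'<< password placeholder >>' ejabberd < latest.sql > restore.log 2>&1 &
$ tail -f restore.log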
B.4. Configure slave node
Get the MASTER_LOG_FILE and MASTER_LOG_POS values from the backup file downloaded to mysql-chain-replica-1.
$ head -n 50 latest.sql | grep MASTER_LOG_FILE
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4009680;
Add the following configuration to /etc/mysql/mysql.conf.d/mysqld.cnf on mysql-chain-replica-1.
server-id = 2
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = ejabberd
binlog_format = row
bind-address = 10.0.3.49
relay-log = /var/log/mysql/mysql-relay-bin.log
log_slave_updates = ON
sync_binlog = 1
read_only = 1
This configuration makes mysql-chain-replica-1 act as a slave of mariadb-dqs2 (the master node from the Galera Cluster) and, at the same time, as a source database (a master in its own right) for GCP Database Migration Service, all within a single instance. That is why we can call this chain replication.
Now, restart the mysql service on mysql-chain-replica-1.
$ sudo systemctl restart mysql
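After the restart, you can optionally confirm that log_slave_updates and read_only took effect, since writing the replicated transactions to its own binary log is exactly what allows mysql-chain-replica-1 to act as a source for the next hop in the chain.
mysql> SHOW GLOBAL VARIABLES LIKE 'log_slave_updates';
mysql> SHOW GLOBAL VARIABLES LIKE 'read_only';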
Set the values for MASTER_LOG_FILE and MASTER_LOG_POS based on the output from the first step of this section. Then enable the slave configuration on mysql-chain-replica-1 through the MySQL console.
mysql> SET SESSION SQL_LOG_BIN=0; STOP SLAVE;
mysql> CHANGE MASTER TO
mysql> MASTER_HOST='10.0.3.47',
mysql> MASTER_USER='replica',
mysql> MASTER_PASSWORD='<< password >>',
mysql> MASTER_LOG_FILE='mysql-bin.000001',
mysql> MASTER_LOG_POS=4009680;
mysql> START SLAVE; SET SESSION SQL_LOG_BIN=1;
Check the slave status of mysql-chain-replica-1. If the configuration is applied successfully, you should see output like this:
mysql> SHOW SLAVE STATUS \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.3.47
Master_User: replica
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
...
Master_Server_Id: 1
Master_UUID:
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
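While the slave catches up with the master, you can keep an eye on the replication health from the shell of mysql-chain-replica-1. A small sketch, assuming the client credentials are available (for example via ~/.my.cnf); Slave_IO_Running and Slave_SQL_Running should both be Yes and Seconds_Behind_Master should trend towards 0.
$ mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master"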
Check the master status and the registered slave hosts on mariadb-dqs2 through the MySQL console.
MariaDB [ejabberd]> show master status;
+------------------+----------+------------------+------------------+
| File             | Position | Binlog_Do_DB     | Binlog_Ignore_DB |
+------------------+----------+------------------+------------------+
| mysql-bin.000001 |  4029680 | ejabberd         |                  |
+------------------+----------+------------------+------------------+
1 row in set (0.00 sec)
MariaDB [(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
|         2 |      | 3306 |         1 |
+-----------+------+------+-----------+
1 row in set (0.00 sec)
Now that we've finished the slave configuration, disable desync for mariadb-dqs2 through the MySQL console and let mariadb-dqs2 sync with the other Galera nodes again.
MariaDB [(none)]> SET GLOBAL wsrep_desync = OFF;
Compare the row count of the chat dataset between mariadb-dqs2 and mysql-chain-replica-1 through the MySQL console to make sure replication works as expected:
mysql> select count(*) from archive;
+----------+
| count(*) |
+----------+
| 48052497 |
+----------+
1 row in set (9.24 sec)
MariaDB [ejabberd]> select count(*) from archive;
+----------+
| count(*) |
+----------+
| 48052497 |
+----------+
1 row in set (9.34 sec)
Congratulations! We are done configuring chain replication between the MariaDB node and the MySQL node. Let's move on to the next section.
B.5. Configure slave node as source for GCP Database Migration Service
Check that mysql-chain-replica-1 is acting as a master node.
mysql> show master status;
+------------------+----------+------------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB     | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+------------------+------------------+-------------------+
| mysql-bin.000014 |      154 | ejabberd         |                  |                   |
+------------------+----------+------------------+------------------+-------------------+
1 row in set (0.00 sec)
Create the replica_chain user on mysql-chain-replica-1 for GCP DMS.
mysql> SET SESSION SQL_LOG_BIN=0;
mysql> CREATE USER 'replica_chain'@'%' IDENTIFIED BY '<< password >>';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'replica_chain'@'%';
mysql> GRANT EXECUTE ON *.* TO 'replica_chain'@'%';
mysql> GRANT SELECT ON *.* TO 'replica_chain'@'%';
mysql> GRANT SHOW VIEW ON *.* TO 'replica_chain'@'%';
mysql> GRANT REPLICATION CLIENT ON *.* TO 'replica_chain'@'%';
mysql> GRANT RELOAD ON *.* TO 'replica_chain'@'%';
mysql> SET SESSION SQL_LOG_BIN=1;
We are almost done. Let's move on to the section on configuring GCP Database Migration Service.
C. Configure GCP Database Migration Service
C.1. Create Database Migration Service
Create a new database migration job. You just need to go through a few forms to create it. In the Get started form, you fill in the migration job information.
In the Define a source form, you specify the source database information: IP address, database name, replication user, and so on. Use the replica_chain user when defining the source.
In the Create a destination form, you specify the destination database: machine type, the network where the destination database will live, and so on. Create mysql-production as the destination of the migration job.
In the Define connectivity method form, you connect the CloudSQL replica node we just defined, mysql-production, with the source database mysql-chain-replica-1 through VPC peering on the "Production" network.
In the Test and create migration job form, the CloudSQL instance is created right away, and we can still configure the migration job afterwards. Test the connection between source and destination, then create the migration job.
Sometimes you might face a network issue caused by a firewall rule. Even if Database Migration Service is able to create the replica node, you still have to allow the IP address of the CloudSQL replica node to access the source database. Then test the connection once again before you finalize the Database Migration Service job.
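In practice this comes down to an ingress firewall rule on the source network that allows MySQL traffic from the CloudSQL replica, roughly like the sketch below. The rule name is made up, the network name is an assumption based on the "Production" network, and <CLOUDSQL_OUTGOING_IP> stands for the outgoing IP address shown on the CloudSQL instance page.
$ gcloud compute firewall-rules create allow-cloudsql-replica-to-source \
    --network=production \
    --direction=INGRESS \
    --allow=tcp:3306 \
    --source-ranges=<CLOUDSQL_OUTGOING_IP>/32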
Once the job is created, you can check whether your connection profile is correct. You can edit or delete the connection profile you specified when creating the migration job. An incorrect connection profile can also cause errors when testing the connection between the CloudSQL replica node (destination) and the source database.
C.2. Run Database Migration Service
Once Database Migration Service successfully connects the source and destination, it performs a full dump of the database. This may take some time depending on the size of your database; since our database was around 50 GB, the full dump took about an hour and a half.
This did not cause the production service any downtime; it ran without any issue at all.
Once the full dump finishes, Database Migration Service switches to the CDC (change data capture) phase, which keeps the source database in sync with the CloudSQL replica for as long as we do not promote the CloudSQL replica to a master node.
Every new chat message sent to the MariaDB cluster is synced to the CloudSQL replica node.
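You can follow the job phase (full dump, then CDC) from the Database Migration Service page, or from the command line with gcloud, which reports the job state and its current phase. This is only a sketch; the job name mysql-production-job and the europe-west4 region are placeholders, and the database-migration command group may require a recent gcloud version.
$ gcloud database-migration migration-jobs describe mysql-production-job \
    --region=europe-west4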
D. Switchover
D.1. Switchover step by step
To learn more about this step, you can jump to my previous article, which gives an illustrated overview of the minimal downtime database migration (see the References section).
After reading that article, you can continue to the following section.
D.2. Load testing result
After the switchover, our backend and chat services in the production environment point to the new chat database on CloudSQL for MySQL. Before the switchover, we performed load testing on the backend and chat services. Here is the result:
- Time: Sunday, 12th June 2022 08:09 GMT+7
- Active users: 100
- Total requests: 31773 requests
- Rate: 56.41 requests / seconds
- Total errors: 1515 requests (expected; there is a bug in the load testing script)
- Error rate: 0.05 requests / seconds
- Min duration: 0.79 ms
- Max duration: 3084.96 ms
- Average duration: 29.89 ms
- Median duration: 6.04 ms
- Percentile 90th duration: 91.93 ms
- Percentile 95th duration: 155.03 ms
After the switchover, we performed load testing as well, and here is the result:
- Time: Sunday, 12th June 2022 10:39 GMT+7
- Active users: 100
- Total requests: 31773 requests
- Rate: 55.62 requests / seconds
- Total errors: 1494 requests (expected; there is a bug in the load testing script)
- Error rate: 0.05 requests / seconds
- Min duration: 0.71 ms
- Max duration: 535.71 ms
- Average duration: 47.27 ms
- Median duration: 5.95 ms
- Percentile 90th duration: 161.32 ms
- Percentile 95th duration: 182.59 ms
From the results above, we can conclude that there were no anomalies after the switchover. We can consider the migration safe: it did not lead to downtime or failure for our backend and chat services.
We have also run these tests every morning since 12 June, when we performed the switchover, and we have seen no anomalies at all.
D.3. Smoke testing result from QA Team
On the following Monday, 13 June 2022, we got a ton of help from the QA team, both for manual testing and automated testing. We got a report from Khalid (QA Automation Engineer), who tested the web portal and mobile apps using their automated test suite. Here is the report:
- Environment: Production
- iOS test cases: 100% passed
- Android test cases: 100% passed
- Firefox (web browser) test cases: 100% passed
- Chrome (web browser) test cases: 100% passed
Kiky and Harry also reported back after performing smoke testing on the web portal and mobile apps, following their manual test procedures:
- Environment: Production
- iOS test cases: 100% passed
- Chrome (web browser) test cases: 100% passed
E. Migration results
On 12 June 2022, we performed the switchover by promoting the CloudSQL replica to a master node. After the promotion, the Database Migration Service job state turned into "Complete", and the Promote, Restart, Stop and Resume buttons became disabled.
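We did the promotion from the console UI; for reference, the equivalent gcloud command would look roughly like this, with the same placeholder job name and region as before.
$ gcloud database-migration migration-jobs promote mysql-production-job \
    --region=europe-west4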
Here is the final result of promoting the replica node to a master node. The master node has:
- MySQL v5.7.x
- Auto disk resize is enabled
- Automated backups are enabled
- PITR is enabled
- Regional high availability is enabled; the standby node is located in a different zone, and the master node's zone is europe-west4-c
- Some database flags are added to optimize RAM usage so that around 80% of the dataset is cached (innodb_log_buffer_size, innodb_buffer_pool_size, innodb_log_file_size, innodb_write_io_threads); see the sketch after this list
- CPU (8 cores), RAM (32 GB) and SSD storage (100 GB as starter) are specified similar to MariaDB node specification
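These flags can also be applied from the command line. This is only a sketch, with illustrative placeholder values rather than the exact ones we used; note that gcloud sql instances patch replaces the whole set of database flags, so always pass the complete list.
$ gcloud sql instances patch mysql-production \
    --database-flags=innodb_buffer_pool_size=25769803776,innodb_log_buffer_size=67108864,innodb_log_file_size=536870912,innodb_write_io_threads=8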
The master node also has its own replica node. Here is the specification of the replica node:
- MySQL v5.7.x
- Auto disk resize is enabled
- Automated backups are enabled
- PITR is enabled
- Regional high availability is enabled
- Some database flags are added to optimize RAM usage so that around 80% of the dataset is cached (innodb_log_buffer_size, innodb_buffer_pool_size, innodb_log_file_size, innodb_write_io_threads)
- CPU (8 cores), RAM (32 GB) and SSD storage (100 GB as starter) are specified similar to MariaDB node specification
F. Conclusion
By using standard MySQL replication between a MariaDB node and a MySQL node, we could use Database Migration Service to migrate from a MariaDB Galera Cluster to CloudSQL for MySQL with minimal downtime.
The data migration itself, from mysql-chain-replica-1 to the CloudSQL replica node, caused zero downtime. The switchover, however, took around 12 minutes, because we needed to reconfigure the backend and enabling HA required downtime. We performed the switchover outside SLA hours (on the weekend, Saturday morning).
It really made me happy. Thanks to the Infrastructure Team!
References
- Minimal Downtime Database Migration from MariaDB Galera Cluster to CloudSQL for MySQL at Production — An Overview to Google Cloud Database Migration Service
- GCP Database Migration Service Quickstart
- Using MariaDB Replication with MariaDB Galera Cluster
- 7 Steps to Setup Daisy Chain Replication of MySQL 5.1 to 5.7 to 8.0
- How to get last log position from MySQL Master Node
Special thanks to Ricky (Team Lead) and William from the Infrastructure Team, who helped me a lot in the discussions that made this migration possible and who also helped me during the switchover at the weekend. They are really nice guys.
Also thanks to the stakeholders at NiceDay, such as Umar (CTO), Merel (Product Owner), Alison (Product Support Team), Jorge (Data Engineer Lead), Pambo (Backend Engineer Lead), Syahmia (QA Lead), Kiky (QA), Harry (QA Intern) and Khalid (QA Automation Engineer), who were involved in coordinating the migration and kept its risks to a minimum. Superb!