Minimal Downtime Database Migration from MariaDB Galera Cluster to CloudSQL for MySQL at Production —An Overview to Google Cloud Database Migration Service

Ridwan Fajar
NiceDay Development
10 min readJul 1, 2022

--

Overview

Database migration is not an easy task while keep the infrastructure stable and less risky. It might led to some critical risks such as downtime for services or data loss during the migration. By using minimal downtime migration, we could reduce downtime risk to our service into very minimum time. Also by using kind of replication technique, we could prevent data loss from our database.

Therefore, there is a tool called Database Migration Service from Google Cloud Platform (GCP) that could help us to minimize downtime and data loss risk during the database migration. I will try to explain the concept of minimal downtime database migration by using GCP Database Migration Service based on our real database migration in production from MariaDB Galera Cluster to CloudSQL for MySQL.

Let’s go to the next section.

A. Objectives and motivations

There are some objectives behind this migration. First, we want to have resilient and secure MySQL database setup for our chat service. Second, we want infrastructure team to have more time on building a better platform for NiceDay developers.

Since we are going to let Google Cloud team manage our production databases, we could focus on another aspects such as building automation for our operational tasks, automation to fasten development cycle, maintain site reliability for our NiceDay services and many more.

Beside the objective, we also have some motivations that led us to perform this migration such as:

  • Maintainability, CloudSQL offer some features to support the objective such as auto-resize disk, high availability node, replica node, managed security patch, update OS easily, etc
  • Get rid of outdated software, our MariaDB distribution which are currently running is an outdated software. Also for the Ubuntu16.04 which host the database. So, it is better to rebuilding the database with the new version of MySQL or MariaDB and latest OS version rather than keep our outdated software for long term
  • Interoperability, moving to CloudSQL help us to achieve higher interoperability. That means when we want to move to another cloud such as AWS or Azure, we could migrate the database easily by using their database migration service
  • Easier integration with other GCP services, CloudSQL are well integrated with other GCP services . So it could level up our cloud infrastructure in GCP.

B. Existing chat database setup

Our MariaDB Galera cluster is self-hosted database by Infrastructure Team and hosted at Google Cloud Platform. You can see the overview in the below diagram:

High level diagram for chat database setup at NiceDay Cloud Infrastructure

There were three services which pointing to MariaDB. They were Backend, Chat and Airflow. Chat service is Ejabberd server. The service perform read and write operation. Backend and Airflow are perform read operation only.

We are using MariaDB Galera Cluster which consists of three nodes as master-master replication. MariaDB version is 10.1.0 (released at November 2018) which is equivalent to MySQL 5.7.0 at the time this article written. All Galera nodes run under Ubuntu 16.04 which has already outdated.

The main thing that challenge us to perform zero downtime database migration is the number of chat records. Total size of exported chat database is more than 50 GB per June 2022. Total rows in archive table is almost 48 million rows of chat (markers, last seen, chat messages, etc.).

C. Migration option

Before we jump to the actual migration, we compare some options such as lift-and-shift migration and Database Migration Service. Both of them are possible to performed. Let’s take a look into detail for each options.

C.1. Lift and shift migration

As it is name it is just lift the old database and move to the new destination. We only need to perform database export then import the exported database into new destination. Really simple and straightforward. We conducted research to two different approach of lift-and-shift migration. Here we go!

C.1.1. MySQL CLI

By performing this operation, we only need to export database from source then import the database to new destination. Unfortunately, this approach required downtime for Backend and Chat service. The downtime took less than 4 hours in Production-Testing environment.

lift-and-shift migration by using mysqldump and mysql-cli

It’s too bad and might consume all error budget in our SLA for Backend and Chat service downtime. So, we didn’t choose this approach.

C.1.2. CloudSQL Import Tools

It’s similar to previous option. But it has additional step that should send the backup file to Cloud Storage. Then CloudSQL Import Tools perform database import by reading the backup file from CloudSQL. Unfortunately, this approach also required downtime for Backend and Chat service. Moreover, the downtime took less than 5 hours in Production-Testing environment.

lift-and-shift migration by using mysqldump and CloudSQL Import Tools

It’s more bad than previous option. So, we didn’t choose this approach.

C.2. Database migration service

Luckily, GCP offer a good service for migrating database with minimal downtime approach. We were interesting to use this service. But unfortunately, MariaDB currently is not supported yet.

List of supported source databases for GCP Database Migration Service

Basically, we assume MariaDB could be a source for Database Migration Service. But it isn’t and really unfortunate. If we could use this service, zero downtime database migration is possible 100%.

Assuming that Database Migration Service worked for MariaDB. But it isn’t.

We want to achieve zero downtime for database migration. So without stopping the traffic (except switchover), our database could be synced from MariaDB Galera Cluster to CloudSQL in real time. But how it is possible if GCP doesn’t support it officially?

During the discussion with infrastructure team internally, we found an experimental idea to add MySQL as intermediary between MariaDB source to Database Migration Service. Check the diagram below:

Adding MySQL chain replica between MariaDB source and Database Migration Service

MariaDB support slave replication to MySQL officialy, that’s the point that really help us. But how MySQL slave node could be a source for Database Migration Service?

The answer is by performing chain replica. So within the slave node, we enable it as master node in the same place. By using this approach, Database Migration Service is able to read the MySQL chain replica node as if it’s master source. Finally, we’ve found this applicable in Production-Testing environment.

How about the implementation in the Production environment? Let’s go to the next section.

D. Switchover step by step — an illustration

Technically, we choose Database Migration Service as our possible solution to perform zero downtime database migration for our MariaDB Galera Cluster to CloudSQL MySQL.

We performed the switchover outside SLA hours which exactly took at 09.30 AM GMT+7 until 10.00 AM GMT+7 (Jakarta Local Time) at Sunday, 12th June 2022. Our SLA hours is 09.00 AM GMT+7 until 11.00 PM GMT+7 (Jakarta Local Time + Central European Time) from Monday to Friday. So it would be safer to perform the switchover during weekend.

D.1.1 Preparing the migration

Before the switchover, we have prepared the database migration by deploying Database Migration Service at Friday, 10th June 2022. The longest part of database migration itself is preparing the chain replication. We have to enable:

  • standard master node in on of our MariaDB Galera nodes
  • deploy new VM with MySQL 5.7.0 installed on it
  • connecting the MySQL node as slave to the MariaDB node
  • enable master node inside MySQL slave node to become chain replica node
  • deploy Database Migration Service by using MySQL chain replica as as ource database
  • let the Database Migration Service performing CDC operation to the CloudSQL replica node

It’s a quite tough preparation. But we are happy to see this was really work. The chat messages from MariaDB synced to CloudSQL replica node. That is indicating that the Database Migration Service has worked and by using chain replication has became a possible solution even if it’s bit experimental.

When Database Migration Service perform synchronization for the first time, we didn’t see any downtime there for Backend and Chat service. We could assume that the database migration was zero downtime. Even there was a lag during the synchronization, CloudSQL replica node was able to catch up with latest changes from MariaDB node. It’s really cool!

How about the switchover itself? let’s move to the next section.

Step 1 — Deploy Database Migration Service

D.1.2 Switchover for Backend and Chat service

We performed the switchover at Sunday, 12 June 2022 from 09.30 AM GMT+7 until around 10.00 AM GMT+7. Before we jump to stop related services, we checked our daily load testing results which finished two hours before the switchover schedule. Usually, we compare the load testing result before and after major changes to our cloud infrastructure.

We stopped traffic from Backend and Chat service. Also stopping Airflow from performing many DAGs within it. We ensured there was no incoming traffic to all MariaDB nodes by using netstat command and show processlist SQL query. We also check a bit whether the CloudSQL replica node is ready to be promoted.

Step 2 — Start switchover by stopping related services

Then, we promoted CloudSQL replica node to become master node from Database Migration Service console page. We were waiting until it finished.

Step 3 — Promote CloudSQL replica node to become master node

Finally, promote master task finished. Database Migration Service stopped replication from source database. CloudSQL MySQL master node became an independent instance and ready for incoming traffics.

Step 4 — Ensure that the CloudSQL become master node

Last step of this switchover was change IP Address for chat database from MariaDB node IP address to CloudSQL MySQL master node IP address. We changed the IP address through database proxy service and no need for rewriting the codebase or redeploy the service.

Step 5 — Switchover chat database IP Address at Backend and Chat service to use CloudSQL Master Node IP Address

During the switchover, Backend service was experiencing downtime for 8 minutes. Chat service was experiencing downtime for 12 minutes. It’s still better than both services should experience downtime for more than 4 hours if we choose lift-and-shift migration.

D.1.3 Enabling high availability

To make the CloudSQL MySQL highly available, we enabled standby node in different zone. Just in case there is an outage in current zone (europe-west4-a), the standby node could be triggered by fail over mechanism as GCP mentioned in the CloudSQL documentation.

Step 7 — Enable high availability for CloudSQL MySQL

We enabled this feature still on Sunday, 12th June 2022. Unfortunately, there was additional downtime for Backend service when enabling HA feature. Chat service wasn’t affected at all.

D.1.4 Add replica node

We added replica node to separate loads from production services and ETL services. Airflow have some scheduled DAGs. Those DAGs are quite intensive operation and might block production services (Backend and Chat).

The replica node is also useful if data team or external researcher want to explore chat database without interrupting the production services.

Step 8 — add new replica node for CloudSQL MySQL

We added this replica at Sunday, 12th June 2022. Backend and Chat service weren’t affected by this step.

D.1.5 Switchover for Airflow

Finally, Airflow was the last service that get switchover. Airflow now pointing replica node of new chat database. So it won’t interrupt the master node from Backend or Chat service.

Step 9 — Airflow switchover from old MariaDB to new CloudSQL for MySQL

The downtime took 24 hours, because we need to perform some validations with data team in the Monday. The migration itself was performed at Sunday morning Jakarta Local Time (GMT+7) and it was outside our office hours. But the downtime is acceptable because existing ETL tasks are not streaming operation.

After we fixed some issues, Airflow has back at Monday, 13 June 2022.

E. Summary

Finally, we have migrated our chat database to CloudSQL for MySQL successfully. By using Database Migration Service, the database migration duration is totally 0 ms.

During switchover, it still took around 12 minutes. Also there was some additional downtime for 5 minutes for Backend service when enabling HA feature. Again, It is still better than by using lift-and-shift strategy which might take 4–5 hours.

After the switchover, we performed load testing to the new setup. We saw that there was no different result when comparing before and after using CloudSQL MySQL.

The final architecture is described as in below diagram:

For high availability, we enabled HA feature in CloudSQL. To separate workload request from Airflow, we configure Airflow to read replica node only. Then, master node will be only read and write by chat and backend service.

There is also room for improvement such as:

  • Read operation might be load balanced between master or replica node.
  • Finding another better switchover technique that minimize downtime

How about the actual implementation? I will create another post that cover the detail of the migration implementation. See you soon.

References

Special thanks to Ricky (Team Lead) and William from Infrastructure Team who helped me a lot while in discussion to make this migration possible. And they were also helping me during the switchover at the weekend. They are really nice guys.

Also thanks to stakeholders at NiceDay such as Umar (CTO), Merel (Product Owner), Alison (Product Support Team), Jorge (Data Engineer Lead), Pambo (Backend Engineer Lead), Syahmia (QA Lead), Kiky (QA), Harry (QA Internship) and Khalid (QA Automation Engineer) who involved in this migration to coordinate during the migration and led the migration to minimum risks. Superb!

--

--