Online Schema Migration with MySQL on GCP (Cloud SQL)

rerorero
Nerd For Tech
Published in
5 min readMar 3, 2021

This is a note on trying online schema migration with MySQL on Cloud SQL. I tried those two tools:

  • pt-online-schema-change
  • gh-ost

Terms

I tried them with MySQL under the following conditions:

Online DDL

The first thing that comes to mind for when performing online migration is online DDL. However, there are limitations.

The main advantage of online DDL is that you don’t need to rely on any additional tools, as MySQL provides all of them for you. Under certain conditions, you can access the table being migrated (Concurrent DML) and perform an in-place migration that doesn’t require a large amount of disk space or I/O.

However, there is a major concern regarding Online DDL: even if you run a query with LOCK=None MySQL will only perform it online on the master, blocking the replicas from applying DML until the master completes your DDL. Therefore, running DDL on a large table can cause significant replication lag.

In addition, it is not possible to throttle or pause the migration process, and so the load on the master cannot be controlled.

These problems are not specific to Cloud SQL, but to MySQL (InnoDB) in general.

If you don’t care about load or replication lag, online DDL might be a good option, but it wouldn’t be a common case if you run a service that offers a higher service level. So, I recently tried out two online migration tools to see if they could be used with Cloud SQL.

pt-online-schema-change (pt-osc)

Percona’s pt-osc is an online schema migration tool that has been popular for many years.

It creates a new table with the desired schema, then starts copying rows from the original table. It also creates triggers to mirror the changes that occurred in the original table during the migration process to the new table. Finally, pt-osc renames the table, switching between the original table and the new table.

Run pt-osc with Cloud SQL

When I tried to run pt-osc with Cloud SQL, I had to pay a touch of attention to the following points.

1) log_bin_trust_function_creators must be enabled

As pt-osc creates triggers, it requires a user with SUPER previlige but this is impossible in Cloud SQL. Instead, you can enable log_bin_trust_function_creators in your instance to allow all users to create stored functions and triggers. You can find how to change database flags on Cloud SQL here.

2) Explicitly specify the replica with --check-slave-lag

pt-osc monitors replication lag, pauses and throttles itself if it detects a lag greater than --max-lag . pt-osc tries to find replicas by runningSHOW SLAVE HOSTS and SHOW PROCESSLIST on the master by default. However, as long as you have pt-osc connect to MySQL servers via Cloud SQL Proxy, pt-osc should not be able to reach the host information retrieved from the MySQL server. You would see the following warning message:

No slaves found. See — recursion-method if host localhost has slaves.
Not checking slave lag because no slaves were found and — check-slave-lag was not specified.

pt-osc provides a way of explicitly passing replica host information. I was able to get it to recognize replica by adding a flag like —-check-salve-lag D=foo,h=127.0.0.1,P=33001. To see if this has succeeded, you can check the log.

Will check slave lag on:
localhost -> 127.0.0.1:33001

Example of pt-osc command execution

$ pt-online-schema-change \
--alter=“DROP COLUNN bar ”D=test,t=foo,h=127.0.0.1,P=33000,u=root,p=pass \
--check-slave-lag="D=test,h=127.0.0.1,P=33001,u=root,p=pass" \
--print \
--execute

gh-ost

gh-ost is designed as TRIGGER-less to avoid problems caused by triggers. You can see here what kind of problems triggers could cause.

Online migration with gh-ost goes like this: gh-ost creates a new table with the desired schema and copies rows from the original table to the new one. To mirror the changes that occurred in the original table to the new table, gh-ost connects to MySQL server and receives binlog events as though it were a replica server. It then extracts the changes from the binlog events and applies them to the new table in the master. Finally, it performs the final phase called cut-over, where it switches the tables.

Thanks to the triggerless design, gh-ost provides a more flexible throttling and some interactive commands that can control its behavior without pausing the migration process.

Run gh-ost with Cloud SQL

When I tried to run gh-ost with Cloud SQL, I had to pay a touch of attention to the following points.

1) run with --allow-on-master flag

With gh-ost, it is recommended the binlog events be read from a replica in order to reduce the load on the master. It also offers a way to safely test the migration process, where you don’t have to touch the master server by having gh-ost connect to the replica (with — migrate-on-replica or--test-on-replica option).

However, as Cloud SQL doesn't allow to create a replica of a replica, there is no way to have gh-ost receive binlogs from a replica with Cloud SQL. (As mentioned here, it might be possible with an external replica, though)

To have gh-ost connect to the master, --allow-on-master flag is required. Please see this document for more details.

2) run with --gcp flag

I faced the following error when running via Cloud SQL Proxy without --gcp option:

FATAL Unexpected database port reported: 3306

The document says it’s for the 1st generation, but it seems to be required if you run it via a tunnel connection (i.e. if the port specified for the connection is different from the port that is actually being served).

Example of gh-ost command execution

$ gh-ost \
--max-load=Threads_running=25 \
--critical-load=Threads_running=1000 \
--chunk-size=1000 \
--throttle-control-replicas=127.0.0.1:33001 \
--max-lag-millis=3000 \
--user=root \
--password=pass \
--host=127.0.0.1 \
--port=33000 \
--allow-on-master \
--database=test \
--table=foo \
--verbose \
--alter="DROP COLUMN bar" \
--default-retries=120 \
--panic-flag-file=/tmp/host.panic.flag \
--serve-socket-file=/tmp/ghost.test.foo.sock \
--gcp \
--execute

Afterword

I found that both tools somehow work with Cloud SQL. The next step is to run it in my production environment. I will update this note if I find any issues specific to Cloud SQL with schema migration throughout operations.

Works Cited:

--

--