MySQL to Cloud Spanner: Migrate MySQL database to Cloud Spanner using Vertex AI notebooks and GCP Dataproc Serverless

Surjit Singh
Google Cloud - Community
Oct 21, 2022

Looking for a prebuilt solution for parallel migration of MySQL tables to Cloud Spanner? Look no further. We have developed a Vertex AI notebook (Jupyter notebook) solution which uses Dataproc Serverless for the migration. You just need to provide the necessary parameters, and the notebook will help you migrate a complete MySQL database to Cloud Spanner.

Dataproc Templates, using a Vertex AI notebook and Dataproc Serverless, provide a one-stop solution to migrate data directly from MySQL to Cloud Spanner.

1. Enable the below services in your GCP project from the API Console:

  • Compute Engine API
  • Dataproc API
  • Vertex-AI API
  • Vertex Notebooks API

2. Create a User-Managed Notebook in Vertex AI Workbench.

3. Clone the Dataproc Templates GitHub repo using the Git tab, or open a terminal from the launcher window and clone it using git clone:

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

4. From the folders tab, open the MySqlToSpanner_notebook.ipynb notebook present in the path: dataproc-templates/notebooks/mysql2spanner

5. The notebook contains a step-by-step process to help migrate data from MySQL to Cloud Spanner. Below are the steps for migration:

Step 1: Install required packages

Some packages needed for the migration have to be installed separately, as they are not available in the notebook by default, e.g. PyMySQL, SQLAlchemy, the JDK, etc.
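
For example, a notebook cell along the lines below installs the Python packages (a minimal sketch; the notebook's own install cells are authoritative, and the JDK is installed separately):

# Notebook cell: install Python packages not preinstalled in the kernel.
# The package list here is illustrative.
!pip install pymysql SQLAlchemy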

Step 2: Import Libraries: Import the required packages

Step 3: Assign Parameters: The below configurations need to be set before running the notebook (illustrative assignments follow each parameter list below):

Step 3.1: Common Parameters

  • PROJECT : GCP project ID
  • REGION : GCP region
  • GCS_STAGING_LOCATION : GCS staging location used by this notebook to store artifacts
  • SUBNET : VPC subnet
  • JARS : list of JARs. For this notebook, the MySQL connector and Avro JARs are required in addition to the Dataproc Templates JARs
  • MAX_PARALLELISM : number of jobs to run in parallel; the default value is 2
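
As an illustration, the assignments look something like this (all values below are placeholders, not real resources):

# Placeholder values; replace with your own project settings.
PROJECT = "my-project-id"
REGION = "us-central1"
GCS_STAGING_LOCATION = "gs://my-staging-bucket"
SUBNET = "projects/my-project-id/regions/us-central1/subnetworks/my-subnet"
JARS = ["gs://my-staging-bucket/jars/mysql-connector-java.jar"]  # plus the Avro and Dataproc Templates JARs
MAX_PARALLELISM = 2  # number of jobs to run in parallel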

Step 3.2: MySQL to Spanner Parameters

  • MYSQL_HOST : MySQL instance IP address
  • MYSQL_PORT : MySQL instance port
  • MYSQL_USERNAME : MySQL username
  • MYSQL_PASSWORD : MySQL password
  • MYSQL_DATABASE : name of the database that you want to migrate
  • MYSQLTABLE_LIST : list of tables you want to migrate, e.g. ['table1','table2']; else provide an empty list to migrate the whole database, e.g. []
  • MYSQL_OUTPUT_GCS_LOCATION : GCS location where the MySQL output will be written, e.g. "gs://bucket/[folder]"
  • MYSQL_OUTPUT_GCS_MODE : output mode for the MySQL data, one of (overwrite|append)
  • MYSQL_OUTPUT_GCS_FORMAT : output file format for the MySQL data, one of (avro|parquet|orc)
  • SPANNER_INSTANCE : Cloud Spanner instance name
  • SPANNER_DATABASE : Cloud Spanner database name
  • SPANNER_TABLE_PRIMARY_KEYS : dictionary of the format {"table_name":"primary_key"} for tables which do not have a primary key in MySQL
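
Again as an illustration (placeholder values only):

# Placeholder values; replace with your source and target details.
MYSQL_HOST = "10.0.0.2"
MYSQL_PORT = "3306"
MYSQL_USERNAME = "mysql-user"
MYSQL_PASSWORD = "mysql-password"
MYSQL_DATABASE = "mydb"
MYSQLTABLE_LIST = []  # empty list migrates the whole database
MYSQL_OUTPUT_GCS_LOCATION = "gs://my-bucket/mysql-export"
MYSQL_OUTPUT_GCS_MODE = "overwrite"  # one of: overwrite | append
MYSQL_OUTPUT_GCS_FORMAT = "avro"  # one of: avro | parquet | orc
SPANNER_INSTANCE = "my-spanner-instance"
SPANNER_DATABASE = "my-spanner-db"
SPANNER_TABLE_PRIMARY_KEYS = {}  # e.g. {"table_without_pk": "id"}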

Step 3.3: Notebook Configuration Parameters

These parameters contain the paths for the JARs and properties required for the Dataproc job. They do not need to be changed unless you want to change the location of the JARs.

Step 4: Generate MySQL Table List

This step generates the list of tables for migration. All tables from the database are selected if the MYSQLTABLE_LIST parameter in step 3.2 is an empty list.
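
Conceptually, the list can be built with a query against information_schema (a minimal sketch using SQLAlchemy; the notebook's own query may differ):

from sqlalchemy import create_engine, text

engine = create_engine(
    f"mysql+pymysql://{MYSQL_USERNAME}:{MYSQL_PASSWORD}@{MYSQL_HOST}:{MYSQL_PORT}/{MYSQL_DATABASE}"
)
if not MYSQLTABLE_LIST:  # empty list: select every table in the database
    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT table_name FROM information_schema.tables "
                 "WHERE table_schema = :db AND table_type = 'BASE TABLE'"),
            {"db": MYSQL_DATABASE},
        )
        MYSQLTABLE_LIST = [row[0] for row in rows]
print(f"{len(MYSQLTABLE_LIST)} tables selected for migration")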

Step 5: Get Primary Keys for tables not present in SPANNER_TABLE_PRIMARY_KEYS

This step looks up the primary keys for the tables in the MYSQLTABLE_LIST parameter that were not already given one in the SPANNER_TABLE_PRIMARY_KEYS parameter in step 3.2.
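
A sketch of that lookup against information_schema (assuming the engine from the step 4 sketch):

from sqlalchemy import text

with engine.connect() as conn:
    for table in MYSQLTABLE_LIST:
        if table in SPANNER_TABLE_PRIMARY_KEYS:
            continue  # primary key already provided in step 3.2
        rows = conn.execute(
            text("SELECT column_name FROM information_schema.columns "
                 "WHERE table_schema = :db AND table_name = :tbl "
                 "AND column_key = 'PRI' ORDER BY ordinal_position"),
            {"db": MYSQL_DATABASE, "tbl": table},
        )
        pk_cols = [r[0] for r in rows]
        if pk_cols:
            SPANNER_TABLE_PRIMARY_KEYS[table] = ",".join(pk_cols)
        else:
            print(f"WARNING: no primary key found for {table}; set one in SPANNER_TABLE_PRIMARY_KEYS")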

Step 6: Get Row Count of Tables and identify read partition column

This step uses the PARTITION_THRESHOLD parameter (default value 200k); any table having more rows than PARTITION_THRESHOLD will be read in partitions based on its primary key.
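
The decision can be pictured like this (a sketch, not the notebook's exact logic):

from sqlalchemy import text

PARTITION_THRESHOLD = 200000  # default threshold for partitioned reads

partitioned_tables = {}
with engine.connect() as conn:  # engine from the step 4 sketch
    for table in MYSQLTABLE_LIST:
        count = conn.execute(text(f"SELECT COUNT(1) FROM `{table}`")).scalar()
        if count > PARTITION_THRESHOLD:
            # read this table in partitions on the primary key from step 5
            partitioned_tables[table] = SPANNER_TABLE_PRIMARY_KEYS.get(table)
print(f"{len(partitioned_tables)} tables will use partitioned reads")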

Step 7: Calculate Parallel Jobs for MySQL to Cloud Spanner

Based on the MAX_PARALLELISM parameter set in step 3.1, this step calculates the parallel jobs for the MySQL to GCS and GCS to Cloud Spanner migrations; essentially, the number of MySQL tables that will be migrated to Cloud Spanner in parallel.
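
One plausible way to derive those batches (a sketch, not the notebook's exact logic):

# Split the table list into groups of at most MAX_PARALLELISM;
# the tables within a group are migrated in parallel.
BATCHES = [
    MYSQLTABLE_LIST[i : i + MAX_PARALLELISM]
    for i in range(0, len(MYSQLTABLE_LIST), MAX_PARALLELISM)
]
print(f"{len(BATCHES)} batches of up to {MAX_PARALLELISM} parallel jobs")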

Step 8: Create JAR files and Upload to GCS

This step creates the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in step 3.1.
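
The upload part can be pictured with the google-cloud-storage client (a sketch; the local JAR path and object name below are hypothetical):

from google.cloud import storage

# Parse "gs://bucket[/folder]" into bucket name and prefix.
bucket_name, _, prefix = GCS_STAGING_LOCATION[len("gs://"):].partition("/")
client = storage.Client(project=PROJECT)
blob_name = (prefix + "/jars/dataproc-templates.jar").lstrip("/")
client.bucket(bucket_name).blob(blob_name).upload_from_filename(
    "target/dataproc-templates.jar"  # hypothetical locally built JAR
)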

Step 9: Execute Pipeline to Migrate tables from MySQL to Spanner

This step starts the Vertex AI pipeline execution for the MySQL table migration. A pipeline link is generated for each job. You can also see the jobs running in the Dataproc UI under the Batches section.
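
Conceptually, each pipeline job corresponds to a Dataproc Serverless Spark batch. A rough sketch of an equivalent direct submission (the template arguments here are illustrative, not the notebook's exact invocation):

from google.cloud import dataproc_v1

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)
batch = dataproc_v1.Batch(
    spark_batch=dataproc_v1.SparkBatch(
        main_class="com.google.cloud.dataproc.templates.main.DataProcTemplate",
        jar_file_uris=JARS,
        args=["--template", "JDBCTOGCS"],  # illustrative template arguments
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(subnetwork_uri=SUBNET)
    ),
)
operation = client.create_batch(
    parent=f"projects/{PROJECT}/locations/{REGION}", batch=batch
)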

Step 10: Get status for tables migrated from MySql to Spanner

This step gets the status of the jobs that were executed in step 9.
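
If you collect the operations returned by the submissions in step 9, checking them can look like this (the operations list is hypothetical):

operations = [operation]  # hypothetical: collected from the step 9 submissions
for op in operations:
    result = op.result()  # blocks until the batch succeeds, raises on failure
    print(result.name, result.state.name)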

Step 11: Validate row counts of migrated tables from MySQL to Cloud Spanner

This step helps validate the number of records of each table that is migrated from MySQL to Cloud Spanner.
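
A sketch of such a check with the Cloud Spanner client (assuming the MySQL engine from the step 4 sketch):

from sqlalchemy import text
from google.cloud import spanner

database = spanner.Client(project=PROJECT).instance(SPANNER_INSTANCE).database(SPANNER_DATABASE)
with engine.connect() as conn:
    for table in MYSQLTABLE_LIST:
        mysql_count = conn.execute(text(f"SELECT COUNT(1) FROM `{table}`")).scalar()
        with database.snapshot() as snapshot:
            spanner_count = list(snapshot.execute_sql(f"SELECT COUNT(1) FROM {table}"))[0][0]
        status = "OK" if mysql_count == spanner_count else "MISMATCH"
        print(f"{table}: mysql={mysql_count} spanner={spanner_count} {status}")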

For any queries/suggestions reach out to: dataproc-templates-support-external@googlegroups.com
