MySQL to Cloud Spanner: Migrate MySQL database to Cloud Spanner using Vertex AI notebooks and GCP Dataproc Serverless
Looking for a prebuilt solution to migrate MySQL tables to Cloud Spanner in parallel? Look no further. We have developed a Vertex AI (Jupyter) notebook solution that uses Dataproc Serverless for the migration. You just need to provide the necessary parameters, and the notebook will help you migrate a complete MySQL database to Cloud Spanner.
Dataproc Templates, using a Vertex AI notebook and Dataproc Serverless, provide a one-stop solution to migrate data directly from MySQL to Cloud Spanner.
1. Enable the below services in the GCP project from the API Console:
- Compute Engine API
- Dataproc API
- Vertex-AI API
- Vertex Notebooks API
2. Create a User-Managed Notebook in Vertex AI Workbench.
3. Clone the Dataproc Templates GitHub repo using the Git tab as shown in the below screenshot,
or open a terminal from the launcher window and clone it using git clone:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
4. From the folders tab, open the MySqlToSpanner_notebook.ipynb notebook present at the path: dataproc-templates/notebooks/mysql2spanner
5. The notebook contains a step-by-step process to help migrate data from MySQL to Cloud Spanner. Below are the steps for the migration:
Step 1: Install required packages
Some packages needed for the migration are not available in the notebook environment and need to be installed separately, e.g. pymysql, SQLAlchemy, the JDK, etc.
Step 2: Import Libraries: Import the required packages.
Step 3: Assign Parameters: The below configurations need to be set before running the notebook:
Step 3.1: Common Parameters
- PROJECT: GCP project id
- REGION: GCP region
- GCS_STAGING_LOCATION: GCS staging location to be used by this notebook to store artifacts
- SUBNET: VPC subnet
- JARS: list of JARs. For this notebook, the MySQL connector and Avro JARs are required in addition to the Dataproc Template JARs
- MAX_PARALLELISM: number of jobs to run in parallel; the default value is 2
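As a sketch, the notebook cell assigning these common parameters might look like the following; every value below (project id, bucket, subnet path, JAR names) is a hypothetical placeholder to be replaced with your own settings:

```python
# Hypothetical example values for the common parameters (step 3.1);
# replace each with your own project's settings.
PROJECT = "my-gcp-project"        # GCP project id
REGION = "us-central1"            # GCP region
GCS_STAGING_LOCATION = "gs://my-staging-bucket"  # notebook artifact staging
SUBNET = "projects/my-gcp-project/regions/us-central1/subnetworks/default"
JARS = [
    "gs://my-staging-bucket/jars/mysql-connector-java.jar",  # MySQL connector
    "gs://my-staging-bucket/jars/spark-avro.jar",            # Avro support
]
MAX_PARALLELISM = 2               # number of jobs to run in parallel
```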
Step 3.2: MySQL to Spanner Parameters
- MYSQL_HOST: MySQL instance IP address
- MYSQL_PORT: MySQL instance port
- MYSQL_USERNAME: MySQL username
- MYSQL_PASSWORD: MySQL password
- MYSQL_DATABASE: name of the database that you want to migrate
- MYSQLTABLE_LIST: list of tables you want to migrate, e.g. ['table1','table2']; otherwise provide an empty list to migrate the whole database, e.g. []
- MYSQL_OUTPUT_GCS_LOCATION: GCS location where the MySQL output will be written, e.g. "gs://bucket/[folder]"
- MYSQL_OUTPUT_GCS_MODE: output mode for MySQL data, one of (overwrite|append)
- MYSQL_OUTPUT_GCS_FORMAT: output file format for MySQL data, one of (avro|parquet|orc)
- SPANNER_INSTANCE: Cloud Spanner instance name
- SPANNER_DATABASE: Cloud Spanner database name
- SPANNER_TABLE_PRIMARY_KEYS: dictionary of the format {"table_name":"primary_key"} for tables which do not have a primary key in MySQL
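A sketch of the corresponding assignment cell, again with purely hypothetical values (host, credentials, database, and table names are placeholders):

```python
# Hypothetical example values for the MySQL to Spanner parameters (step 3.2).
MYSQL_HOST = "10.0.0.5"              # MySQL instance IP address
MYSQL_PORT = "3306"                  # MySQL instance port
MYSQL_USERNAME = "migration_user"
MYSQL_PASSWORD = "********"          # prefer a secret store over a literal
MYSQL_DATABASE = "sales_db"          # database to migrate
MYSQLTABLE_LIST = []                 # empty list -> migrate the whole database
MYSQL_OUTPUT_GCS_LOCATION = "gs://my-staging-bucket/mysql-export"
MYSQL_OUTPUT_GCS_MODE = "overwrite"  # one of: overwrite | append
MYSQL_OUTPUT_GCS_FORMAT = "avro"     # one of: avro | parquet | orc
SPANNER_INSTANCE = "my-spanner-instance"
SPANNER_DATABASE = "sales_db"
# Only needed for tables lacking a primary key in MySQL:
SPANNER_TABLE_PRIMARY_KEYS = {"audit_log": "log_id"}
```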
Step 3.3: Notebook Configuration Parameters
These parameters contain the paths to the JARs and properties required for the Dataproc job. They do not need to be changed unless you want to change the location of the JARs.
Step 4: Generate MySQL Table List
This step generates the list of tables for migration. All tables from the database are selected if the MYSQLTABLE_LIST parameter in step 3.2 is an empty list.
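The logic of this step can be sketched as below. Here `fetch_all_tables` is a hypothetical stand-in for the notebook's own database query (e.g. running "SHOW TABLES" over a pymysql/SQLAlchemy connection), not the notebook's actual helper:

```python
def get_table_list(mysql_table_list, fetch_all_tables):
    """Return the user-supplied table list; when it is empty, fall back
    to every table in the database (e.g. via "SHOW TABLES")."""
    if mysql_table_list:
        return list(mysql_table_list)
    return fetch_all_tables()
```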
Step 5: Get Primary Keys for tables not present in SPANNER_TABLE_PRIMARY_KEYS
This step fetches, from MySQL, the primary keys of the tables in the MYSQLTABLE_LIST parameter whose primary keys are not provided in the SPANNER_TABLE_PRIMARY_KEYS parameter in step 3.2.
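Sketched in code, the step merges the user-provided dictionary with keys looked up from MySQL; `lookup_mysql_pk` is a hypothetical stand-in for the notebook's query (e.g. against information_schema):

```python
def resolve_primary_keys(tables, spanner_table_primary_keys, lookup_mysql_pk):
    """For each table, keep the user-provided primary key if present;
    otherwise look it up in MySQL (e.g. from information_schema)."""
    resolved = dict(spanner_table_primary_keys)
    for table in tables:
        if table not in resolved:
            resolved[table] = lookup_mysql_pk(table)
    return resolved
```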
Step 6: Get Row Count of Tables and identify read partition column
This step uses the PARTITION_THRESHOLD parameter (default value 200k); any table having more rows than PARTITION_THRESHOLD will be partitioned, based on its primary key.
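The decision can be sketched as follows (a simplification, not the notebook's exact code):

```python
PARTITION_THRESHOLD = 200_000  # default from the notebook: 200k rows

def choose_read_partition(row_count, primary_key,
                          threshold=PARTITION_THRESHOLD):
    """Tables with more rows than the threshold are read in partitions
    on the primary key; smaller tables are read in a single pass."""
    if row_count > threshold:
        return primary_key  # use the primary key as the partition column
    return None             # no partitioning needed
```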
Step 7: Calculate Parallel Jobs for MySQL to Cloud Spanner
Based on the MAX_PARALLELISM parameter set in step 3.1, this step calculates the parallel jobs for the MySQL to GCS and GCS to Cloud Spanner migrations, i.e. the number of MySQL tables that will be migrated to Cloud Spanner in parallel.
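One way to picture this calculation (a sketch under the assumption that up to MAX_PARALLELISM tables are processed at a time, not the notebook's exact code) is splitting the table list into batches:

```python
def batch_tables(tables, max_parallelism):
    """Split the table list into batches; each batch of up to
    max_parallelism tables is migrated in parallel."""
    return [tables[i:i + max_parallelism]
            for i in range(0, len(tables), max_parallelism)]
```

For example, five tables with MAX_PARALLELISM of 2 would be processed as three batches.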
Step 8: Create JAR files and Upload to GCS
This step creates the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in step 3.1.
Step 9: Execute Pipeline to Migrate tables from MySQL to Spanner
This step starts the Vertex AI pipeline execution for the MySQL table migration. A pipeline link is generated for each job. You can also see the jobs running in the Dataproc UI, in the Batches section.
Step 10: Get status for tables migrated from MySQL to Spanner
This step gets the status of the jobs that were executed in step 9.
Step 11: Validate row counts of migrated tables from MySQL to Cloud Spanner
This step helps validate the number of records of each table that was migrated from MySQL to Cloud Spanner.
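The comparison at the heart of this validation can be sketched as follows, assuming per-table row counts have already been collected from both sides (the count queries themselves are omitted):

```python
def validate_row_counts(mysql_counts, spanner_counts):
    """Compare per-table row counts from source and target; return the
    tables whose counts do not match (an empty dict means success)."""
    return {
        table: {"mysql": count, "spanner": spanner_counts.get(table)}
        for table, count in mysql_counts.items()
        if spanner_counts.get(table) != count
    }
```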
For any queries/suggestions reach out to: dataproc-templates-support-external@googlegroups.com