MsSQL to BigQuery: Migrate efficiently using a Vertex AI notebook and GCP Dataproc Serverless

Raevat Kumar
Google Cloud - Community
4 min read · Feb 7, 2023

Do you wish to analyse your data thoroughly without interfering with SQL Server’s production configuration? In that case, the best course of action is to transfer the data from SQL Server to a reliable data warehouse such as Google BigQuery.

Dataproc Templates, in conjunction with a Vertex AI notebook and Dataproc Serverless, provide a one-stop solution for migrating data directly from an MsSQL database to GCP BigQuery.

This walkthrough also applies to other structured database sources, such as an Oracle or MySQL database, as we have developed several notebooks for different use cases. Dive into the Dataproc Templates repository to learn more.

Before we begin, here are some added benefits of this notebook:

  1. Automatically generates the list of tables from metadata. Alternatively, the user can supply a list of tables.
  2. Identifies the primary key column name and partitioned-read properties for each table.
  3. Automatically uses partitioned reads if a table exceeds the size threshold.
  4. Divides the migration into batches and migrates multiple tables in parallel.
  5. Lets you choose the write mode, i.e. append or overwrite.
  6. The BigQuery load automatically creates the table if it does not exist.

Set up a Vertex AI Notebook:

  1. Enable the following services in the GCP project from the API Console:
  • Compute Engine API
  • Dataproc API
  • Vertex-AI API
  • Vertex Notebooks API

2. Create a User-Managed Notebook in Vertex AI Workbench.

3. Clone the Dataproc Templates repository using the Git tab as shown in the image below,

or open a terminal from the Launcher window and clone it using git clone.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

4. From the folders tab, open the MsSql-to-BigQuery-notebook.ipynb notebook present at the path: dataproc-templates/notebooks/mssql2bq

5. The steps that follow are implemented in the Python notebook with sequential markings for the users’ convenience.

Step 1: Install required packages

Some of the packages needed for the migration need to be installed separately, as they are not available in the notebook environment by default, e.g. PySpark, JDK, etc.
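For reference, a typical install cell looks roughly like the sketch below. The package names and commands here are illustrative; follow the exact commands in the notebook.

# Illustrative only -- the notebook's own install cell is authoritative.
# Installs PySpark and a JDK inside the user-managed notebook environment.
!pip install pyspark
!sudo apt-get update && sudo apt-get install -y default-jdk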

Step 2: Import Libraries :

This notebook requires the MSSQL and POSTGRES connector JARs. Installation information is present in the notebook.
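As a minimal sketch, the import cell pulls in the Spark and Google Cloud client libraries along these lines; the template notebook's actual imports may differ.

# Minimal sketch of typical imports; the template notebook's import cell is authoritative.
from pyspark.sql import SparkSession
from google.cloud import aiplatform  # used to launch the Vertex AI pipeline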

Step 3: Set Google Cloud properties

The configurations below need to be set before running the notebook (a sample assignment cell is sketched after the parameter list):

Common Parameters

PROJECT : GCP project-id

REGION : GCP region

GCS_STAGING_LOCATION : GCS staging location used by this notebook to store artifacts

SUBNET : VPC subnet

JARS : List of JARs. For this notebook, the MsSQL and Spark BigQuery connector JARs are required in addition to the Dataproc Templates JARs

MAX_PARALLELISM : Number of jobs to run in parallel; the default value is 5

SQL Server Parameters

SQL_SERVER_HOST : SQL Server instance ip address

SQL_SERVER_PORT : SQL Server instance port

SQL_SERVER_USERNAME : SQL Server username

SQL_SERVER_PASSWORD : SQL Server password

SQL_SERVER_DATABASE : Name of database that you want to migrate

SQL_SERVER_TABLE_LIST : List of tables you want to migrate, e.g. ['schema.table1','schema.table2']. Otherwise, provide an empty list ([]) to migrate specific schemas or the whole database.

SQL_SERVER_SCHEMA_LIST : List of schemas. Use this if you’d like to migrate all tables associated with specific schemas, e.g. ['schema1','schema2']; otherwise, leave it as an empty list ([]). This comes in handy when you don’t want to list every table separately but would rather migrate all tables of a schema.

BigQuery Parameters

BIGQUERY_DATASET : BigQuery target dataset

BIGQUERY_MODE : Mode of operation at the target, append or overwrite
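For illustration, a sample assignment cell is sketched below. Every value is a placeholder (project, bucket, subnet, host, credentials, JAR paths, dataset); substitute your own environment details.

# Illustrative placeholder values only -- replace with your own environment details.
PROJECT = "my-gcp-project"
REGION = "us-central1"
GCS_STAGING_LOCATION = "gs://my-staging-bucket/mssql2bq"
SUBNET = "projects/my-gcp-project/regions/us-central1/subnetworks/default"
JARS = ["gs://my-staging-bucket/jars/mssql-jdbc.jar",
        "gs://my-staging-bucket/jars/spark-bigquery-connector.jar"]  # placeholder paths
MAX_PARALLELISM = 5

SQL_SERVER_HOST = "10.0.0.5"
SQL_SERVER_PORT = "1433"
SQL_SERVER_USERNAME = "sqlserver"
SQL_SERVER_PASSWORD = "<password>"
SQL_SERVER_DATABASE = "sales_db"
SQL_SERVER_TABLE_LIST = []            # empty: fall back to the schema list or the whole database
SQL_SERVER_SCHEMA_LIST = ["dbo"]

BIGQUERY_DATASET = "sales_dataset"
BIGQUERY_MODE = "overwrite"           # or "append"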

Step 4: Generate MsSQL Table List or MsSQL Schema List

This step generates the list of tables/schemas for migration. All tables in the database are selected if the SQL_SERVER_TABLE_LIST / SQL_SERVER_SCHEMA_LIST parameter is an empty list.
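Conceptually, the table list can be derived from SQL Server’s INFORMATION_SCHEMA, along the lines of the hedged sketch below. It assumes a SparkSession named spark with the MsSQL JDBC driver on its classpath; the template notebook’s actual client and query may differ.

# Hedged sketch: derive the table list from INFORMATION_SCHEMA when
# SQL_SERVER_TABLE_LIST is empty. Not the template's actual code.
jdbc_url = (
    f"jdbc:sqlserver://{SQL_SERVER_HOST}:{SQL_SERVER_PORT};"
    f"databaseName={SQL_SERVER_DATABASE}"
)
tables_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("user", SQL_SERVER_USERNAME)
    .option("password", SQL_SERVER_PASSWORD)
    .option("query",
            "SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS full_name "
            "FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'")
    .load()
)
SQL_SERVER_TABLE_LIST = [row.full_name for row in tables_df.collect()]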

Step 5: Calculate Parallel Jobs for MsSQL to BigQuery

This step calculates the parallel jobs for the MsSQL to GCS migration and the GCS to BigQuery migration, based on the MAX_PARALLELISM parameter, i.e. the number of MsSQL tables that will be migrated to BigQuery in parallel.
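The effect is roughly that of splitting the table list into batches of MAX_PARALLELISM tables, as in this hedged sketch; the notebook’s own batching logic may differ.

# Hedged sketch of batching: each batch of MAX_PARALLELISM tables is
# launched as one set of concurrent Dataproc Serverless jobs.
def make_batches(tables, batch_size):
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

table_batches = make_batches(SQL_SERVER_TABLE_LIST, MAX_PARALLELISM)
print(f"{len(SQL_SERVER_TABLE_LIST)} tables -> {len(table_batches)} batch(es)")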

Step 6: Create JAR files and Upload to GCS

This step creates the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in the prior steps.
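In notebook terms, the upload amounts to copying the built artifacts into the staging bucket, roughly as below. The JAR name here is a placeholder; the notebook builds and names the real artifacts.

# Illustrative only -- the JAR name and path are placeholders.
!gsutil cp target/dataproc-templates-*.jar {GCS_STAGING_LOCATION}/jars/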

Step 7: Execute Pipeline to Migrate tables from MsSql to BigQuery

This step starts the Vertex AI pipeline execution for the MsSql table migration. A pipeline link is generated for each job. You can also see the jobs running in the Batches section of the Dataproc UI.
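Under the hood, each Dataproc Serverless job performs a per-table move along the lines of this hedged sketch, reading over JDBC and writing with the Spark BigQuery connector. Option names follow the connectors’ public documentation; the template’s internal code may differ, and jdbc_url is reused from the earlier sketch.

# Hedged sketch of the per-table move each job performs; not the template's
# actual source code.
for table in SQL_SERVER_TABLE_LIST:
    source_df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .option("user", SQL_SERVER_USERNAME)
        .option("password", SQL_SERVER_PASSWORD)
        .load()
    )
    (
        source_df.write.format("bigquery")
        .option("table", f"{PROJECT}.{BIGQUERY_DATASET}.{table.split('.')[-1]}")
        .option("temporaryGcsBucket", "my-staging-bucket")  # placeholder bucket name
        .mode(BIGQUERY_MODE)
        .save()
    )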

Run all the cells one by one, or all at once using the option in the Menu bar. At the end, a link is generated for each table set, which can be used to monitor the jobs. Alternatively, detailed logs can be seen in the Dataproc Batches UI.

→ For any questions or recommendations, get in touch with dataproc-templates-support-external@googlegroups.com.

Until next time, take care & stay connected :)
