MsSQL to BigQuery: Migrate efficiently using a Vertex AI notebook and GCP Dataproc Serverless
Do you want to analyse your data thoroughly without interfering with SQL Server’s production configuration? In that case, the best course of action is to transfer the data from SQL Server to a reliable data warehouse, such as Google BigQuery.
Dataproc Templates, in conjunction with Vertex AI notebooks and Dataproc Serverless, provide a one-stop solution for migrating data directly from a MsSQL database to GCP BigQuery.
This approach also applies to other structured database sources, such as an Oracle or MySQL database, as we’ve developed several notebooks for different use cases. Dive into the Dataproc Templates repository to learn more.
Before we begin, here are some added benefits of this notebook:
- Automatically generates the list of tables from database metadata. Alternatively, the user can supply a list of tables.
- Identifies the current primary key column name and partitioned-read properties for each table.
- Automatically switches to partitioned reads when a table exceeds the size threshold.
- Divides the migration into batches and migrates multiple tables in parallel.
- Lets you choose the write mode, i.e. append or overwrite.
- The BigQuery load automatically creates the target table if it does not exist.
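To make the partitioned-read behaviour concrete, here is a minimal sketch (not the template’s actual code; the function name, threshold, and defaults are assumptions) of how Spark JDBC read options can be built so that large tables are read in parallel ranges over the primary key column:

```python
# Illustrative sketch: build Spark JDBC read options, switching to a
# partitioned read when the table's row count exceeds a chosen threshold.
def jdbc_read_options(jdbc_url, table, pk_column=None, row_count=0,
                      lower=0, upper=0, threshold=1_000_000, num_partitions=10):
    opts = {"url": jdbc_url, "dbtable": table,
            "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
    if pk_column and row_count > threshold:
        # Partitioned read: Spark issues num_partitions parallel range
        # queries over pk_column between lowerBound and upperBound.
        opts.update({"partitionColumn": pk_column,
                     "lowerBound": str(lower),
                     "upperBound": str(upper),
                     "numPartitions": str(num_partitions)})
    return opts
```

The resulting dictionary would be passed to `spark.read.format("jdbc").options(**opts).load()`; small tables fall back to a single-partition read.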
Set up a Vertex AI Notebook:
1. Enable the following services in your GCP project from the API Console:
- Compute Engine API
- Dataproc API
- Vertex-AI API
- Vertex Notebooks API
2. Create a User-Managed Notebook in Vertex AI Workbench.
3. Clone the Dataproc Templates repository using the Git tab as shown in the image below, or open a terminal from the launcher window and clone it with git clone:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
4. From the folders tab, open the MsSql-to-BigQuery-notebook.ipynb notebook at the path: dataproc-templates/notebooks/mssql2bq
5. The steps that follow are implemented in the Python notebook, with sequential markings for the user’s convenience.
Step 1: Install required packages
Some of the packages needed for the migration, e.g. PySpark and the JDK, are not preinstalled in the notebook environment and need to be installed separately.
Step 2: Import libraries
This notebook requires the MSSQL and POSTGRES connector JARs. Installation information is provided in the notebook.
Step 3: Set Google Cloud properties
The following configurations need to be set before running the notebook:
Common Parameters
PROJECT : GCP project-id
REGION : GCP region
GCS_STAGING_LOCATION : GCS staging location to be used for this notebook to store artifacts
SUBNET : VPC subnet
JARS : list of JAR files. For this notebook, the MSSQL and Spark BigQuery connector JARs are required in addition to the Dataproc Templates JARs
MAX_PARALLELISM : number of jobs to run in parallel (default: 5)
SQL Server Parameters
SQL_SERVER_HOST : SQL Server instance ip address
SQL_SERVER_PORT : SQL Server instance port
SQL_SERVER_USERNAME : SQL Server username
SQL_SERVER_PASSWORD : SQL Server password
SQL_SERVER_DATABASE : Name of database that you want to migrate
SQL_SERVER_TABLE_LIST : list of tables to migrate, e.g. ['schema.table1','schema.table2']. Provide an empty list ([]) to migrate specific schemas or the whole database instead.
SQL_SERVER_SCHEMA_LIST : list of schemas, e.g. ['schema1','schema2']. Use this if you'd like to migrate all tables associated with specific schemas rather than naming each table separately; otherwise leave it empty ([]).
BigQuery Parameters
BIGQUERY_DATASET : BigQuery Target Dataset
BIGQUERY_MODE : write mode at the target, append or overwrite
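Put together, the notebook’s configuration cell looks roughly like the sketch below; every value is a placeholder to be replaced with your own project settings.

```python
# Placeholder configuration for the migration notebook -- replace all values.
PROJECT = "my-project-id"
REGION = "us-central1"
GCS_STAGING_LOCATION = "gs://my-staging-bucket"
SUBNET = "projects/my-project-id/regions/us-central1/subnetworks/my-subnet"
MAX_PARALLELISM = 5                 # number of tables migrated in parallel

SQL_SERVER_HOST = "10.0.0.5"
SQL_SERVER_PORT = "1433"
SQL_SERVER_USERNAME = "sqluser"
SQL_SERVER_PASSWORD = "change-me"   # consider a secret store over plain text
SQL_SERVER_DATABASE = "sales"
SQL_SERVER_TABLE_LIST = []          # e.g. ['dbo.orders']; [] = use schema list / whole DB
SQL_SERVER_SCHEMA_LIST = []         # e.g. ['dbo']; [] = migrate the whole database

BIGQUERY_DATASET = "sales_dw"
BIGQUERY_MODE = "overwrite"         # or "append"
```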
Step 4: Generate MsSQL Table List or MsSQL Schema List
This step generates the list of tables/schemas to migrate. All tables in the database are selected if the SQL_SERVER_TABLE_LIST / SQL_SERVER_SCHEMA_LIST parameters are empty lists.
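One way this step can be expressed (a hedged sketch, not the template’s exact query or function name) is to read SQL Server’s INFORMATION_SCHEMA metadata, optionally filtered by the schema list; the helper below only builds the query string, and executing it requires a live connection:

```python
# Build a metadata query listing base tables as 'schema.table' names.
# When schema_list is empty, every base table in the database is returned.
def table_list_query(schema_list):
    base = ("SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS full_name "
            "FROM INFORMATION_SCHEMA.TABLES "
            "WHERE TABLE_TYPE = 'BASE TABLE'")
    if schema_list:
        quoted = ", ".join(f"'{s}'" for s in schema_list)
        base += f" AND TABLE_SCHEMA IN ({quoted})"
    return base
```

If SQL_SERVER_TABLE_LIST is non-empty, no metadata query is needed at all and the supplied list is used directly.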
Step 5: Calculate Parallel Jobs for MsSQL to BigQuery
This step calculates the parallel jobs for the MsSQL-to-GCS and GCS-to-BigQuery migrations based on the MAX_PARALLELISM parameter, i.e. the number of MsSQL tables that will be migrated to BigQuery in parallel.
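One plausible form of this calculation (a sketch under the assumption that tables within a batch run in parallel and batches run sequentially; the function name is hypothetical) is to split the table list into batches of at most MAX_PARALLELISM tables:

```python
import math

# Split the generated table list into batches so that at most
# max_parallelism Dataproc Serverless jobs run at the same time.
def build_batches(tables, max_parallelism):
    n_batches = math.ceil(len(tables) / max_parallelism)
    return [tables[i * max_parallelism:(i + 1) * max_parallelism]
            for i in range(n_batches)]
```

For example, 7 tables with MAX_PARALLELISM = 5 yields two batches: one of 5 tables and one of 2.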
Step 6: Create JAR files and Upload to GCS
This step creates the required JAR files and uploads them to the GCS_STAGING_LOCATION defined in the prior steps.
Step 7: Execute Pipeline to Migrate tables from MsSql to BigQuery
This step starts a Vertex AI pipeline execution for the MsSql table migration; a pipeline link is generated for each job, and the running jobs are also visible in the Batches section of the Dataproc UI.
Run the cells one by one, or all at once using the option in the menu bar. At the end, a link is generated for each table set which can be used to monitor the jobs; detailed logs are also available in the Dataproc Batches UI.
→ For any questions or recommendations, get in touch with dataproc-templates-support-external@googlegroups.com.
Until next time, take care & stay connected :)