SQL Server Data Movement with CDAP
SQL Server is one of the most popular and common relational databases systems used by enterprises for their wide ranging data storage, analysis and reporting needs. However data residing in SQL Server needs to be moved, copied and integrated with other data sources (sometimes residing in other databases that can be on either on-prem or Cloud) to connect siloed data sources together and drive faster path to insights. CDAP makes it extremely easy to copy or move data from SQL Server into other databases or systems.
In this article, we will show you how quickly and easily you can copy your SQL server data to a Cloud Data Warehouse product (BigQuery in this case). Using the same SQL Server source, you can also copy it into other cloud or on-prem sinks.
SQL Server setup
We will be using a SQL Server 2016 instance running on Google Cloud Platform for this exercise. You can review the instructions here to quickly setup a SQL Server instance on Google Cloud Platform. Once started, you can retrieve RDP information for the SQL Server 2016 instance from Google Cloud Platform console > Compute Engine > VM Instances
WIth your choice of Remote Desktop software, you can connect to the SQL Server instance with the username and password that you setup on the instance to connect to the SQL server database.
We will be using AdventureWorks database (a sample database provided by Microsoft to show how to design SQL Server databases) for this exercise. The instructions provided here can help you install AdventureWorks database from .bak file. Please note that we used Data Warehouse bak file (AdventureWorksDW2016.bak) for this exercise that has dimensions and fact tables under same schema (dbo).
We will be copying over Product dimension table (DimProduct) table over to BigQuery using CDAP.
CDAP is an integrated, open source application development platform that provides developers with data and application abstractions to simplify and accelerate application development, address a broader range of real-time and batch use cases, and deploy applications into production while satisfying enterprise requirements. You can get started with CDAP with one of these deployment options.
SQL Server Source Setup
Within CDAP instance, click on the top-left hamburger and go to “Wrangler”. This section allows you to wrangle with the data with directives so that you can apply basic transformations to the data. This section also allows you to build and test connections. Click on “Add Connection” and then upload SQL server database driver (You can see below 7.2.2 version was uploaded for this exercise).
You can then setup SQL Server connection and verify it using SQL server connection box. The format of JDBC connection string is: jdbc:sqlserver://<hostname>:<port>;databaseName=<database;user=<username>
Please also make sure the SQL Server instance is accepting connections at port 1433.
You can now browse datasets from the database right in the CDAP UI. You can view the data, wrangle it and gain insights into the columns. We will be using Product dimension from AdventureWorks database. Screenshot below shows how easily and quickly you can wrangle with SQL server data using CDAP’s Wrangler feature.
Once you have performed the needed wrangling, you can click on “Create a Pipeline” button to create a batch pipeline (The next screen will ask you to select the type of the pipeline and you will see only “Batch Pipeline” option being available as SQL Server database is a batch source).
The next step is to complete the pipeline with the BigQuery sink. Go to “Sink” section on the left-side panel of the Studio canvas and select “BigQuery” to be added to the pipeline.You will need to specify BigQuery properties (by clicking on “Properties” button on BigQuery sink) such as Reference name, service account file path, Dataset and Table where you would like to copy SQL Server data. You will need to add the data flow between Wrangler output and BigQuery input. Finally, you must give the pipeline a name and save it. The fully designed pipeline would like below:
CDAP also allows you to perform a Preview run of the pipeline to ensure that it runs as expected. Please note that the Preview run will only run with 100 rows and will not write data to the pipeline sink.
Now that everything looks good in the Preview run, we can deploy the pipeline and run it. Pipeline run would invoke an ephemeral Dataproc (Google Cloud’s managed Hadoop and Spark service), run the pipeline and tear down the Dataproc cluster..
You can verify that the data from Product dimension from SQL Server database has been copied over to BigQuery under AdventureWorksDW2016 dataset by either reviewing BigQuery dataset directly from within the CDAP Wrangler or visiting BigQuery UI.
In this exercise, you saw how intuitive and quick it was to setup the data movement pipeline with CDAP using SQL Server as the source and BigQuery as the sink. Using this exercise is just a starting point, you can leverage CDAP to copy/move data from multiple on-premise and Cloud sources to numerous on-prem/Cloud sinks. This multi-cloud and hybrid approach to data integration makes it really easy to build and deploy enterprise data pipelines.
I invite you to explore additional sources and sinks within CDAP, and learn more about CDAP’s other powerful features. Stay tuned for more of similar solutions where we will show you how to leverage other sources and sinks along with few powerful transforms. Also try out the managed version of CDAP — Cloud Data Fusion on Google Cloud Platform if you would like to take advantage of CDAP, without any deployment or management effort.