Azure Synapse Analytics — Data migration

Eleonora Fontana · Published in CodeX · Apr 21, 2021

This article was written by the Betacom Analytics and Innovation team. Betacom is a company based in Italy and Germany which operates in the IT field using innovative technologies, digital solutions and cutting-edge programming methodologies. You can find out more on our website.

Introduction

In this article we will discuss how to migrate data to Azure Synapse Analytics. We assume you are already familiar with the Azure Synapse Analytics service. If not, please refer to our previous articles: Introduction to Azure Synapse Analytics and Dedicated SQL pools in Azure Synapse Analytics.

Overview

Azure Synapse Analytics uses threads to read data in parallel (DW Readers) and send it to the writer threads (DW Writers).

The first thing you should be aware of is that your DWUs (Data Warehouse Units) have a direct impact on how fast you can load data in parallel: the higher the DWU level of your dedicated SQL pool, the more DW Readers and DW Writers are available.
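As a quick check before a large load, you can verify which DWU level your pool is currently running at. Below is a minimal sketch using the sys.database_service_objectives catalog view, run while connected to the master database of your logical server; the pool name is a placeholder.

SELECT db.name             AS [Database]
,      ds.service_objective AS [Service Objective]   -- e.g. DW1000c
FROM sys.database_service_objectives ds
JOIN sys.databases db ON ds.database_id = db.database_id
WHERE db.name = 'mydedicatedpool';   -- replace with your pool name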

Let’s see some best practices you should follow.

  • Optimize the insert batch size. The ideal batch size is around 1 million rows, since batches of that size compress well and fill a columnstore segment (i.e. the storage unit of a columnstore index).
  • Avoid loading pre-sorted data: data ordered by the distribution key can introduce hotspots that slow down the load operation.
  • Stage incoming data in temporary heap tables created with “CREATE TABLE #temp_table”.
  • Use “CREATE TABLE AS SELECT” (CTAS): it is minimally logged, runs fully in parallel, and lets you change distribution, table type and partitioning in a single step (see the sketch after this list).
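Here is a minimal sketch of the last two practices, with purely hypothetical table and column names: data is first landed in a heap temp table, then moved into its final shape with CTAS, which also changes the distribution.

-- Hypothetical example: stage rows in a heap temp table, then reshape with CTAS.
CREATE TABLE #stage_sales
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS SELECT * FROM [stage].[sales_raw];   -- hypothetical source table

-- CTAS into the final hash-distributed, clustered columnstore table.
CREATE TABLE [prod].[sales]
WITH ( DISTRIBUTION = HASH([customer_id]), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM #stage_sales;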

Load methods

The two main kinds of loading methods are Single Client and Parallel Readers.

The Single Client loading methods use tools such as SSIS (SQL Server Integration Services), Azure Data Factory and BCP (Bulk Copy Program). Be aware that with these tools all the data flows through the control node, so adding more clients in parallel only creates a bottleneck there. For this reason they should not be used for large loads.

Source: Dedicated SQL pool (formerly SQL DW) architecture — Azure Synapse Analytics

The Parallel Readers loading methods use PolyBase to read from an external source (Azure Blob Storage or Azure Data Lake Store) and load into the dedicated SQL pool (formerly Azure SQL DW). PolyBase bypasses the control node and loads data directly into the compute nodes. It can load data from UTF-8 delimited text files and from popular Hadoop file formats such as RCFile, ORC, and Parquet.

Source: Dedicated SQL pool (formerly SQL DW) architecture — Azure Synapse Analytics

PolyBase demo

In this section we will discuss how to load a dimension table from an on-premises SQL Server instance (via SQL Server Management Studio) to an Azure Synapse dedicated SQL pool using PolyBase.

The steps to migrate data with PolyBase will be the following:

  1. Export the table from SSMS into a flat file;
  2. Create a blob storage account;
  3. Load the flat file into the blob storage;
  4. Set up PolyBase;
  5. Monitor and confirm the migration success;
  6. Confirm the 60 distributions in the final table.

Let’s see them in detail.

Export the table into a flat file (see also Steps in the SQL Server Import and Export Wizard)

  1. Log into SQL Server Management Studio.
  2. Right click on the database, go to Tasks and select Export Data. This will open the SQL Server Import and Export Wizard.
  3. Select “SQL Server Native Client” as Data Source and check that the Server name and the database are the correct ones.
  4. Click on Next, choose “Flat File Destination” as your destination and choose your file name. Be sure to check or uncheck “Column names in the first data row” according to your data.
  5. Click on Next and select the copy option you need. If you want to filter the data while exporting it, choose the “Write a query to specify the data to transfer” option (a short example follows this list).
  6. Click on Next and select the table you want to export. Leave the other settings to the default ones.
  7. Click on Next and select “Run immediately” if you only need to export the table once. Otherwise you could save the export as an SSIS package and schedule it, which is not covered in this tutorial. Click on Next and then Finish.
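For reference, the query you type in the “Write a query” option is plain T-SQL. A hypothetical example, filtering a dimension table before export:

-- Hypothetical filter query for the "Write a query" option of the wizard.
SELECT *
FROM dbo.DimCustomer
WHERE CountryRegionCode = 'IT';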

Create a blob storage account
All the steps are available at Create a storage account — Azure Storage. They can be summarized as follows:

  1. Sign in to the Azure portal;
  2. On the Azure portal menu, select All services and then in the list of resources select Storage Accounts;
  3. Choose Add;
  4. Fill in the creation form according to your needs and select Create.

Load the flat file into the blob storage
There are two ways to do so: upload the file directly into a new container from the Azure portal, or use Azure Storage Explorer, a free tool from Microsoft. A tutorial on loading the flat file via Storage Explorer is available in the Microsoft documentation.

Set up PolyBase
Connect to the dedicated SQL pool and follow the steps below. For convenience, they are also available in the Microsoft documentation.

  1. Create a master key by executing CREATE MASTER KEY; GO. The master key is required to encrypt the credential secret in the next step.
  2. Create a database scoped credential. Fill in the query by inserting any string as “IDENTITY” (it is not used for authentication to Azure storage) and your Azure storage account key as “SECRET”.
    CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
    WITH
    IDENTITY = '',
    SECRET = '' ;
    GO
  3. Create an external data source. In the “LOCATION” string insert ‘wasbs://CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net’, where CONTAINER_NAME and STORAGE_ACCOUNT_NAME are your container and storage account names respectively. Also note that PolyBase uses Hadoop APIs to access data. A filled-in example with hypothetical names is shown after these steps.
    CREATE EXTERNAL DATA SOURCE AzureBlobStorage
    WITH (
    TYPE = HADOOP,
    LOCATION = '',
    CREDENTIAL = BlobStorageCredential
    );
  4. Create an external file format. Execute the following query, where FIELD_TERMINATOR marks the end of each field (column) in a delimited text file, STRING_DELIMITER specifies the field terminator for data of type string in the text-delimited file, DATE_FORMAT specifies a custom format for all date and time data that might appear in a delimited text file, and USE_TYPE_DEFAULT determines whether missing values are stored as the default value for their data type.
    CREATE EXTERNAL FILE FORMAT CSVFileFormat
    WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    ( FIELD_TERMINATOR = ','
    , STRING_DELIMITER = ''
    , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss'
    , USE_TYPE_DEFAULT = FALSE )
    );
    GO
  5. Create an external table. Here’s the meaning of each field in the following query:
    LOCATION is the folder or file path under the external data source root,
    DATA_SOURCE specifies which data source object to use,
    FILE_FORMAT specifies which file format object to use,
    REJECT_TYPE specifies how rejected rows are counted (as an absolute VALUE or as a PERCENTAGE of the total),
    REJECT_VALUE sets the reject threshold based on the reject type.
    Now insert your location and the columns of your table and execute the query:
    CREATE SCHEMA [stage];
    GO
    CREATE EXTERNAL TABLE [stage].[tableName] (
    [columnName] [columnType] NULL,   -- or NOT NULL
    ...
    )
    WITH
    ( LOCATION = ''
    , DATA_SOURCE = AzureBlobStorage
    , FILE_FORMAT = CSVFileFormat
    , REJECT_TYPE = VALUE
    , REJECT_VALUE = 0
    );
    GO
  6. Load from the external table. CTAS (CREATE TABLE AS SELECT) creates a new table and populates it with the results of a select statement. The new table has the same columns and data types as the results of the select statement. If you select all the columns from an external table, the new table is a replica of the columns and data types in the external table. Execute the following query, replacing “tableName” with the name of your table and “hashColumn” with the column used for the hash distribution.
    CREATE SCHEMA [prod];
    GO
    CREATE TABLE [prod].[tableName]
    WITH ( DISTRIBUTION = HASH([hashColumn]) )
    AS SELECT * FROM [stage].[tableName]
    OPTION ( LABEL = 'Load [prod].[tableName]' );
  7. By default, tables are defined as a clustered columnstore index.
    After a load completes, some of the data rows might not be compressed into the columnstore. To optimize query performance and columnstore compression after a load, rebuild the table to force the columnstore index to compress all the rows: ALTER INDEX ALL ON [prod].[tableName] REBUILD;
  8. Verify number of rows: SELECT COUNT(1) FROM [prod].[tableName];
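To make the placeholders in steps 3 and 5 concrete, here is how they might look with hypothetical names: a container called dimension-files on a storage account called betacomdemo, a flat file exported as DimCustomer.txt, and a few illustrative customer columns. This is a sketch, not the exact schema of your table.

-- Hypothetical filled-in version of steps 3 and 5.
CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://dimension-files@betacomdemo.blob.core.windows.net',
CREDENTIAL = BlobStorageCredential
);
GO

CREATE EXTERNAL TABLE [stage].[DimCustomer] (
[CustomerKey] INT NOT NULL,
[FirstName] NVARCHAR(50) NULL,
[LastName] NVARCHAR(50) NULL
)
WITH
( LOCATION = '/DimCustomer.txt'
, DATA_SOURCE = AzureBlobStorage
, FILE_FORMAT = CSVFileFormat
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
);
GO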

Monitor and confirm the migration success
While loading from the external table in the previous step, you can execute the following query to monitor the migration.

SELECT * 
FROM sys.dm_pdw_exec_requests r
JOIN sys.dm_pdw_dms_workers w on r.request_id = w.request_id
WHERE r.[label] = 'Load [prod].[tableName]'
ORDER BY w.start_time desc;

In the query result you will find a type column identifying the writer threads, since the query refers to the loading operation, and there are exactly 60 rows, corresponding to the 60 writers.
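If the full result set is too noisy, a more compact variant along these lines summarizes progress per worker type, using the rows_processed and bytes_processed columns documented for sys.dm_pdw_dms_workers (the label is the same hypothetical one used above):

-- Summarize load progress per DMS worker type for the labeled request.
SELECT w.[type]
, COUNT(*) AS workers
, SUM(w.rows_processed) AS rows_processed
, SUM(w.bytes_processed) AS bytes_processed
FROM sys.dm_pdw_exec_requests r
JOIN sys.dm_pdw_dms_workers w ON r.request_id = w.request_id
WHERE r.[label] = 'Load [prod].[tableName]'
GROUP BY w.[type];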

Confirm the 60 distributions in the final table
We can verify that the data was spread across the 60 distributions with DBCC PDW_SHOWSPACEUSED('prod.tableName'). The command returns exactly 60 rows, one per distribution, showing how many rows of the table each distribution holds.

Conclusion

By now you should be able to migrate tables from an on-premises SQL Server to a dedicated SQL pool in Azure Synapse Analytics. We recommend following the other tutorials available on the Microsoft website.

Thank you for reading 🙏 and subscribe to the Betacom publication 💡
