Azure Synapse Data Load using Polybase or Copy Command from VNET protected Azure Storage

Inderjit Rana
Microsoft Azure
Published in
8 min readJul 27, 2020

Polybase and Copy Command are two most prominent methods for performing high throughput loads from Azure Storage to Azure Synapse. In case you are new to these two methods please review at least couple sections from my previous post Practical Tips on Polybase data load to Azure Synapse. In this post I will share detailed instructions for loading data from a VNET protected (VNET Service Endpoint as well as Private Endpoint) Azure Storage account to Azure Synapse using Polybase and Copy Command. The information already exists in the Azure documentation but somewhat spread around and my goal was to consolidate the access patterns as well as provide a full end to end example with test scripts. The load method described here is relevant when sqlcmd or SQL Server Management Studio is used to perform the load, if you are planning using Azure Data Factory (ADF) I would highly encourage you to read the new Managed VNET for ADF feature which was very recently released in Preview. This post is very similar to one of my previous posts Firewall Protected Azure Storage and Azure SQL Database — Load Data using T-SQL BULK INSERT command except the following differences:

  • Azure Synapse (formerly Azure SQL Data Warehouse) is the destination for data load and not Azure SQL Database
  • Data Source is Azure Data Lake Gen 2 (ADLS Gen2) which is specific configuration for Azure Storage (in case you are new to this, when creating Azure Storage Account select Enabled for “Hierarchical Namespace” setting)
  • Polybase and Copy Command for data loading used rather than the BULK Insert command
  • Covers both VNET Service Endpoint and Private Endpoint methods of network protection

Please note that this post is for Azure Synapse and not the new Azure Synapse Workspace which is in Preview at this time.

Background

It’s a very common requirement especially among customers in Regulated Industry that the Azure Storage and Azure Synapse are setup with strict network controls. There are two methods to setup these network controls for PaaS Services (Azure Synapse and Azure Storage in this case):

  1. VNET Service Endpoint — This has been around for a while and works by explicitly whitelisting (allow) connections to the PaaS resource from within the customer owned VNETs in Azure or specific list of IP Ranges (commonly NAT IP Ranges for the Corporate Data Center). This method still relies on Public DNS Name hence Public IP of the resource.
  2. Private Endpoint — This is relatively new and effectively brings the PaaS resources (Azure Synapse and Azure Storage in this case) inside a customer owned VNET by assigning a Private IP Address to the PaaS resource.

The Private Endpoint option is considered superior but it does add complexity as well as some extra costs. It does require DNS setup, in case where resources accessing storage or Synapse are in Azure this is somewhat simplified but for on-premises access you would need help from the networking teams. I am documenting the details for both VNET Service Endpoint as well as Private Endpoint options. If you are set on one option to use you can jump to the relevant section.

Architecture Diagrams and Networking Settings — VNET Service Endpoint

The following diagrams shows the interaction between different components. Its good to point out that Azure VM for load command execution is one option but any other resource like Azure Container Instance running inside a VNET should work too.

VNET Service Endpoint — Load Command execution from within Azure
VNET Service Endpoint — Load Command execution from within Azure

The highlighted settings in the above diagram are important to understand and apply correctly.

ADLS Gen2 Network Settings

  • Make the selection Allow access from Selected Networks
  • VNET and the Subnet used for the Virtual Machine (VM) will need to be allowed on the Azure Storage.
  • Allow Trusted Microsoft services to access this storage account must be checked.
ADLS Gen2 Network Settings

Azure Synapse Network Settings

Please review relevant sections from my article Azure SQL network settings (Private Link, VNET Service Endpoint) and Azure Data Factory where I explained in detail the meaning and implications of these settings.

  • Set Deny public network access to No
  • Set Allow Azure Services and other resources to access this server to No
  • VNET and the Subnet used for the Virtual Machine (VM) will need to be allowed on the Azure Storage.
Azure Synapse Network Settings

Command execution from on-premises

For the sake of completeness I would like to point out that its not necessary to run the load from Azure VM, its possible to run the command from on-premises as well. The only difference would be that instead of allowing specific VNET on the Azure Storage or Azure Synapse you will need to either allow Corporate Data Center IP Ranges or setup an Express Route with Public Peering Configuration. The following diagram shows the architecture of the setup where commands are executed from on-premises, connectivity on Port 1433 must be allowed for this to work from on-premises.

VNET Service Endpoint/IP Firewall — Load Command execution from on-premises
VNET Service Endpoint/IP Firewall — Load Command Execution from on-premises

Architecture Diagrams and Networking Settings — Private Endpoint

The following diagram shows interaction between different components. Its good to point out that Azure VM for load command execution is one option but any other resource like Azure Container Instance running inside a VNET should work too.

Private Endpoint — Load Command execution from within Azure

The highlighted settings in the diagram above are important to understand and apply correctly. The following screenshots show these settings applied to ADLS Gen2 and Azure Synapse on Azure Portal.

ADLS Gen2 Network Settings

  • Make the selection Allow access from Selected Networks
  • Allow Trusted Microsoft services to access this storage account must be checked. [Yes, this is still need even when using Private Endpoints]
  • Create Private Endpoint for ADLS Gen2, when creating the Private Endpoint for Azure Storage account make sure you are selecting “dfs” as the Target Sub-resource .
ADLS Gen2 — Firewall and Virtual Networks
ADLS Gen2 — Private Endpoint
ADLS Gen2 Private Endpoint Target sub-resource “dfs”

Azure Synapse Network Settings

  • Deny Public Network Access set to Yes
  • Allow Azure Services and resources to access this server set to No
  • Private Endpoint added for Azure Synapse
Azure Synapse — Firewall and virtual networks
Azure Synapse — Private Endpoint Connections

Command Execution from on-premises

For the sake of completeness I would like to point out that its not necessary to run the load from Azure VM, its possible to run the command from on-premises as well. The architecture diagram is shown below and the difference is that a VPN or Express Route Private Peering connection would be required.

Private Endpoint — Load Command execution from On-Premises

Test Scripts

Irrespective of whether VNET Service Endpoint or Private Endpoint are used for applying network controls the steps to enable load using Polybase or Copy Command are pretty much the same except some minor differences.

  1. Enable Managed Identity on Azure Synapse, you will need to use Azure CLI or Azure Powershell step as there is no way to perform this step on Azure Portal at this time.
az login
az account set -s <subscriptionid>
az sql server update -g <resorucegroupname> -n <synpaselogicalservername> -i

2. Grant Storage Blob Data Contributor role to Azure Synapse Identity on Azure Storage — From the Azure Portal, navigate to your storage account, navigate to Access Control (IAM), and select Add Role Assignment and then add Storage Blob Data Contributor role to the Managed Identity created for Azure Synapse in Step 1 (an Azure AD Identity for Azure Synapse gets created in Azure AD when Managed Identity for Azure Synapse is created).

3. Create destination table in Azure Synapse, in case of Polybase this step is optional since CTAS can be used you to create the table during load.

CREATE TABLE [dbo].[airports]
(
[airport_id] [int] NOT NULL,
[city] [nvarchar](500) NOT NULL,
[state] [nvarchar](255) NOT NULL,
[name] [nvarchar](255) NOT NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
HEAP
)

4. Create CSV file and load it to Azure Storage Account container(please be mindful of Endof Line characters when you copy/paste, I used Notepad++ on a Windows machine so used CRLF as the End of Line)

airport_id,city,state,name
10165,Adak Island,AK,Adak
10299,Anchorage,AK,Ted Stevens Anchorage International
10304,Aniak,AK,Aniak Airport
10754,Barrow,AK,Wiley Post/Will Rogers Memorial
10551,Bethel,AK,Bethel Airport
10926,Cordova,MK,Merle K Mudhole Smith

5. Run the commands to perform the load using the desired option below. User executing the commands needs to have appropriate permissions for Polybase or Copy Command whichever is used (please refer to my previous post Practical Tips on Polybase data load to Azure Synapse as it has details around permissions for these commands).

Load using Polybase

  1. Create the dependencies needed by Polybase command (please replace the placeholders like storage account name, container, file name, etc. denoted using <>). Its important to point out that the Identity value of “Managed Service Identity” in the statement CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = ‘Managed Service Identity’ need to be spelled exactly this way.
create master key encryption by password = '<password>';
go
CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';
GO
CREATE EXTERNAL DATA SOURCE [AzureDataLakeStorage] WITH (TYPE = HADOOP, LOCATION = N'abfss://<stgcontainername>@<stgacctname>.dfs.core.windows.net', CREDENTIAL = [msi_cred]);
GO
CREATE EXTERNAL FILE FORMAT [AirportsCsvFileFormat] WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = N',', FIRST_ROW = 2, USE_TYPE_DEFAULT = False));
GO
CREATE EXTERNAL TABLE [dbo].[Airports_external]([airport_id] [int] NOT NULL, [city] [nvarchar](500) NOT NULL, [state] [nvarchar](255) NOT NULL,[name] [nvarchar](255) NOT NULL) WITH (DATA_SOURCE = [AzureDataLakeStorage],LOCATION = N'/<CSVFileName>',FILE_FORMAT = [AirportsCsvFileFormat],REJECT_TYPE = VALUE,REJECT_VALUE = 0);
GO

2. Load the data from External Table to Azure Synapse Table, the script below creates the airports table but if you pre-created the table then use INSERT INTO rather than CTAS

Create table [dbo].[airports] with (heap, distribution=ROUND_ROBIN) as select * from [dbo].[airports_external];
GO

Load using Copy Command

The destination table must exist. perform the load using the following command (please replace the placeholders like storage account name, container, file name, etc. denoted using <>). Its important to point out that the Identity value of “Managed Identity” in the statement below needs to be spelled exactly this way.

copy into dbo.airports
from 'https://<stgacctname>.dfs.core.windows.net/<stgcontainername>/<CSVFileName>'
with (
file_type = 'csv',
credential = (Identity = 'Managed Identity'),
FIRSTROW = 2
)

--

--