Seamless migration to Snowflake using ADF Script Activity + Schema detection

Recently, I have been working with a few customers on their migration from Azure Synapse Analytics to Snowflake, with the aim of reducing maintenance and administration effort for their teams. The core component in their existing data pipeline framework is Azure Data Factory (ADF).

The standard way of moving data from source to target using ADF leverages a copy activity. However, the copy activity with Snowflake as the sink has a few limitations.

To use Direct Copy to Snowflake, the sources are limited to Parquet, delimited text, or JSON. Therefore, for sources such as SQL Server, PostgreSQL or Azure Synapse Analytics, an interim Azure Blob storage account is required for Staged Copy. This approach provides better throughput as it leverages the COPY INTO command for batch loading into Snowflake.

In order to use the staged copy, the linked service for the interim Azure Blob storage requires authentication. The only supported authentication is shared access signature (SAS URI), which in theory allows you to specify the time span and permissions allowed for access to a storage resource such as a blob or container.

For organisations using Azure service principals as the preferred way of managing identity across applications, hosted services and Azure resources, the thought of regenerating SAS tokens again and again to ensure connectivity does not expire adds administration overhead.

The other limitation of the copy activity is the added task of creating object DDLs in Snowflake before the copy activity can run successfully. When organisations have hundreds of tables and views, this process requires DDL translation between T-SQL and ANSI SQL, which is tedious and mundane to say the least!
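To give a feel for the kind of translation involved, here is a hypothetical, abbreviated example (not the full AdventureWorks definition): a Synapse dedicated SQL pool table and the hand-written Snowflake equivalent you would otherwise have to produce for every object.

-- T-SQL definition in Azure Synapse Analytics (dedicated SQL pool), abbreviated.
CREATE TABLE SalesLT.Address
(
    AddressID    INT IDENTITY(1,1) NOT NULL,
    AddressLine1 NVARCHAR(60)      NOT NULL,
    City         NVARCHAR(30)      NOT NULL,
    ModifiedDate DATETIME          NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

-- Hand-translated Snowflake equivalent: distribution and index options dropped,
-- NVARCHAR and DATETIME mapped to Snowflake types.
CREATE TABLE SALESLT.ADDRESS
(
    ADDRESSID    INT IDENTITY(1,1) NOT NULL,
    ADDRESSLINE1 VARCHAR(60)       NOT NULL,
    CITY         VARCHAR(30)       NOT NULL,
    MODIFIEDDATE TIMESTAMP_NTZ     NOT NULL
);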

With that being said, an opportunity for enhancement presented itself with 2 key feature announcements.

  1. New ADF Script Activity (courtesy of Regan, thank you for the recommendation) — https://techcommunity.microsoft.com/t5/azure-data-factory-blog/execute-sql-statements-using-the-new-script-activity-in-azure/ba-p/3239969
  2. Snowflake Schema Detection — https://www.snowflake.com/blog/schema-detection-public-preview/

Now, I'll show you how the following ADF pipeline can be executed without the need for DDL creation or SAS token authentication for Azure Storage.

Before implementing the ADF pipeline, here are a few prerequisites:

In the Snowflake account, set up a database, a target schema, a compute warehouse for loading data and a Parquet file format.

USE ROLE SYSADMIN;
CREATE OR REPLACE DATABASE ADVENTUREWORK;
CREATE SCHEMA ADVENTUREWORK.SALESLT;
CREATE WAREHOUSE LOAD_WH INITIALLY_SUSPENDED = TRUE AUTO_SUSPEND = 60;
CREATE FILE FORMAT ADVENTUREWORK.SALESLT.MY_PARQUET_FORMAT TYPE = PARQUET;
  • Azure Storage (ADLS Gen2 or Blob Storage) created in your Azure tenant. If you are already using Azure Synapse Analytics, the primary ADLS Gen2 account can be used.
  • Storage integration between the Azure Storage and Snowflake (external stage). A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage; a sketch of the setup follows this list.
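If the integration and stage are not set up yet, the snippet below is a minimal sketch of what they can look like. The integration name AZURE_INT and the tenant ID, storage account and container values are placeholders for your own environment; the stage is created as STAGEDCOPY in ADVENTUREWORK.SALESLT so that it matches the '@stagedcopy' reference used later in the pipeline.

-- Storage integrations are typically created with ACCOUNTADMIN (or a role granted CREATE INTEGRATION).
USE ROLE ACCOUNTADMIN;
CREATE STORAGE INTEGRATION AZURE_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<your-azure-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://<storageaccount>.blob.core.windows.net/<container>/');

-- Returns the AZURE_CONSENT_URL used to grant Snowflake access to the storage account.
DESC STORAGE INTEGRATION AZURE_INT;

GRANT USAGE ON INTEGRATION AZURE_INT TO ROLE SYSADMIN;

USE ROLE SYSADMIN;
CREATE STAGE ADVENTUREWORK.SALESLT.STAGEDCOPY
  STORAGE_INTEGRATION = AZURE_INT
  URL = 'azure://<storageaccount>.blob.core.windows.net/<container>/';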

ADF Pipeline

ADF pipeline — Copy Activity, Script Activity, Script Activity
  1. Copy Activity

Configure the linked service for the source (Azure SQL or an Azure Synapse Analytics SQL pool) with the sink as Parquet on the Azure Storage account (ADLS Gen2 or Blob Storage). This activity copies the SalesLT.Address table into the storage account as a parquet file.

Source — AdventureWorks.SalesLT.Address
Sink — SalesLT.Address.parquet

2. Script Activity (CreateDDL)

The new script activity is very powerful as it enables native push-down queries to be executed on the sink (Snowflake). This leverages Snowflake compute (a virtual warehouse), which is consumption based (pay for what you use).

Since we already have the storage integration configured between Snowflake and ADLS Gen2, with the external stage named ‘@stagedcopy’, I can script the CREATE TABLE command against the staged parquet file created in step 1.

Script activity — Schema Detection
USE ROLE SYSADMIN;
USE DATABASE ADVENTUREWORK;
USE SCHEMA SALESLT;
USE WAREHOUSE LOAD_WH;
CREATE OR REPLACE TABLE ADDRESS
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@stagedcopy/address',
        FILE_FORMAT => 'my_parquet_format'
      )
    )
  );

The query uses Snowflake’s table function INFER_SCHEMA to automatically detect the column definitions (the file metadata schema) from the staged files in the external stage. CREATE TABLE .. USING TEMPLATE expands upon Snowflake’s CREATE TABLE functionality to automatically create a structured table using the detected schema, with no additional input.
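If you want to see what schema detection returns before creating the table, INFER_SCHEMA can also be queried directly. The snippet below is an optional sanity check, not part of the pipeline, and assumes the same stage path, file format and database/schema context as above.

-- Preview the column names, data types and nullability detected from the staged parquet file.
SELECT COLUMN_NAME, TYPE, NULLABLE
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@stagedcopy/address',
    FILE_FORMAT => 'my_parquet_format'
  )
);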

3. Script Activity (CopyIntoSnowflake)

This step executes a COPY INTO command, which loads the SalesLT.Address parquet file from the external stage into the ADDRESS table created in step 2.

Script activity — COPY INTO Statement
USE ROLE SYSADMIN;
USE DATABASE ADVENTUREWORK;
USE SCHEMA SALESLT;
USE WAREHOUSE LOAD_WH;
COPY INTO SALESLT.ADDRESS
FROM '@stagedcopy/address'
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
ON_ERROR = 'ABORT_STATEMENT'
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
PURGE = TRUE;

The COPY INTO statement requires MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE because INFER_SCHEMA does not necessarily return the columns in the same order as they appear in the source file.

In addition, I've included the optional parameters ON_ERROR = 'ABORT_STATEMENT', which aborts the load if an error is encountered, and PURGE = TRUE, which deletes the parquet files from the external stage after a successful load (the clean-up process is automated for you).
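As an optional sanity check after the pipeline runs (not part of the pipeline itself, and the one-hour window below is an arbitrary assumption), you can confirm the row count and review the load history for the table.

-- Row count of the newly loaded table.
SELECT COUNT(*) AS LOADED_ROWS FROM SALESLT.ADDRESS;

-- Load history for the table: file name, status, rows loaded and any first error.
SELECT FILE_NAME, STATUS, ROW_COUNT, FIRST_ERROR_MESSAGE
FROM TABLE(ADVENTUREWORK.INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'SALESLT.ADDRESS',
  START_TIME => DATEADD('hour', -1, CURRENT_TIMESTAMP())
));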

End to end pipeline

Summary

To expand on this seamless pipeline, you can set up a config table at the source that contains the list of schema and table names you want to migrate across to Snowflake.

In ADF, you can set up a Lookup activity that returns the list of table names from the config table; the list can then be passed into a ForEach activity as parameters. The parameters are used as dynamic input values to the above ADF pipeline (copy activity, script activities) for the source and target table names. A sketch of such a config table is shown below.
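The table and column names here are purely illustrative (a hypothetical dbo.MigrationConfig in the source database); any two-column list of schema and table names that the Lookup activity can query will do.

-- Hypothetical source-side config table that drives the Lookup -> ForEach pattern.
CREATE TABLE dbo.MigrationConfig
(
    SchemaName VARCHAR(128) NOT NULL,
    TableName  VARCHAR(128) NOT NULL
);

INSERT INTO dbo.MigrationConfig VALUES ('SalesLT', 'Address');
INSERT INTO dbo.MigrationConfig VALUES ('SalesLT', 'Customer');
INSERT INTO dbo.MigrationConfig VALUES ('SalesLT', 'Product');

-- Query used by the ADF Lookup activity; each returned row becomes one ForEach iteration.
SELECT SchemaName, TableName
FROM dbo.MigrationConfig;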

This gives you a metadata-driven framework to support your migration process for most relational data sources, whether on-premises (via the ADF integration runtime) or in the cloud.

Now that your data is in Snowflake, check out Snowflake Scripting, where SQL-based stored procedures are now supported!
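As a teaser, here is a hypothetical sketch of how the schema detection and COPY INTO steps above could be wrapped in a single SQL stored procedure. The procedure name, arguments and the fully qualified file format are assumptions for illustration only, not part of the pipeline described in this post.

-- Illustrative Snowflake Scripting procedure: infer the schema, create the table, then load it.
CREATE OR REPLACE PROCEDURE ADVENTUREWORK.SALESLT.LOAD_FROM_STAGE(TARGET_TABLE VARCHAR, STAGE_PATH VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
  -- Create (or replace) the target table from the detected parquet schema.
  EXECUTE IMMEDIATE
    'CREATE OR REPLACE TABLE ' || TARGET_TABLE || ' USING TEMPLATE (' ||
    'SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) FROM TABLE(INFER_SCHEMA(' ||
    'LOCATION => ''' || STAGE_PATH || ''', FILE_FORMAT => ''ADVENTUREWORK.SALESLT.MY_PARQUET_FORMAT'')))';
  -- Load the staged parquet files into the newly created table.
  EXECUTE IMMEDIATE
    'COPY INTO ' || TARGET_TABLE || ' FROM ''' || STAGE_PATH || '''' ||
    ' FILE_FORMAT = (FORMAT_NAME = ''ADVENTUREWORK.SALESLT.MY_PARQUET_FORMAT'')' ||
    ' MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE';
  RETURN 'Loaded ' || TARGET_TABLE;
END;
$$;

-- Example call for the table used in this post (fully qualified names passed in).
CALL ADVENTUREWORK.SALESLT.LOAD_FROM_STAGE('ADVENTUREWORK.SALESLT.ADDRESS', '@ADVENTUREWORK.SALESLT.STAGEDCOPY/address');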
