ADF Snowflake Connector: A Bit Underbaked

Why the new ADF Snowflake Connector should have stayed in the oven a little longer.

Kieran Healey
Hashmap, an NTT DATA Company
9 min read · Jun 12, 2020


by Kieran Healey and Senthil Anandarangan

It is not often we write pieces like this. Here at Hashmap, we ❤ new ways to interact with Snowflake; however, we regret to inform you that the new ADF Snowflake Connector feels like a technology from early 2018, not mid-2020. If ADF is trying to compete with other data integration tools (Fivetran, Matillion, and Qlik), then the new ADF Snowflake connector falls short in several key areas. We find this a disappointing release from the Azure Data Factory Team, as we were hoping the tool would be more akin to these technologies in terms of functionality and scope. This quote summarizes our feelings about the new ADF connector succinctly:

“A delayed game is eventually good, but a rushed game is forever bad.” — Shigeru Miyamoto

While Mr. Miyamoto is talking about games here, his words ring true for any product. From video games to architecture to data: if you rush a product, that reputation sticks with you, no matter how much you improve after launch. Products like No Man's Sky, the Zune, and Intel's i9-10900K were rushed to completion and were critically and commercially panned. No amount of Day 1 patches or future content updates will draw in customers.

The new ADF Snowflake connector feels the same way: a release rushed out to mitigate market-share loss rather than a proactive move, or at the very least a reactive one. We hope that the Azure Data Factory Team is not discouraged by this, but rather that they can learn, grow, and become a true competitor to the tools mentioned above. To do that, they will need to fix four key areas of their product: Ease of Setup and Configuration, Sustainability, Management, and, most importantly, Performance.

Selecting Snowflake as a DataSet in Azure Data Factory

Ease of Setup + Configuration

Here are step-by-step instructions for configuring the tool:

1. Create a Blob Storage DataSet in Azure Data Factory; this should point to the data you wish to load. (Limitation Alert #1: the connector only supports delimited text file and Parquet loads at this time.) You will also need to create a linked service to the storage account where your data resides.
CSV Blob DataSet, N.B. the row delimiter

2. Set up the Snowflake linked service and connection. You will need your Snowflake account name (the full account identifier for your Azure-hosted Snowflake account) and a user account with the correct privileges to run in the Snowflake environment. Then provide the database and the warehouse you wish to run on.

How to configure the Snowflake Linked Service

3. Create your Copy Data activity and configure the Source and Sink tabs.

Example of a configured source

4. Create your Table and Schema in your Snowflake environment.

Create Table for your Data to land in

5. Run your Azure Data Factory pipeline to load the data into Snowflake. You should see a call made via ODBC in your Snowflake query history; this is how you know ADF has successfully landed data in Snowflake. If you check the command that was run, you should see something to this effect (sketched below).

What ADF sends to Snowflake after you configure your Azure Data Factory
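To make that concrete, here is a rough, hypothetical sketch in Snowflake SQL of the pre-created table from step 4 and the kind of COPY INTO command the connector issues in step 5. The database, schema, table, column, and stage names are placeholders, and the exact options ADF emits will vary with your dataset settings.

    -- Hypothetical sketch only: all names are placeholders.
    -- Step 4: the target table has to exist in Snowflake before the pipeline runs.
    CREATE TABLE MY_DB.MY_SCHEMA.MY_LANDING_TABLE (
        ID        NUMBER,
        NAME      VARCHAR,
        LOAD_DATE TIMESTAMP_NTZ
    );

    -- Step 5: ADF writes the file(s) to an intermediate blob area and issues a
    -- COPY INTO against an auto-generated stage (its name shows up in the history entry).
    COPY INTO MY_DB.MY_SCHEMA.MY_LANDING_TABLE
    FROM @MY_ADF_GENERATED_STAGE
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 0);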

Notice for a moment the name of the Azure Data Factory stage. This is important to note because it is the name of the intermediate blob storage area that gets created in Azure. I will return to this later in the Sustainability section. Just know it is not ideal.

The Gotchas

When firing up this tool, Senthil and I found setup and configuration to be the most challenging part. The ADF Snowflake Connector restricts the copy functionality we love from Snowflake. If we deviated from the predefined accepted parameters in any way, we got a validation error from Azure Data Factory that directed us to a new problem, such as wildcards needing to be defined not on the source but in the copy activity itself. These error messages took some time to work through and were not intuitive. Here is a list of the pain points we encountered:

  • The Snowflake table must be created ahead of time (and manually recreated if the schema changes).
  • Row delimiters must be \n, \r, or an edited custom separator; anything else is not supported (Validation Error).
  • The first line cannot be used as a header (Validation Error).
  • A COPY INTO flag that defaults to FALSE in Snowflake is set to the opposite option by the connector, causing old data to be reloaded rather than only the new data (see the sketch after this list).
  • Schemas must be re-imported when changes are made to the source data.
  • The Recursive checkbox must be enabled (Validation Error).
  • A blob folder is needed to load more than one file or to use wildcard ingestion.
  • Wildcard ingestion cannot be set up on the source dataset; it has to be overridden in the Copy activity in the ADF pipeline.
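On that COPY INTO flag: Snowflake's COPY INTO defaults to FORCE = FALSE, which skips files that have already been loaded. The connector's behavior matches the opposite setting, so previously loaded files come back in. A minimal sketch of the difference, with placeholder names:

    -- Illustrative only; table and stage names are placeholders.
    -- Snowflake's default: files that were already loaded are skipped on a re-run.
    COPY INTO MY_DB.MY_SCHEMA.MY_LANDING_TABLE
    FROM @MY_STAGE
    FILE_FORMAT = (TYPE = CSV)
    FORCE = FALSE;   -- the Snowflake default, spelled out here for contrast

    -- What the connector's behavior resembles: every file is reloaded,
    -- so old data lands in the table again alongside the new data.
    COPY INTO MY_DB.MY_SCHEMA.MY_LANDING_TABLE
    FROM @MY_STAGE
    FILE_FORMAT = (TYPE = CSV)
    FORCE = TRUE;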

What particularly concerns us are the validation errors. What happens if your data does not fall into the nice format ADF expects? You will have to waste more time creating an Azure Function to handle that transformation yourself. Here is a picture showing what needs to be enabled to allow the connector to run without validation errors.

Pain Point Example: Inside the ADF Pipeline, these two must be set here instead of the dataset

Sustainability

When it comes to sustainability, it helps to understand the mechanics behind the process. Snowflake has many ways to copy data into its underlying tables, so it is important to note how the Azure Data Factory connector handles copying data into Snowflake. The connector currently has two modes: a Direct Copy and a Staged Copy. The Direct Copy is limited in its use, as any of the scenarios below forces the user to perform a Staged Copy.

1. The source data format does not meet the criteria for ADF to load it directly into the destination data store.

2. The source data sits in an on-prem/hybrid cloud environment and needs decompression to load efficiently into the destination data store.

3. Data must be transported securely from source to destination without opening up port 443.

If your data environment requires any of these things, you will not be able to use the Direct Copy and will be forced to use the Staged Copy. This presents its own challenges, not from a security or functionality perspective, but from a cost perspective.

A Staged Copy brings the source data to an interim temporary stage location, performs the necessary transformation, and finally sends the data to the destination store. After the transfer is complete, the data is expunged from the temporary location.

Staged Copy Works with ADF Connector

So how does one configure the Staged Copy? In the Snowflake Sink options of the Azure Data Factory pipeline, you must set enableStaging to TRUE. Snowflake automatically compresses the data using its standard AUTO compression, but you can set the compression based on your preference, with the options depending on the file type. There are your standard GZIP, BZ2, and DEFLATE, to name a few, for flat files; Parquet has either SNAPPY or LZO. A list of compression options can be found here.
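To give a feel for what those compression choices correspond to on the Snowflake side, here is a hedged sketch of Snowflake file formats using the options named above. The format names are placeholders, and the connector configures the equivalent settings for you internally; this is only for orientation.

    -- Illustrative only: these Snowflake file formats mirror the compression options
    -- mentioned above; the connector configures the equivalent settings internally.
    CREATE FILE FORMAT MY_CSV_GZIP_FORMAT
        TYPE = CSV
        COMPRESSION = GZIP;     -- other flat-file options include BZ2 and DEFLATE

    CREATE FILE FORMAT MY_PARQUET_SNAPPY_FORMAT
        TYPE = PARQUET
        COMPRESSION = SNAPPY;   -- LZO is the other Parquet option mentioned above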

Another aspect of sustainability is the cost of execution. Currently, the execution of an ADF pipeline to Snowflake costs #DIUs × copy duration × $0.25/DIU-hour, in addition to whatever Snowflake charges. This is because ADF still generates an intermediate blob area where it writes the data. We initially thought this step would have been cleaned up or dropped with the new ADF connector; it seems like a large oversight.
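As a hypothetical worked example (the DIU count is made up, the duration is our measured load below): a copy running with 4 DIUs for 345 seconds would cost roughly 4 × (345/3600) h × $0.25/DIU-hour ≈ $0.10 in ADF data movement charges, before adding Snowflake warehouse credits and the storage and transaction costs of the intermediate blob staging area.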

Management

To give credit where credit is due, the ADF Snowflake connector does move data from Blob to Snowflake well. However, the lack of any auto-adjustment to source schema changes or any CDC functionality leaves the connector feeling a little bare-bones from a data management perspective, and the problem is only exacerbated at scale. What happens when a company needs to load data into Snowflake from sources other than flat files in blob storage? They will need to revert to building the pipeline manually or to using a tool like Fivetran, Matillion, or Qlik. What if you need CDC functionality? Then your team needs to either build its own custom CDC tool or find a tool that has it baked in. No enterprise has only flat files that need loading into Snowflake. By using this tool, you are committing your business to additional methods and tools your developers must learn unnecessarily, ultimately slowing down development and your insights.

The lack of features provided by the Azure Data Factory Team dooms this tool; as it stands, it is a net negative for development stacks that would like a native Azure experience for connecting to Snowflake.

This error message sums up the experience well from a management perspective:

Direct copying data from Snowflake is only supported when sink dataset is DelimitedText or Parquet with Azure Blob Storage linked service, for other dataset or linked service, please enable staging

Copying data with any other combination of formats or linked services forces the user to adopt the staged approach, making it more expensive to operate than a direct copy.

Performance

Given the lack of features, you would hope that the ADF connector at least performed better in a speed test compared to setting up the stage yourself and running a COPY INTO command. Unfortunately, this is not the case. Here are the results of loading a 5.3 GB CSV file into Snowflake using both methods:

ADF Connector: 345 seconds

BYODP (Build-Your-Own-Data-Pipeline): 276 seconds

The BYODP approach was roughly 20% faster than the ADF connector. I think this is the final nail in the coffin for the ADF connector: it isn't faster or more scalable, it lacks CDC, it is harder to set up and manage, and it is not sustainable. The benefits of other tools, or even of building the pipeline yourself, outweigh using a tool that is already outclassed in every way. Save your future self the headache of having to redesign around this tool when loading other sources.
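For context, the BYODP path we are comparing against is roughly the following sequence. All names and the SAS token are placeholders, and your stage definition will depend on how you authenticate to blob storage.

    -- Rough sketch of the BYODP path; all names and credentials are placeholders.
    -- 1. Point an external stage at the blob container holding the CSV file.
    CREATE STAGE MY_BLOB_STAGE
        URL = 'azure://mystorageaccount.blob.core.windows.net/mycontainer'
        CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- 2. Load directly into the target table with a single COPY INTO.
    COPY INTO MY_DB.MY_SCHEMA.MY_LANDING_TABLE
    FROM @MY_BLOB_STAGE
    PATTERN = '.*[.]csv';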

In Conclusion

We would like to start our conclusion by saying this is not meant as an attack on the Azure Data Factory Team. Here at Hashmap, we strive to provide the best experience for our customers, and we pride ourselves on being among the first to adopt and learn about new technology regardless of platform. We were hoping that this connector could help our customers leave some paid vendor tools behind and potentially lower costs while at least maintaining functionality. That is not the case today. We believe that the Azure Data Factory Snowflake Connector will gain features in time, but we won't be recommending that our customers switch any enterprise data loads to this method anytime soon.

Other Tools and Content for You

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.

Kieran Healey and Senthil Anandarangan are Cloud and Data Engineers with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
