Car buying is fun with Snowflake: Part 2

Parag Shah · Analytics Vidhya · Nov 6, 2020

In Part 1, vehicle inventory data was downloaded locally before being pushed into a Snowflake stage. The process involved manual steps: running a Python script to download the JSON data, running commands in SnowSQL to upload the files, and then a COPY INTO command to load the data into a Snowflake table.

I was able to automate all of it with Azure Logic Apps and Blob Storage. The idea is to run this entire process automatically on a daily basis to get updated inventory from each car dealer into Snowflake.

Here is what the automated workflow looks like:

Steps in the red circle are expanded below:
1. An Azure Logic App is the starting point of this workflow. For now, it uses a REST API trigger that accepts two integer inputs: a start ID and an end ID.

2. Once the API is invoked, the Logic App runs a loop that goes through each ID from the start value to the end value and calls the VMS API.

3. The Logic App stores each API response as an object in Blob Storage.

4. An Event Grid topic is created to push a message to a Storage Queue whenever a new blob is created in storage. This step is required so that Snowpipe is invoked automatically every time a new JSON blob becomes available in the external stage.

Querying the SNOWFLAKE.ACCOUNT_USAGE.PIPE_USAGE_HISTORY view provides a log of Snowpipe executions.
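
For example, a query along these lines shows recent pipe runs and the credits they consumed (the 1-day window is just an illustration):

// Recent Snowpipe activity: runs, credits, bytes and files loaded
select pipe_name, start_time, end_time, credits_used, bytes_inserted, files_inserted
from SNOWFLAKE.ACCOUNT_USAGE.PIPE_USAGE_HISTORY
where start_time >= dateadd(day, -1, current_timestamp())
order by start_time desc;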

Creating an automated Snowpipe requires some setup steps, first in Snowflake and then with the cloud provider.

create notification integration VMS_AZURE_QUEUE_INTEGRATION
  enabled = true
  type = queue
  notification_provider = azure_storage_queue
  azure_storage_queue_primary_uri = 'https://<storage account>.queue.core.windows.net/<queue name>'
  azure_tenant_id = '<Tenant ID>';

// Run the DESC command to get the AZURE_CONSENT_URL and authenticate your Snowflake account. While you do that, you will also need to go to Azure Portal > Storage Account > Queue > Access Control (IAM) > Add Role Assignment to grant the new Snowflake service principal the "Storage Queue Data Contributor" role.
desc notification integration VMS_AZURE_QUEUE_INTEGRATION;

CREATE STORAGE INTEGRATION VMS_AZURE_BLOB_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = AZURE
  ENABLED = TRUE
  AZURE_TENANT_ID = '<Tenant ID>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://<storage account>.blob.core.windows.net/<storage container name>/');

// Run the DESC command to get the AZURE_CONSENT_URL and authenticate your Snowflake account. While you do that, you will also need to go to Azure Portal > Storage Account > Container > Access Control (IAM) > Add Role Assignment to grant the new Snowflake service principal the "Storage Blob Data Contributor" role.
desc storage integration VMS_AZURE_BLOB_INTEGRATION;

CREATE STAGE "VMS"."PUBLIC"."VMS_AZURE_STAGE"
  STORAGE_INTEGRATION = VMS_AZURE_BLOB_INTEGRATION
  URL = 'azure://<storage account>.blob.core.windows.net/<storage container name>'
  COMMENT = 'Stage location for incoming JSON files from LogicApp';
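
A quick LIST against the new stage is an easy sanity check that the storage integration and URL are wired up correctly (a verification step, not part of the original setup):

// Should list the JSON blobs already sitting in the container
LIST @"VMS"."PUBLIC"."VMS_AZURE_STAGE";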


create or replace pipe VMS_AZURE_BLOB_PIPE
  auto_ingest = true
  integration = 'VMS_AZURE_QUEUE_INTEGRATION'
  comment = 'AUTOMATED PIPE'
as
  COPY INTO "VMS"."PUBLIC"."VMS_AZURE_BLOB_LZ1"
  FROM @"VMS"."PUBLIC"."VMS_AZURE_STAGE"
  FILE_FORMAT = (FORMAT_NAME = "VMS"."PUBLIC"."VMS_JSON_FORMAT");
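
After the pipe is created, its state can be inspected with SYSTEM$PIPE_STATUS, and blobs that landed in the stage before the pipe existed can be picked up with a REFRESH. These are optional follow-up commands, not part of the original post:

// Returns a JSON string with fields such as executionState and pendingFileCount
select SYSTEM$PIPE_STATUS('VMS.PUBLIC.VMS_AZURE_BLOB_PIPE');

// Backfill files that were already in the stage before the pipe was created
alter pipe VMS.PUBLIC.VMS_AZURE_BLOB_PIPE refresh;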

Once configured, Snowpipe ran without any hiccups as soon as new files were created in Blob Storage. A big benefit of using Snowpipe is the cost savings while loading data, compared to a manual load using SnowSQL. Snowpipe is serverless and does not need our warehouse to be running.
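
To double-check which files Snowpipe loaded into the landing table, the COPY_HISTORY table function can be queried; the 24-hour window below is arbitrary:

// Per-file load results for the landing table
select file_name, last_load_time, row_count, status
from table(information_schema.copy_history(
  table_name => 'VMS_AZURE_BLOB_LZ1',
  start_time => dateadd(hour, -24, current_timestamp())
));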

The only thing that remains to fully automate this workflow is to replace the REST API trigger with a Recurrence trigger in the Logic App.

Unrelated to the above workflow, I also created an external table in Snowflake to check query performance when accessing semi-structured data directly from the external stage. I experienced very slow query performance: it took an XS-size warehouse 39 minutes to flatten the JSON data in 2,896 blobs, for a total of 233,370 rows after flattening.
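
For reference, the external table setup looked roughly like the sketch below; the exact DDL and the vehicles path in the flatten are reconstructions, since they are not shown in the post:

// Hedged sketch of the external-table experiment
create or replace external table VMS.PUBLIC.VMS_AZURE_EXT
  location = @VMS.PUBLIC.VMS_AZURE_STAGE
  file_format = (format_name = 'VMS.PUBLIC.VMS_JSON_FORMAT')
  auto_refresh = false;

// VALUE is the external table's default VARIANT column; the vehicles attribute is hypothetical
select f.value
from VMS.PUBLIC.VMS_AZURE_EXT e,
     lateral flatten(input => e.value:vehicles) f;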

Given the performance hit external tables take on this kind of data, it saves resources to load the files into Snowflake instead. Once loaded, the same flattening of the JSON data is lightning fast: what took 39 minutes against the external table took just 3.3 seconds against a Snowflake table with a VARIANT column.
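
For comparison, the equivalent flatten against the loaded landing table is sketched below; the VARIANT column name (json_data) and the vehicles path are assumptions, since the table definition is not shown in the post:

// json_data stands in for whatever the VARIANT column in VMS_AZURE_BLOB_LZ1 is actually named
select f.value
from "VMS"."PUBLIC"."VMS_AZURE_BLOB_LZ1" t,
     lateral flatten(input => t.json_data:vehicles) f;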

Part 3 is published. It is all about using Snowflake Streams and Tasks to land the data into the next-level landing zone.
