A Scalable Solution for Integrating and Transforming Data into Snowflake in Near Real Time

Bswarbrick
Cervello, a Kearney Company
8 min read · Sep 21, 2020

As organizations strive to stay competitive, access to timely information is becoming ever more critical. To address the increasing demand for real-time analysis, corporations are increasingly moving to cloud-based solutions that provide on-demand capabilities for managing data that were not possible with legacy technologies. These capabilities include non-intrusive mechanisms for moving data across platforms, and storage and database technologies that provide on-demand control over storage and processing, all at a low cost.

At a recent Cervello client, our challenge was to replace an existing solution in order to provide near real time reporting from hundreds of source databases and thousands of tables. While Change Data Capture tools were being evaluated to replicate on-premises data to the cloud (including the Snowflake Cloud Data Platform), we needed an interim solution that would provide near real time access to high-value data in Snowflake in a short timeframe.

Cervello addressed this challenge by developing an agile solution to stream both structured and semi-structured data (including simple and complex JSON data formats) into an Azure Data Lake that incorporated both Azure BLOB and ADLS storage mechanisms. From there data was streamed in near real time into the Snowflake Cloud Data Platform and then transformed into a format that could be used efficiently by downstream users and applications.

In this article we discuss this solution, which leverages a few key components of Snowflake: Snowpipe, Tasks, and Procedures:

· Snowflake Snowpipe — for streaming structured and semi-structured data from BLOB and ADLS into Snowflake.

· Snowflake Tasks — to automate the transformation processes.

· Snowflake Procedures — for building a rule-driven engine that transforms that data in near real time into a format that can be accessed efficiently and reliably by end users and applications.

Understanding the Source Data

Streaming data may be presented in both structured and semi-structured formats. Semi-structured formats (which are especially common with streaming technologies) may include data stored as simple JSON elements or in more complex formats with nested arrays, where a single JSON payload may map to many of the logical structures that users are familiar with.

Streaming data is often presented as a series of modifications to a transaction. Each modification is typically tagged with an add or change action, contains a Primary Key to which the modification is applied, and is generated in a specific order that is typically identified using a sort sequence. The streaming data may also contain deleted transactions that are identified explicitly with a delete action, or implicitly via the absence of data in relation to other elements (for example, a nested JSON structure that provides a point-in-time snapshot of data may indicate a change for one element of the JSON array, while the absence of other elements should be processed as a delete).
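
For illustration, a hypothetical change record of this kind might look like the following; the field names and values are assumptions made for this article, not taken from any specific source system:

```sql
-- A hypothetical CDC-style change record: an action flag, the primary key the
-- modification applies to, a sequence number used to order changes, and a
-- nested array of line items. All field names are illustrative.
SELECT PARSE_JSON($$
{
  "action": "U",
  "order_id": 1001,
  "seq_no": 58231,
  "payload": {
    "status": "SHIPPED",
    "amount": 149.90,
    "line_items": [ { "sku": "A-100", "qty": 2 }, { "sku": "B-200", "qty": 1 } ]
  }
}
$$) AS change_record;
```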

Transforming the data into a format suitable for analysis may require sifting through unprocessed transactions, sorting and de-duplicating data, mapping to target structures, and merging changes with existing target data. This is relatively straightforward for structured data but more challenging when data is in a semi-structured (and especially complex JSON) format.

Solution Overview

So how do we leverage the Snowflake Cloud Data Platform to provide a scalable solution for integrating and transforming data into Snowflake in near real time? The sections below illustrate how Snowflake's capabilities can be leveraged to achieve this:

Snowpipe

Snowpipe is Snowflake's continuous data ingestion service, which loads data from cloud storage (for example Azure BLOB/ADLS) into the Snowflake database. It is a serverless compute model in which Snowflake manages load capacity, ensuring optimal compute resources to meet demand (Snowflake reference: https://docs.snowflake.com/en/user-guide/data-load-snowpipe.html).

Snowpipe runs independently of upstream data loading to BLOB/ADLS and appends to existing data in the target database. Under the covers, Snowpipe uses the Snowflake COPY command, which allows additional load metadata to be tagged and used to support downstream CDC processing. In the automation solution described here, data loaded into Snowflake via Snowpipe is also stamped with a target load timestamp that can be used to identify unprocessed data if needed. Snowflake also provides a capability known as Streams that can be used to process newly loaded data downstream without having to manage that logic outside of Snowflake, if desired.
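
As a simple illustration, a pipe of this kind might be defined as follows. The stage, table, pipe, and notification integration names below are assumptions for this sketch, not the client's actual objects:

```sql
-- Landing table: raw JSON in a VARIANT column plus load metadata.
CREATE TABLE IF NOT EXISTS raw.orders_stream (
  record    VARIANT,                                    -- raw JSON payload
  src_file  STRING,                                     -- lineage back to the source file
  load_ts   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()   -- target load timestamp
);

-- Pipe that continuously copies new files from an Azure stage into the table.
CREATE PIPE IF NOT EXISTS raw.orders_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'ORDERS_EVENT_INT'   -- hypothetical Event Grid notification integration
AS
  COPY INTO raw.orders_stream (record, src_file)
  FROM (
    SELECT $1, METADATA$FILENAME
    FROM @raw.orders_azure_stage
  )
  FILE_FORMAT = (TYPE = 'JSON');
```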

To support semi-structured formats, Snowflake provides a Variant data type that stores semi-structured data (for example JSON) and a SQL extension that allows JSON structures to be used in SQL statements alongside structured data. These formats for storing data in Snowflake are extremely flexible and allow data to be queried in place using simple SQL commands.
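
For example, assuming the hypothetical landing table and JSON shape sketched above, the raw payload can be queried in place with dot notation and FLATTEN:

```sql
-- Query the VARIANT column in place; one output row per nested line item.
SELECT
  s.record:order_id::NUMBER        AS order_id,
  s.record:action::STRING          AS action,
  s.record:payload:status::STRING  AS status,
  li.value:sku::STRING             AS sku,
  li.value:qty::NUMBER             AS qty
FROM raw.orders_stream s,
     LATERAL FLATTEN(INPUT => s.record:payload:line_items, OUTER => TRUE) li;
```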

Tasks

To support near real time processing of data downstream, the solution described here leverages Snowflake Tasks. Tasks allow scheduled execution of SQL statements according to a specified execution configuration, using a set interval and/or a flexible schedule expressed in a subset of the familiar cron utility syntax. Tasks currently run in a user-specified warehouse (Snowflake reference: https://docs.snowflake.com/en/user-guide/tasks.html).
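
A minimal sketch of such a task, assuming a hypothetical warehouse and the procedure introduced in the next section, might look like this:

```sql
-- Run the transformation procedure every five minutes in a dedicated warehouse.
CREATE TASK IF NOT EXISTS etl.process_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE  = 'USING CRON */5 * * * * UTC'
AS
  CALL etl.process_incoming_changes('ORDERS');

-- Tasks are created suspended and must be resumed before they start running.
ALTER TASK etl.process_orders_task RESUME;
```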

Procedures

Snowflake Tasks also allow the execution of Stored Procedures. Snowflake Stored Procedures provide an extension to SQL that combines the power of SQL and JavaScript to create powerful capabilities for building procedural logic and for generating and executing dynamic SQL.
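
The sketch below shows the general pattern in a deliberately simplified form: a JavaScript procedure reads column metadata from the Information Schema and uses it to build and execute a dynamic statement. In the full solution the generated SQL would be the sort-and-merge logic described later; all object names here are illustrative assumptions.

```sql
CREATE OR REPLACE PROCEDURE etl.process_incoming_changes(TARGET_TABLE STRING)
  RETURNS STRING
  LANGUAGE JAVASCRIPT
AS
$$
  // Look up the target table's column list, in ordinal position order.
  var rs = snowflake.createStatement({
    sqlText: "SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position) " +
             "FROM information_schema.columns " +
             "WHERE table_schema = 'ANALYTICS' AND table_name = ?",
    binds: [TARGET_TABLE]
  }).execute();
  rs.next();
  var colList = rs.getColumnValue(1);

  // Build a dynamic statement driven by that metadata and execute it.
  // (Simplified: the real engine would generate the full transform/merge SQL.)
  var sql = "SELECT " + colList + " FROM analytics." + TARGET_TABLE + " LIMIT 10";
  snowflake.createStatement({ sqlText: sql }).execute();

  return sql;  // return the generated SQL so it can be logged or inspected
$$;
```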

Putting it all together

So how do we use these capabilities to create a metadata-driven solution that transforms source transactions (in a variety of formats) into usable formats for users and applications, requires minimal development (for example, by encapsulating all workload in a common procedure), and provides the ability to ingest new sources quickly? Through the use of accelerators and a metadata-driven approach, this can be achieved as follows:

· Use Snowpipe to ingest data in real time from BLOB/ADLS into a Snowflake table, with semi-structured data loaded into a Variant data type (or a structured format if the source data is structured). Include additional metadata in this step, for example load timestamps and lineage metadata that points back to the data files themselves.

· Create structured target tables in Snowflake that store the result of processing the incoming data. Define Primary Keys on these tables so that this metadata can be used to support sort and merge logic downstream (using accelerators to aid this effort if DDL can be converted from available sources). Defining Primary Keys in Snowflake ensures that code needing access to keys can reference the key definitions from a single source of truth and that code can be generated reliably.

· Create a Stored Procedure in Snowflake that generates dynamic SQL and transforms incoming data into target formats based on business rules (metadata) provided. Common actions of this procedure include the following:

> Extracting the target table DDL (Tables, Columns, Data Types and Ordinal Positions) and Primary Key details from the Snowflake Information Schemas based on metadata rules provided.

> Generating dynamic SQL select statements that extract changed data from the source tables loaded via Snowpipe. If source structures are Variant and simple JSON then create database views ‘on the fly’ that parse the JSON into a structured format. If source JSON formats are complex, build views in advance that parse complex JSON into structured formats that map to target tables and reference these objects in the procedure.

> Mapping data to target data types based on the Information Schema metadata that includes data type, precision and scale. Setting session options for expected source data formats that include Date/Time and Binary data based on metadata settings.

> Sorting data based on Primary Keys and using window functions to generate the most recent transaction attributes for the Primary Key. This process leverages source sequence and load timestamp metadata to ensure data is processed in the correct order.

> Merging transformed data into target tables, with consideration for the processing of deleted transactions (leveraging metadata on the transactions themselves, or inference where required; for complex JSON, identifying a deleted element may require tracking parent elements in the JSON source). A simplified sketch of this sort-and-merge pattern follows this list.

· Use Snowflake Tasks to automate the process and choose appropriate compute resources to handle expected load. Define Snowflake Warehouse names in the metadata and use accelerators to generate the Tasks if desired.
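
To make the sort-and-merge step concrete, below is a simplified, hand-written example of the kind of statement the procedure would generate dynamically for the hypothetical tables used earlier. A production version would be built from the Information Schema and rules metadata rather than written by hand:

```sql
-- Keep only the most recent change per primary key, then merge it into the
-- target table, honoring deletes. All object and column names are illustrative.
MERGE INTO analytics.orders AS tgt
USING (
    SELECT
      record:order_id::NUMBER              AS order_id,
      record:action::STRING                AS action,
      record:payload:status::STRING        AS status,
      record:payload:amount::NUMBER(10,2)  AS amount
    FROM raw.orders_stream
    -- in practice, restrict to unprocessed rows here (e.g. by load_ts or a Stream)
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY record:order_id::NUMBER
      ORDER BY record:seq_no::NUMBER DESC, load_ts DESC
    ) = 1
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND src.action = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.amount = src.amount
WHEN NOT MATCHED AND src.action <> 'D' THEN
  INSERT (order_id, status, amount) VALUES (src.order_id, src.status, src.amount);
```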

Conclusion

As organizations move from traditional batch processing of data to right-time and streaming solutions, JSON is becoming a more common data format and a preferred way to store data in the cloud, especially when the eventual usage of the data may not be fully understood. Snowflake not only provides a mechanism for storing this format in the database through its Variant data type but also provides a SQL mechanism for accessing it. Users familiar with SQL can learn to access JSON formats through Snowflake's SQL extensions. Snowflake automates the movement of data from cloud platforms and provides capabilities to process that data downstream using Tasks, with the ability to execute SQL statements and Stored Procedures.

Taking those components and building reusable solutions allows for developing consistent code that reduces the potential for coding errors. New sources can be incorporated quickly when existing solutions can be leveraged out of the box. Testing time is dramatically reduced and, in some cases, eliminated altogether.

Now as a developer wouldn’t that make you happy?!

About Cervello, a Kearney company

Cervello is a data and analytics consulting firm and part of Kearney, a leading global management consulting firm. We help our leading clients win by offering unique expertise in data and analytics and in the challenges associated with connecting data. We focus on performance management, customer and supplier relationships, and data monetization and products, serving functions from sales to finance. We are a Solution Partner of Snowflake because of its unique architecture. Find out more at Cervello.com.

About Snowflake

Thousands of customers deploy Snowflake’s cloud data platform to derive insights from their data by all their business users. Snowflake equips organizations with a single, integrated platform that offers the data warehouse built for the cloud; instant, secure, and governed access to their network of data; and a core architecture to enable many other types of data workloads, such as developing modern data applications. Find out more at Snowflake.com.
