Schema Unification with Snowflake
A Design Made Simpler
By Chinmayee Lakkad
Nature, as we all know it, thrives in diversity. The more, the merrier. The more varied, the merriest! To truly appreciate nature one has to take in all its myriad possibilities, all of its variability. One has to accept that ‘change is the only constant’. Perhaps that is what makes it so interesting.
Er…but not so charming when you’re a data engineer. You are tasked with ingesting data with more-than-you-care-for variability in its schema, typical of data that comes from heterogeneous sources. All of the harmonization and unification then needs to be built in before you can even begin any of the necessary transformations, which is all the business really needs, or cares about, at the end of the day!
So, you do what data engineers usually do — figure out the different sources, study their schemas, all the while praying that they at least don’t change while you’re still working on understanding them! All this, and then either
1. If your downstream is relational, and you can E-L-T
You deal with structured data formats such as CSV: create as many ‘stage’ tables as are required to host each individual dataset and support its schema, and then write some SQL code to perform expensive JOINs and/or UNIONs to bring it all under one hood.
2. No matter the downstream, you wish to E-T-L
You whip together some code, fine-tuned to accept the known schemas (some more praying here) and transform the datasets on-the-fly. All before you host them in your database and hopefully move on to the real deal — whatever that may be for your use case.
Oh, but wait! Here’s nature calling again (no pun intended), and you find that one of the data sources has changed (aka ‘evolved’) its structure. So now you have yet another dataset to include: similar, but with a different makeup. Urgh!
But, as inconvenient as having to deal with data’s structural variability might be, there’s really no avoiding it. And perhaps, this is what gives analysis its power — data captured with so many details, from so many different points of view.
Therefore, we must endeavor to make this whole process of ‘schema harmonization’ easier and as close to standardization as possible. Or, even better, we should try to retain the original schemas of all the datasets while somehow storing them together, within the same data structure, so that we can later access them as if they were a single physical entity. You know, have our cake and eat it too!
If only there were a tool, a system out there that empowers data engineers to store data with all of its raw pieces of information intact and yet combine them without too much pre-processing/modeling effort.
Enter Snowflake Cloud Data Warehouse and its native support for semi-structured data formats like JSON, AVRO, and Parquet, to name a few. Why a semi-structured format, you ask? Read on for a real-life scenario where converting structured CSV to JSON made the data pipeline, and hence our lives, so much easier.
A Real-Life Tackle with Data’s Structural Capriciousness
At Hashmap, while working with a client in the energy industry, we came across a use case dealing with field data collected from different sites within the same business process, where data originating from different sites was not guaranteed to look alike — and didn’t. The requirement was that all of this data had to be ingested through a single processing pipeline aimed at generating important operational KPIs to be delivered in real-time.
There was one solution already in place, which utilized Apache Spark scripts to union the disparate datasets before applying the required business transformations. Although the solution worked, it was cumbersome to understand, and therefore to maintain. It was not modular enough to absorb changes to the underlying datasets easily or quickly. Besides, it was an on-prem solution, which meant that cluster sizing and tuning was entirely the onus of the data engineering team that built it.
To make things simpler, Hashmap re-worked the data pipeline to transform the original datasets into JSON, retaining each dataset’s native structure, and simply loaded each into a Snowflake table, such that each data point became a row with its own specific schema. A family of sorts!
Store Your Semi-Structured Data in Snowflake
Snowflake allows for storing data in semi-structured formats such as JSON, AVRO, and ORC (to name a few), under a column of type VARIANT. It stores the underlying data in its native format, compressed for efficient storage and access, which provides amazing flexibility and frees the data engineer from having to transform the data any further to make it conform to a strict schema definition. In other words, it enables engineers not only to have access to all data in one place (a single relational table) but with the originality of each data point intact.
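To make that concrete, here is a minimal sketch in Snowflake SQL — the table, stage, and field names (`field_data_stage`, `@landing_stage`, `well_id`, `pressure`) are illustrative, not from the actual client project:

```sql
-- One VARIANT column holds every record, whatever its original schema
CREATE TABLE field_data_stage (record VARIANT);

-- Load newline-delimited JSON files from a named stage
COPY INTO field_data_stage
  FROM @landing_stage/field_data/
  FILE_FORMAT = (TYPE = 'JSON');

-- Colon notation reaches into the JSON; a field a record
-- doesn't have simply comes back as NULL
SELECT
  record:well_id::STRING AS well_id,
  record:pressure::FLOAT AS pressure
FROM field_data_stage;
```

Because missing fields resolve to NULL rather than erroring out, records with different shapes can live side by side in the same table and still be queried with one uniform SELECT.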
So we loaded our disparate datasets to create a pool of data in one Snowflake ‘stage’ table, and then utilized it all to build the necessary operational metrics using Snowflake’s SQL functionality for extracting data from a JSON-backed column. The data came from varied sources, but we got to treat it all as one table with uniform SQL access patterns. We combined Snowflake with another nifty tool by the name of dbt (Data Build Tool) to organize our SQL source and even represent our transformational data pipeline as ‘SELECT’ SQL to generate the metrics. dbt handled the DDL for us, in the dialect used by Snowflake.
What About Changes to the Data?
At this point, even if the structure of any of the existing datasets were to change, or new variants were to be introduced into the pipeline, nothing would need to change in the rest of the pipeline as far as ingesting them goes, given that the simple reformatting of data into JSON is in place. The only change might be adding new columns to the KPI calculations themselves, as and when required. This untangled a very messy knot at the beginning of the pipeline and made the overall design simpler and more modular — hence easier to maintain and evolve.
And let’s not forget that Snowflake is an elastic Data Warehouse built on the cloud! So there is literally zero cluster provisioning or tuning required of data engineers while implementing a use case.
But, just to be clear, in comparison to the previous solution there is still a pre-processing step involved to deal with the disparity. It does not, however, include any complex logic for UNION-izing the datasets by loading them into a data processing engine and dealing with them one by one — only a simple operation that converts the source data format, CSV in our case, into JSON (or any format supported by Snowflake as VARIANT), something that can be done with a simple serverless processing step.
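As an illustration of how light that serverless step can be, here is a minimal sketch, assuming the sources arrive as CSV text — the column names and sample values are hypothetical, not from the client’s data. It converts each CSV file into newline-delimited JSON, preserving exactly the columns that file had:

```python
import csv
import io
import json

def csv_to_ndjson(csv_text: str) -> str:
    """Convert CSV text to newline-delimited JSON, one object per row.

    Each row keeps exactly the columns its source file had, so files
    with different schemas can all land in the same VARIANT column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)

# Two files from different sites, each with its own schema
# (hypothetical field names, for illustration only):
site_a = "well_id,pressure\nW-1,3100\n"
site_b = "well_id,temperature,operator\nW-2,87,ACME\n"

print(csv_to_ndjson(site_a))
print(csv_to_ndjson(site_b))
```

Both outputs can be loaded into the same VARIANT-columned table; no reconciliation of the two schemas happens here, which is the point — the disparity travels into Snowflake intact.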
There have been numerous NoSQL databases in the market for some time now, which allow data to be stored as documents in more weakly-defined structures, and hence can be helpful when dealing with colorful data. But none of them has proven to provide better data interactivity than a trusty SQL engine, nor with as much ease as Snowflake. Moreover, in the case of Snowflake, the datasets that can be ingested have practically no upper limit in size, and neither do the compute resources needed to dig into them. So there, have your cake and eat it whenever you want too!
Chinmayee Lakkad is a Data Engineering Consultant with Hashmap who has been designing and developing data pipelines on Big Data, Data Warehousing, and Cloud Platforms for the better part of the last 6 years.