Schema Unification with Snowflake

A Design Made Simpler

Hashmap
Hashmap
Jan 6 · 6 min read

By Chinmayee Lakkad

Diversity Rules!

Nature, as we all know it, thrives in diversity. The more, the merrier. The more varied, the merriest! To truly appreciate nature one has to take in all its myriad possibilities, all of its variability. One has to accept that ‘change is the only constant’. Perhaps that is what makes it so interesting.

Er…but not so charming when you’re a data engineer. You are tasked with ingesting data with more-than-you-care-for variability in its schema, typical with data that comes from heterogeneous sources. All of the harmonization and unification then needs to be built-in before you can even begin to do any of the necessary transformations, which is all the business really needs, or cares about, at the end of the day!

So, you do what data engineers usually do — figure out the different sources, study their schema, all the while, praying that they don’t at least change while you’re still working on understanding them! All this, and then either

1. If your downstream is relational, and you can E-L-T

Deal with structured data formats, such as CSV you create as many ‘stage’ tables as are required to host each individual dataset to support its schema, and then write some SQL code to perform expensive JOINs and/or UNIONs to bring it all under one hood.

Or,

2. No matter the downstream, you wish to E-T-L

You whip together some code, fine-tuned to accept the known schemas (some more praying here) and transform the datasets on-the-fly. All before you host them in your database and hopefully move on to the real deal — whatever that may be for your use case.

Oh, but wait! here’s nature calling again (no pun intended), and you find that one of the data sources has changed (aka ‘evolved’) its structure. So now you have something similar, but with a different makeup to include now, urgh!

But, as inconvenient as having to deal with data’s structural variability might be, there’s really no avoiding it. And perhaps, this is what gives analysis its power — data captured with so many details, from so many different points of view.

Therefore, we must endeavor to make this whole process of ‘schema harmonization’ easier and as close to standardization as possible. Or, even better, we should try to retain the original schemas of all the datasets while somehow storing them together, within the same data structure, so that we can later access as if they are a single physical entity. You know, have our cake and eat it too!

If Only…

If only there were a tool, a system out there that empowers data engineers to store data with all of its raw pieces of information intact and yet combine them without too much pre-processing/modeling effort.

Enter, and its native support for semi-structured data formats like JSON, AVRO, and Parquet to name a few. Why semi-structured format, you ask? Read on about our real-life scenario where converting structured CSV to JSON made the data pipeline and hence our life so much easier.

A Real-Life Tackle with Data’s Structural Capriciousness

At Hashmap, while working with a client in the energy industry, we came across a use case which dealt with field data being collected from different sites within the same business process, but where data which originated from different sites were not guaranteed to look like one another — and didn’t. The requirement was that all of this data had to be ingested through a single processing pipeline that aimed at generating important operational KPIs to be delivered in real-time.

There was one solution already in place which utilized Apache Spark scripts that performed the tasks for unionizing the disparate datasets, before applying the required business transformations. Although the solution worked, it was cumbersome to understand, ergo maintain. The solution was not modular enough to give it the flexibility to absorb changes to underlying datasets easily or quickly. Besides this was an on-prem solution which meant that the cluster sizing and tuning was completely the onus of the data engineering team that built it.

To make things simpler, Hashmap re-worked the data pipeline to transform the original datasets into JSON, retaining each datasets’ native structure and simply loaded each into a Snowflake table, such that each data point became a row with its own specific schema. A family of sorts!

Store Your Semi-Structured Data in Snowflake

Snowflake allows for storing data in semi-structured formats such as JSON, AVRO, and ORC (to name a few), under a column of type VARIANT. It stores the underlying data in its native format, compressed for efficient storage and access which provides amazing flexibility and frees the data engineer from having to transform the data any further to make it conform to a strict schema definition. In other words, it enables the engineers to not only have access to all data in one place (a single relational table) but with the originality of each data point intact.

So we loaded our disparate datasets to create a pool of data in one Snowflake ‘stage’ table, and then utilized it all to build the necessary operational metrics using Snowflake’s SQL functionality for extracting data from a JSON backed column. The data came from varied sources but we got to treat it all as one table with uniform SQL access patterns. We combined Snowflake with another nifty tool by the name of (Data Build Tool)’ to organize our SQL source and even represent our transformational data pipeline as ‘Select’ SQL to generate the metrics. Dbt handled the DDL for us and in the dialect used by Snowflake.

What About Changes to the Data?

At this point, even if the structure of any of the existing datasets were to change or new variants were to be introduced into the pipeline, nothing would need to be changed in the rest of the pipeline as far as ingesting them goes, given the simple function of data reformatting into JSON is in place. The only change could be adding new columns to the KPI calculations themselves, as-and-when required. This untangled a very messy knot at the beginning of the pipeline, made the overall design more simplified, and modularized — hence easier to maintain and evolve.

Not to forget that Snowflake is an elastic Data Warehouse built on the cloud! So there is literally zero cluster provisioning/tuning required to be done by data engineers while implementing a case.

But, just to be clear, in comparison to the previous solution, there is still a pre-processing step involved to deal with the disparity. But it does not include any complex logic of UNION-izing the datasets by loading them into a data processing engine and dealing with them one-by-one. Only a simple operation that converts the source data format, CSV in our case, into JSON (or any format supported by Snowflake as VARIANT) — something that can be done with a simple serverless processing step.

In Conclusion

There have been numerous NoSQL databases in the market for some time now, which allow data to be stored as documents in more weakly-defined structures, and hence could be helpful when dealing with colorful data. But none of them have proven to provide better data interactivity than a trusty SQL engine, nor with as much ease as Snowflake. Moreover, in the case of Snowflake, the sizes of the datasets which can be ingested potentially have no upper limit in size, and neither do the compute resources that would be needed to dig into them. So there, have your cake and eat it whenever you want too!

Check These Stories Out As Well


Feel free to share on other channels and be sure and keep up with all new content from Hashmap .

Chinmayee Lakkad is a Data Engineering Consultant with working on designing and developing data pipelines on Big Data, Data Warehousing, and Cloud Platforms for the better part of the last 6 years.

HashmapInc

Innovative technologists and domain experts helping accelerate the value of Data, Cloud, IIoT/IoT, and AI/ML for the community and our clients by creating smart, flexible and high-value solutions and service offerings that work across industries. http://hashmapinc.com

Hashmap

Written by

Hashmap

Innovative technologists and domain experts accelerating the value of Data, Cloud, IIoT/IoT, and AI/ML for the community and our customers http://hashmapinc.com

HashmapInc

Innovative technologists and domain experts helping accelerate the value of Data, Cloud, IIoT/IoT, and AI/ML for the community and our clients by creating smart, flexible and high-value solutions and service offerings that work across industries. http://hashmapinc.com

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade