Hybrid Relational JSON Tables in Snowflake

Have you ever had a new field or a field rename break your ETL pipelines to the data warehouse, and ultimately your consumers of data? If you have ever done a data project, the answer is undoubtedly “yes”. To solve this age-old problem, some tenets of Hadoop’s schema-on-read and Snowflake’s separation of storage and compute come together in beautiful harmony on the Snowflake Data Cloud. In Snowflake, you can have relational columns and JSON in the same table. I have customers doing this to the tune of 70,000,000,000 (seventy billion) records in one table, and it scales beautifully in a non-brittle fashion via ELT.

What do Hybrid Tables Enable?
You get all the goodness of a relational table, with RBAC and ACID, without the fragility of data types, rigid schemas, and pipelines that fail for mostly cosmetic reasons. If the data is bogus, you can transform it in the database. This ELT evolution lets you use the power of Snowflake’s compute and Time Travel to undo data issues quickly. This core tenet is huge in Snowflake because you can elastically scale up thousands of CPUs and unwind an issue in minutes instead of days, without the bottleneck of an ETL tool.

Further, storing most fields as JSON instead of columns allows schema evolution to flow through the database instead of being stopped by it. Most of the fields in a table are never used in a SELECT clause anyway. Let them live on the long tail in JSON, and move important fields, key fields, and dates to columns for ease of querying and joins. When new fields appear in your JSON payload without notice, no worries: they are already in the database and queryable via v:parent.child.new_field::&lt;type&gt;, as in the sketch below. Ultimately, this allows for incredibly fast ingest to the database, to the tune of millions of events per second. Take the event from the message broker, store it in a hybrid table, and use views to unwind the JSON for analytics. This will increase your data agility by an order of magnitude and make the database the star of the show instead of the constant perceived bottleneck on the road to real-time data.
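Here is a minimal sketch of that long-tail access, assuming a hypothetical hybrid table events with typed columns plus a VARIANT column v; the table and field names are illustrative:

```sql
-- A brand-new JSON field is queryable the moment it lands:
-- no ALTER TABLE, no pipeline change (all names here are illustrative).
SELECT
    event_id,                                      -- promoted relational column
    event_dts,                                     -- promoted date column
    v:parent.child.new_field::string AS new_field  -- cast pulled straight from the JSON
FROM events
WHERE event_dts >= DATEADD(day, -1, CURRENT_TIMESTAMP());
```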

Two distinct ELT patterns emerge to move raw JSON to report-ready tables and views: a pull-through or a push-through from raw, to conformed, to the reporting objects. In ELT pattern 1 (pull), BLOB storage is the raw zone; Snowpipe unwinds the JSON a little, with further views ready for reporting. In ELT pattern 2 (push), Snowflake stores the data in a raw hybrid table, and MERGE logic then unwinds the data into tables and views ready for reporting. Both patterns are viable, and the complexity of the transform logic will drive your choice. The higher the complexity, the more likely you are to choose pattern 2 (push), since MERGE logic can support more complex transforms and upserts; a sketch of the pattern-2 MERGE follows the pattern list below.

ELT Pipeline Pattern 1: Push JSON to S3 and Pull from Snowpipe with the Transform Logic
ELT Pipeline Pattern 2: Push JSON through to Metadata / JSON tables in Snowflake and Transform in MERGE Logic
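For pattern 2, the transform step is a MERGE from the raw hybrid table into a conformed table. A minimal sketch, with hypothetical table and field names, assuming the raw table holds each payload in a VARIANT column v:

```sql
-- Hypothetical pattern-2 transform: upsert conformed rows from the raw hybrid table.
MERGE INTO conformed_customer AS tgt
USING (
    SELECT
        v:customer.id::number                AS customer_id,
        v:customer.name::string              AS customer_name,
        v:customer.updated_at::timestamp_ntz AS updated_dts
    FROM raw_events
    -- keep only the latest event per customer so the MERGE is deterministic
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY v:customer.id
        ORDER BY v:customer.updated_at DESC) = 1
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    tgt.customer_name = src.customer_name,
    tgt.updated_dts   = src.updated_dts
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, updated_dts)
    VALUES (src.customer_id, src.customer_name, src.updated_dts);
```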

Hybrid Table Design Patterns
Some level of schema design is necessary for our hybrid tables. Typically, it is best to promote cluster keys, join keys, and common WHERE-predicate fields into full-fledged columns in Snowflake. Further, it is good to pull out metadata like the file dts, insert dts, file name, and file path, and to create a sequence or UUID on the table. Leave everything else as JSON. Let me say that again: leave everything else as JSON. Snowflake does something very special to JSON and other variant types as you ingest. It creates a fully columnarized dictionary over the JSON, greatly speeding up query times on the JSON document. This means you get MPP column performance without the constraints of a physical schema. Snowflake collects this metadata dictionary automatically, even for a simple table, and this unseen step is what allows JSON hybrid tables to scale and perform to trillions of rows and petabytes in Snowflake.
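A minimal sketch of that design pattern, with hypothetical names: a handful of promoted columns, ingest metadata, a sequence, and one VARIANT column for everything else.

```sql
-- Hybrid table design: keys, dates, and ingest metadata as columns,
-- the long tail of fields in a single VARIANT column (names are illustrative).
CREATE OR REPLACE TABLE orders_hybrid (
    record_id   NUMBER IDENTITY,   -- sequence on the table
    order_id    NUMBER,            -- join key promoted to a column
    order_dts   TIMESTAMP_NTZ,     -- common WHERE predicate / clustering field
    file_name   STRING,            -- ingest metadata
    file_dts    TIMESTAMP_NTZ,
    insert_dts  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    v           VARIANT            -- everything else stays as JSON
)
CLUSTER BY (order_dts);
```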

Public Example
This is great; I want to try it! Yes, yes you do, and I have something special for the Holidays for you. I extended a simple example of my friend Chris Richardson’s hard work. First, we will create the database, file format, and stage for the 11 JSON files I am hosting on public S3.
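The setup looks something like this; the stage URL is a placeholder since the exact bucket path isn't reproduced here, and the object names are illustrative:

```sql
-- Database to hold the example.
CREATE OR REPLACE DATABASE json_demo;
USE SCHEMA json_demo.public;

-- JSON file format; STRIP_OUTER_ARRAY splits a top-level array into rows.
CREATE OR REPLACE FILE FORMAT json_ff
    TYPE = 'JSON'
    STRIP_OUTER_ARRAY = TRUE;

-- External stage over the public S3 bucket (URL is a placeholder).
CREATE OR REPLACE STAGE json_stage
    URL = 's3://<public-bucket>/<path>/'
    FILE_FORMAT = json_ff;

-- Sanity check: the 11 JSON files should list here.
LIST @json_stage;
```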

This is why you love Snowflake: even the file format is a first-class object for handling JSON, so much more agile and best-practice friendly!

Now we can create the hybrid table using Snowflake’s special VARIANT type (see the Snowflake documentation for further reading).
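A minimal sketch of the table and the load, assuming the stage and file format above; the table and column names are illustrative:

```sql
-- Hybrid table: ingest metadata as columns, the full payload as VARIANT.
CREATE OR REPLACE TABLE json_events (
    file_name  STRING,
    file_row   NUMBER,
    v          VARIANT
);

-- Load the staged files, capturing file metadata alongside each JSON record.
COPY INTO json_events
FROM (
    SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1
    FROM @json_stage
);
```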

Columns and JSON coexisting without pandemonium.

The final step is to use Snowflake’s dot notation over the JSON to create an analytics-ready view. As new data lands in the hybrid table, it is instantly selectable through the view… it’s that simple.
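A sketch of such a view; the JSON paths here are illustrative and should be matched to the actual payload:

```sql
-- Analytics-ready view: dot notation flattens the JSON into typed columns.
CREATE OR REPLACE VIEW json_events_vw AS
SELECT
    file_name,
    file_row,
    v:id::number                AS event_id,
    v:type::string              AS event_type,
    v:created_at::timestamp_ntz AS created_dts
FROM json_events;

-- New rows landing in json_events are immediately visible here.
SELECT * FROM json_events_vw LIMIT 10;
```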

Analytics ready, stored once as JSON, scales to trillions of records and petabytes.

Conclusion

First and foremost, let the Data Cloud do the hard work. Ingest data into relational JSON hybrid tables, and Snowflake will columnarize the data without you typing SQL for days. Store the important fields as columns and leave the rest to Snowflake by keeping everything else as JSON, without sacrificing performance. Merging the tenets of schema-on-read, BLOB storage, and smart metadata with ACID and a cloud-first architecture is what allows Snowflake to scale up to web scale. Happy JSON Querying!


Originally published at http://bigdatadave.com on December 24, 2020.

David A Spezia

David spends his time working with Web-Scale Data Applications built on Snowflake with the largest tenants on the Snowflake Data Cloud.