The art and science of nested, polymorphic JSON transformation in Snowflake
How to dynamically transform JSON into actionable insights by leveraging the Snowflake Native App Framework
Written by Jon Farr, Founder and CEO of TDAA!, in partnership with Andrew Curran, Founder & COO of TDAA!, and Cameron Wasilewsky, Senior Sales Engineer at Snowflake
Update: the Pancake app (and a free FDA dataset) is now available in the Snowflake Marketplace! The app is fully functional, so you can scan, flatten, and relate your JSON data today! The app package contains a sample JSON related to real estate as well as SQL code to create Dynamic Tables from the data. The app’s readme contains a quick start guide for the sample data and one for the FDA data so you can explore in depth.
During my journey into the depths of data, I’ve encountered every kind of data-related challenge you can think of. Often those challenges are specific to the environment, but processing nested hierarchical data that can be polymorphic has been one of the most frustrating challenges of my entire career. It is one of my industry’s most intricate problems, affecting database management systems (DBMS) and advanced programming paradigms in equal measure. We’ve made solving this problem our business at TDAA! with our product, Pancake. Drawing on the decades I’ve spent focused on information and data architecture, I hope to effectively communicate my passion for revealing the complexities hidden in these data structures.
The Problem
Our adventure began when I started working with my new co-founder Andrew, and we set out to address the issue of hallucinations in gen AI by focusing on solutions designed to improve the quality of data made available to an LLM. As part of this process, we chose to work with the FDA’s Medical Device data, a free dataset provided by the US government (and one that Paul Horan has documented in his excellent blog series on working with large JSONs in Snowflake).
In working with government data and a very well-defined data dictionary, the last thing I expected to encounter was polymorphic data. While flattening the nested JSON structure, I discovered that some of the array attributes could contain primitive values such as strings in some rows and full objects with nested arrays in others. This was a complex problem that needed a more dynamic and elegant solution.
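To make the problem concrete, here is a pair of hypothetical rows (not the actual FDA schema) in which the same attribute holds a plain string in one document and an array of objects in another:

```sql
-- Two hypothetical documents: the "address" attribute is a plain string
-- in one row and an array of objects in the other.
SELECT PARSE_JSON('{"device_id": 1, "address": "123 Main St"}') AS doc
UNION ALL
SELECT PARSE_JSON('{"device_id": 2, "address": [{"street": "123 Main St", "city": "Boston"}]}') AS doc;
```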
I have always appreciated the flexibility and advantages of hierarchical data formats like JSON, but I am also very aware of the technical debt they can bring to application and data warehouse databases. That flexibility can inadvertently introduce layers of complexity, particularly when it involves issues like polymorphic attributes or nested hierarchical data structures.
Manually parsing a large, deeply nested, polymorphic JSON is incredibly cumbersome. It is tedious, complicated, and error-prone, but this kind of data is pivotal for analytics and product development. Unfortunately, the challenges of polymorphic forms and the implications of schema evolution often make the opportunity cost of activating this data too high.
The ubiquity of hierarchical data as a storage format and the likelihood of a given data source containing these complex issues is precisely what we’ve aimed to simplify with Pancake.
Our goal is to unlock the value of this kind of data by transforming it into a clearly defined structure with visibility into all attributes at every level, regardless of the data types used for each attribute over time.
We are thinking of the many engineers and analysts who are negatively impacted by a lack of access to data. The time it takes to unpack this data, or deal with its polymorphic nature, means that they most likely do not have access to all of the data. This problem impacts application development as well because attributes in a document database frequently change — but if data migrations are not done successfully, the quality of the application can be impacted tremendously.
Technical Deep Dive
Our architectural blueprint was centered on thoroughly analyzing a data source containing schema-less, semi-structured JSON data, including its nested hierarchy and the polymorphic state of every attribute. We developed a system that identifies and adapts to the data’s evolving structure, ensuring our output remains accurate and relevant. This system, integrated with the Snowflake Native App Framework, operates seamlessly within the customer’s environment, maintaining data integrity and security.
We follow four steps to process any dynamic, complex JSON structure, enabling the extraction and flattening of that data into a relational table format.
Step 1 — Scan a data source containing JSON objects and discover the schema and any polymorphic attributes
Our application scans a Snowflake variant column containing rows of JSON objects and discovers every attribute by recursively traversing the entire object in each row, tackling the intricacies of schema-less, hierarchical data. During this discovery phase, our app generates detailed metadata for each attribute, accommodating any polymorphism observed in the data.
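As a rough illustration, plain Snowflake SQL can express this kind of scan. The following is a minimal sketch rather than Pancake’s internal implementation; the raw_devices table and its doc variant column are hypothetical:

```sql
-- Recursively flatten every document and record each attribute path,
-- the data types it appears with, and how often. A path that appears
-- with more than one observed type is polymorphic.
SELECT
  REGEXP_REPLACE(f.path, '\\[[0-9]+\\]', '[]') AS attribute_path,  -- collapse array indices
  TYPEOF(f.value)                              AS observed_type,
  COUNT(*)                                     AS occurrences
FROM raw_devices r,
     LATERAL FLATTEN(INPUT => r.doc, RECURSIVE => TRUE) f
GROUP BY 1, 2
ORDER BY 1, 2;
```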
Step 2 — User configuration of metadata
Users are empowered to configure some of the metadata generated in Step 1, including optionally overriding the inferred Snowflake data type, precision, scale, and datetime format. For all array attributes, the user will be able to specify the natural foreign keys by selecting parent-level attributes. These natural foreign keys will be used to create relational dynamic tables.
The system recognizes and adapts to the various data types an attribute might embody. It intelligently infers the appropriate Snowflake data type for each piece of data, drawing on a sample anonymized value. This process includes discerning a date or datetime data type from a string based on varied ISO standard formats.
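A minimal sketch of this style of inference using Snowflake’s TRY_ conversion functions (the sampled value and the format list are illustrative; Pancake’s actual inference covers many more formats):

```sql
-- Attempt progressively stricter conversions on a sampled string value,
-- falling back to VARCHAR when none succeed.
SELECT
  v AS sample_value,
  CASE
    WHEN TRY_TO_NUMBER(v) IS NOT NULL                               THEN 'NUMBER'
    WHEN TRY_TO_TIMESTAMP(v, 'YYYY-MM-DD"T"HH24:MI:SS') IS NOT NULL THEN 'TIMESTAMP_NTZ'
    WHEN TRY_TO_DATE(v, 'YYYY-MM-DD') IS NOT NULL                   THEN 'DATE'
    ELSE 'VARCHAR'
  END AS inferred_type
FROM (SELECT '2020-05-17' AS v);
```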
Step 3 — Generate SQL to create nested Snowflake Dynamic Tables and a consolidated view of the JSON schema, including all polymorphic versions of each attribute
Our tool uses the generated and user-configured metadata from Steps 1 and 2 to produce SQL statements that establish a series of nested Snowflake Dynamic Tables. These Dynamic Tables serve as the structured, accessible outcome of what was once a tangled web of nested and polymorphic data. Each polymorphic attribute is defined in the Dynamic Table as one or more columns, one for each data type the attribute has used.
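The following is a hypothetical example of the kind of statement the tool emits; the table name, warehouse, and lag are placeholders:

```sql
CREATE OR REPLACE DYNAMIC TABLE device_flat
  TARGET_LAG = '1 hour'
  WAREHOUSE  = transform_wh   -- placeholder warehouse
AS
SELECT
  r.doc:device_id::NUMBER AS device_id,
  -- one column per data type the polymorphic "address" attribute has used:
  IFF(TYPEOF(r.doc:address) = 'VARCHAR', r.doc:address::VARCHAR, NULL) AS address_str,
  IFF(TYPEOF(r.doc:address) = 'ARRAY',   r.doc:address,          NULL) AS address_array_object
FROM raw_devices r;
```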
Users can then create select statements that join these dynamic tables at various levels using the natural foreign key attributes defined during the configuration process, enhancing downstream data transformation, interconnectivity, and analysis.
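For instance, if device_id were selected as the natural foreign key for the address array in Step 2, a parent table and its child table might be joined like this (all names are hypothetical):

```sql
-- Join the parent Dynamic Table to the child table created for the
-- "address" array, using the configured natural foreign key.
SELECT p.device_id, c.street, c.city
FROM device_flat p
JOIN device_flat_address c
  ON c.device_id = p.device_id;
```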
Our app will also generate a consolidated view of the discovered JSON schema and all polymorphic versions of an attribute with their associated data type.
Step 4 — Post-flattening: Monitor and maintain
A monitoring feature is integral to our app, ensuring that the system remains current with the data it processes. It alerts users to schema modifications in the source data, prompting them to re-run the Dynamic Table SQL generation process. The user can then deploy the newly generated SQL to update the Dynamic Tables to reflect the latest data structure, maintaining the accuracy and relevance of the transformed data.
Our game changers
Developing SQL code to create Snowflake Dynamic Tables dynamically
Our app revolutionizes JSON data parsing, extracting, relating, and flattening by generating SQL code to produce Snowflake Dynamic Tables that adapt dynamically to the ever-changing structure of nested data, including support for polymorphism at the attribute or object level. This approach streamlines the transformation process and ensures that the data remains accurate and reliable, irrespective of its evolving nature.
Each attribute can take the shape of three major types of data: primitives (string, decimal, bool, etc.), arrays (of primitives or objects), and objects (which can contain additional nested objects or arrays).
Our app generates Dynamic Table create statements with column definitions to support each type of data an individual attribute has ever used.
So if one attribute takes the shape of four different data types, one column will be defined for each of those data types, as shown in the sketch below. In our example, you will see address_str and address_array_object. This allows data engineers and analysts to begin working with the data immediately and clearly understand its shape. Additional transformations are much simpler with these types of column definitions.
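As a sketch, a hypothetical status attribute observed with four shapes would yield four typed columns following the same naming pattern:

```sql
-- One typed column per shape the attribute has ever taken.
SELECT
  IFF(TYPEOF(doc:status) = 'VARCHAR', doc:status::VARCHAR, NULL) AS status_str,
  IFF(TYPEOF(doc:status) = 'INTEGER', doc:status::NUMBER,  NULL) AS status_num,
  IFF(TYPEOF(doc:status) = 'BOOLEAN', doc:status::BOOLEAN, NULL) AS status_bool,
  IFF(TYPEOF(doc:status) = 'OBJECT',  doc:status,          NULL) AS status_object
FROM raw_devices;
```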
Monitoring and automation
Pancake can monitor and track changes for each configured data source. When a change is detected, the user can configure it and regenerate the SQL code necessary to update the Dynamic Table definitions through a versioning system. This monitoring capability eliminates the issues caused when teams are not alerted to upstream schema changes, ensuring that data processing remains uninterrupted and accurate and enhancing operational efficiency.
By monitoring the raw data source through a Snowflake stream, Pancake can report any data changes to the raw data sources, allowing for additional user configuration and SQL code generation. You will be able to set up monitoring and generation for multiple environments such as dev, test, staging, production, etc.
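A minimal sketch of how such stream-based monitoring might be wired, with hypothetical stream, table, and metadata names:

```sql
-- Capture newly arrived raw rows.
CREATE OR REPLACE STREAM raw_devices_stream ON TABLE raw_devices;

-- Re-run discovery over only the new rows and surface any attribute/type
-- combination not already recorded during Step 1.
SELECT REGEXP_REPLACE(f.path, '\\[[0-9]+\\]', '[]') AS attribute_path,
       TYPEOF(f.value)                              AS observed_type
FROM raw_devices_stream s,
     LATERAL FLATTEN(INPUT => s.doc, RECURSIVE => TRUE) f
MINUS
SELECT attribute_path, observed_type
FROM discovered_metadata;   -- hypothetical metadata table from Step 1
```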
Performance optimization
We are also working on several performance optimization strategies to address the challenge of processing large volumes of complex data. Leveraging Snowflake’s advanced features, the team will ensure that Pancake discovers data with minimal latency, maximizing efficiency and providing a seamless user experience.
Why Snowflake?
Leveraging Snowflake as our platform is a strategic decision. Its robust capabilities for managing semi-structured data offered the perfect foundation for Pancake. The app leverages Snowflake’s variant data type, enabling us to dynamically identify and catalog every attribute within the nested JSON and adapt our approach to the data’s polymorphic nature. Snowflake’s new Dynamic Tables allow us to build pipelines that extract, relate, and flatten the hierarchical and polymorphic data into clearly defined structures, bringing accurate visibility into every possible shape the data source may represent and making the data accurate and complete for further transformations or analysis.
Further, the Snowflake Native App Framework and Snowflake Marketplace simplify our go-to-market approach; we do not need to worry about distribution, security, or licensing.
The other thing that’s been really helpful during our development is all of the support that we’re getting through the Powered by Snowflake Funding Program. As we’ve been building out the product, we’ve had biweekly meetings with their technical team as well as a Slack channel where we can ask ad hoc questions. This has been really helpful during our design and build phase. As we get into testing and launching, the program has also helped us find potential design partners and customers through direct introductions, co-selling motions, and co-marketing opportunities.
Our story with Snowflake is more than just about building an application; it’s a narrative of innovation, a testament to the power of technical expertise combined with a clear vision and unwavering commitment to solving the puzzles presented by complex data.
What’s next?
Updated: Pancake is now live in the Snowflake Marketplace, so you can download it into your Snowflake environment and scan your complex JSON data, perform in-depth schema analysis, and generate SQL to extract, relate, and flatten your data in Dynamic Tables. Looking ahead, we’re hard at work adding support for embedded or stringified JSON, Parquet, and Avro! Watch this space for more on that topic in a future post, and stay tuned for more insights into TDAA!’s technical odyssey and Snowflake Native App. In the meantime, reach out; we will happily chat. You can also contact us directly at jon@tdaa.ai or andrew@tdaa.ai or you can learn more at www.datapancake.com.