The saga continues…
In Part I of this series, we met our initial cast of characters. The VARIANT datatype, the COPY command, and how to use the
strip_outer_array parameter when loading XML content into a Snowflake table.
In Part II, the heroes of the story emerged: the SQL functions
XMLGET(). We showed how to use these functions to write standard SQL queries against an XML document stored in a variant column, and even how to reach into deeply nested XML elements.
In Part III, we met the villain: the repeating XML element array! We brought out the big guns — the
LATERAL join and the
FLATTEN() function and they quickly dispatched this baddie. But we had a feeling it would return…
In this final chapter, the repeating array is back — but it’s not alone! It’s brought a few of its friends to cause mayhem and confusion... How will we ever deal with multiple repeating arrays in a single XML document?
OK — that’s a little melodramatic, but I think we’ve set the stage appropriately. Let’s dive in.
Chapters II and III used an example from the Insurance industry, so we’ll continue with that document here, but I’ll need to add a second repeating array. The original document contains a
<Details> node that contains all the
<DetailLine> nodes from the specific claim. That looks like this:
<Details> array starting on line 65 and ending on line 130? There are two
<DetailLine> elements in that array, and each of those has one
<Part> node and two
<Labor> nodes. We’ll be showing how to access data from both the
<Options> and the
<Details> arrays. But first, we need to backtrack a bit and explore the basic difference between relational SQL and hierarchical XML.
The Hierarchical vs. Relational Impedance
On the surface, you’d think that mapping an XML document to a set of relational tables would be simple. An XML “entity” would correspond to a database “table”, and the subnodes or attributes of that entity would become the table’s columns. Subnodes that are repeating arrays would be represented as “child” tables in the database model and would have a primary/foreign key relationship to the “parent” table.
It’s the definition of relationships between these entities that present the challenge. Relational databases are relational because you define “relationships” between two tables based on one or more common columns. These relationships can be used for JOIN operations when writing SQL. XML and other semi-structured formats are hierarchical in nature, meaning that the relationship between a “parent” node and its “child” nodes is defined purely by their positions within the document. An XML element is a child of another because it is physically located “within” the start and end tags of its parent entity. There are no common elements or attributes in the child to join on.
The answer is to replicate, or denormalize, attributes from the parent element down into the definition of the child “table”. In the example above, we’re going to create a table named
Claim_Details that contains data from the
<Details> array, and a table named
Claim_Options that flattens the
<Options> element. Each of those will carry identifying elements (e.g., keys) from their parent elements.
Avoiding Cartesian Products
We’ll build on the techniques using
LATERAL joins that I presented in Part III. However, if we try to build a single query that flattens both arrays, we’ll get the dreaded Cartesian Product! In the example above, there are four “rows” (i.e., subnodes) in the
<Options> array and two rows in the
<Details> array. So the following query returns a total of 8 rows.
That’s clearly not the result we’re after. Imagine if this document had more arrays in it! Our Cartesian Product problem would grow exponentially…
Views to the Rescue!
The easiest way around this is to construct a view of each “path” into the XML. There will be a view that represents the
<Options> array and one for the
<Details> array. But it’s a little more complex than just sticking a
CREATE VIEW in front of each section of SQL. This is where we morph the hierarchical structure of XML into a relational model.
- Figure out what the identifiers are for the parent nodes. Which field(s) represent the “primary key” of each level of the hierarchy above each array? In this example, let’s assume that
<RepairOrderID>is the unique identifier of each
- If you’re working with deeply nested arrays (i.e., GrandParent -> Parent
-> Child), you’ll need a unique key for the intermediate Parent level too.
- Define these fields into the SELECT list of each view definition. These will function as the “foreign keys” for the child tables.
- Feel free to add as many additional non-key fields as you like into the view definition. This is the standard “denormalization” technique, and they’ll save you a lot of extra
XMLGET()calls in your downstream SQL.
Here’s the Options view:
It has the
RepairOrderID from the root node as the primary key. I also added the
Vin field from the
Vehicle node as an example of a non-key attribute in the SELECT list. I don’t think they’d be repairing two different cars on the same repair order…
Here’s the view definition for the
A couple of interesting things going on in here.
- Line 4: repeating the RepairOrderID from the root node.
- Lines 6–8: retrieving values from the nodes below
<DetailLine>by name. Nothing we haven’t seen before.
- Lines 10–12: the
<Part>node is nested below
<DetailLine>so we have to nest the
XMLGET()calls to retrieve those values.
- Lines 14–19: now it gets interesting. The
<Labor>nodes are a repeating group, but they’re not nicely grouped under a single parent node, like
<Options>an element. They’re siblings of every other element under
<DetailLine>. Bad XML hygiene, but that’s just my opinion… This is going to force us to play some tricks in the FROM and WHERE clauses.
- Line 23: flattening the
<Details>array, as expected.
- Line 24: here’s the trick. We can’t use
XMLGET()here, since there’s no “parent” level element above the two
<Labor>nodes that we can flatten. Instead, we use a call to
GET()that returns a reference to an array that contains the immediate children of
<DetailLine>— all eight of them. But we only want the two
- Line 25: the WHERE clause here filters out all the unwanted children of, leaving only the
<Labor>elements, which are referenced with
XMLGET()calls on lines 14–19. (As an experiment, remove the WHERE clause and see what happens to the result).
If the XML had been written with a parent “array” node, say
<LaborDetails> or something, that wrapped the repeating
<Labor> elements, then the SQL would have been able to
XMLGET() that parent node, and wouldn’t have needed a filter in the WHERE clause.
When presented with an XML document that contains multiple unrelated repeating arrays, I recommend using Snowflake views to present them to your users as separate and distinct entities using Snowflake views. Use the standard techniques of walking down the element hierarchy with
XMLGET() calls and using a lateral join to flatten the repeating array. These views can then be joined on their common columns just like any other relational table.
Support for XML in Snowflake is admittedly a little less developed than the other semi-structured formats like JSON and Parquet. I’ve outlined a couple of techniques that I’ve found to help simplify, but don’t let those complexities
Let it Snow!