Working with XML in Snowflake: Part III
How to navigate repeating groups in XML structures with the FLATTEN function and LATERAL join method.
This is the third in a series of posts that cover tips and techniques for working with XML-based data in Snowflake. Part 1 of the series was a general introduction to Snowflake’s support for semi-structured data formats, and how to use the
COPY statement to load XML files into a Snowflake
VARIANT column. Part 2 of the series introduced the SQL syntax for retrieving data out of a variant column containing XML, using the
XMLGET() functions. I showed how to walk the XML node hierarchy by “nesting” the
In this installment, I’ll show how to navigate repeating groups in XML structures. We’ll cover the
FLATTEN() function that transforms arrays into a table structure, and the
LATERAL join method for adding those flattened arrays into the
We’ll use the same XML doc that was presented in part 2 — the Insurance Claim document. If you’ve read part 2, then you should already have the Claim table defined and loaded with that XML. If not, here’s that code snippet:
Look at the XML starting at line 46. There’s a node named
<Options> and it contains a repeating group of
<VehicleOption> nodes. The actual data values are in the
<OptionCode> and the
<OptionDescription> nodes. What we want to do is write an SQL query that shows the entire list of OptionCode and OptionDescription values for this Claim.
The FLATTEN function
FLATTEN() is quite a versatile function, and can be used in pretty much any situation where you have a list of compound values and need an in-line relational representation of that data. The Snowflake docs show how it can be used to transform semi-structured JSON data into a table structure, but that isn’t really helpful when you’re working with XML...
flatten() with XML is actually pretty simple. The only argument it needs is a reference to a variant that contains the “parent” node of the repeating array. Part 2 showed you how to get a reference to any node in the XML document using the
XMLGET() function. The tricky part is finding the right parent node to flatten!
The authors of this XML document made that job very easy by creating some well-structured XML. They have a “wrapper” node called
<Options>, and put the repeating array underneath that level. That’s the one we want to flatten. The syntax for that call looks like this:
FLATTEN(XMLGET(XMLGET(xmldata, 'Vehicle'), 'Options'):"$")
This uses the nested
XMLGET() technique that we covered in part 2. The innermost call gets a reference to the
<Vehicle> node, which is used to get a reference to its immediate child node,
<Options>. And the
:"$" shortcut syntax saves us a
GET() call to retrieve the value of the node, which is the reference to the entire repeating array. The output of the flatten() function is a tabular structure containing 6 fixed columns:
- SEQ: a unique sequence number associated with the input record
- KEY: the key to the exploded value
- PATH: the path to the element within a data structure
- INDEX: the index of the element, if it is an array; otherwise NULL
- VALUE: the value of the element of the flattened array/object
- THIS: the element being flattened
Let’s see what that’s all about.
The LATERAL keyword
Now that we have a repeating array flattened into a relational table structure, you’d think we could just reference it in the
FROM clause like any other table. That’s not exactly true here… We need to perform a lateral join to that expression. Adding the
LATERAL keyword in the from clause tells Snowflake to create an inline view, generated from each row in the “outer” table of the query. So for each row in CLAIM, perform the XMLGET() calls and get a reference to the
<Options> array. There’s only one row in our CLAIM table, but you could envision a table full of individual claim records, each containing a full XML document for that claim.
The columns in that flattened table structure are the six listed above. This SQL reveals the structure created by the flatten() function.
We see 32 rows (index  to ) which matches the content of the XML, and the
VALUE column is a variant that contains the XML structure of each
<VehicleOptions> node. The
VALUE column is really the only reference we’ll ever need going forward. We’ll use that to extract the data values out of each subnode. That query looks like this:
A couple of things to note here:
- The Flatten() expression has been given an alias:
- The expressions in the SELECT list reference that alias and its
valuecolumn, which was created by the
FLATTEN()call. That’s a variant, and we can pass it to
XMLGET()to retrieve the data values from its subnodes.
When the XML is not so nice
We got lucky. The author of this XML used a nice “wrapper” node named
<Options> to encapsulate the array of
<VehicleOptions>. Not every XML document is going to cooperate like this. Remember that you “flatten” the parent node to reference the array of its children. When that wrapper node doesn’t exist, the repeating array nodes are commingled with other non-array nodes. I simplified the original CLAIM table syntax to illustrate that concept:
See how the
<VehicleOption> nodes starting on line 31 are siblings of the other nodes underneath
<Vehicle>? That’s ugly, but perfectly valid XML syntax! So how do we deal with that? The answer is to use the WHERE clause to filter out the nodes we don’t care about, like so:
- The FLATTEN() expression in the FROM clause no longer references the
<Options>node, because it doesn’t exist in the XML.
<VehicleOption>is now a direct child of
<Vehicle>. When we flatten that parent node, we’ll get references to all of its immediate children, including
<LicensePlateNumber>, and so on. We don’t want those!
- The WHERE clause now contains a filter that retrieves the name of the child nodes (using the
'@'argument to the GET() call), casts it as a string, and filters out the node names we don’t want. Remove that WHERE clause and see what happens…
XML documents can (and often do) have more than one repeating array in them. Don’t try to LATERAL FLATTEN() more than one nested array in a single query, unless they’re in the same XML path. These are “cross-joins”, otherwise known as Cartesian Products, which are probably not what you’re looking to achieve. In the next installment, we’ll go even deeper and show how to work with XML that contains multiple nested arrays, without the dreaded Cartesian Product.
Let it Snow!!