In Part I of this series on working with XML data in Snowflake, I gave a little refresher on the basics of XML syntax, and covered the basics of semi-structured data in Snowflake. I showed how to create a table with a
VARIANT column, and how to stage an XML file for ingestion. We then looked at two flavors of the COPY statement which loads the file into the Snowflake table, and showed the effect of changing the value of the strip_outer_element parameter.
In this installment, I’ll show some SQL statements you can use to get data out of the XML in a variant column and introduce the
XMLGET() functions. We’re going to need some XML that’s a bit more complex than the AdventureWorks sample from part 1. I used a real-world sample from an Insurance Claims application, scrubbed and simplified. Trust me — the actual file is waaay more complex than this!
To get started, create a table with a single VARIANT column, and load the sample XML.
The GET() function
Unfortunately, Snowflake doesn’t support the simple dot notation syntax to reference XML in a variant column. You extract data with a combination of
XMLGET() function calls. Let’s look at the
GET() function first.
The GET() function is used to pluck a named value or attribute out of a variant or object column.
GET() takes two arguments, a variant and a string expression. The expression argument can be one of three values:
- A single-quoted dollar sign:
'$'. This retrieves the value of whatever the variant column represents. This can be a scalar value or another entire variant structure.
- A single-quoted @ sign:
'@'. This retrieve the name of whatever the variant column represents. This will only return a scalar value.
- A node name preceded with an @ sign. This retrieves the value of the named attribute within the XML node.
Here’s a very simple example that retrieves the nodename of the outermost node in the XML. There’s only one, so this returns a single row value. In this query,
xmldata references the entire XML document.
SoftwareVersion is an attribute of the
<Estimate> node. We can get that value by passing
@SoftwareVersion as the parameter.
XMLGET() is a method that grabs an entire subnode from an XML document. The return value is another VARIANT column, so you can pipe that to a
GET() method to retrieve the actual data values from that level. Here’s a quick example.
XMLGET() functions return the named subnodes (
TransactionType) as variants, and then the
GET() function pulls the values out of those.
TransactionType are two of the child subnodes of
There’s a shortcut syntax for the same exact SQL statement: append colon-dollar sign (in double quotes this time) after the
XMLGET()call, like this
Nesting the XMLGET() calls
Life would be great if all the nodes in an XML document were 1st-level subnodes from the root. But in the real world, XML documents can be deeply nested and highly structured. Critical data can be buried several levels deep. The good news is, Snowflake can handle it! The secret is to “nest” the
XMLGET() calls, and walk down the hierarchy step-by-step. If you’re familiar with XPath, it’s fairly similar.
Take a look at the
<Policy> node. It is a subnode of
<Administrative>, and has 4 subnodes of its own. If we want the data from those four subnodes as separate columns, we nest the
XMLGET() call returns a new VARIANT object.
XMLGET(xmldata, 'Administrative') returns the entire
<Administrative> node, which is passed to the next call, which returns the
<Policy> node. And the final calls get the value from the lowest level “leaf” nodes with the
:”$" syntax. And each of these values can be individually cast to specific data types.
Now, I’m the first to admit — that’s a lot of typing. It gets far worse when the XML is more deeply nested, and there are LOTS of subnodes. You can easily lose track of the nested calls. Here’s one technique to alleviate the volume of SQL code: Common Table Expressions, or CTE’s. You can define a CTE that represents a specific subnode, and then just reference that CTE in the SELECT list. Here’s an example:
The CTE named
POLICY represents the entire
<Policy> node in the XML. We’ve named that object
XML here, but that name is not really relevant — you can name it anything you want. The CTE and the variant name is referenced in the lower SELECT statement instead of all that repetitive syntax. Much cleaner!
Let’s add a third node into the equation here. We’ll grab some of the values out of the
<ClaimInformation> node, which is just after the
Look at the
InspectionDate expression on line 20. It still uses the nested
XMLGET() approach but starts walking the hierarchy at the
ClaimInformation node. Think of the CTE as a “shortcut” to a commonly used parent node.
Repeating XML arrays
Look lower in the XML structure, and you’ll see the
<Options> node. That is an array of
<VehicleOption> nodes. In the next installment of this series, I introduce the LATERAL keyword and the FLATTEN() function, which we’ll use to pivot that array into a table structure.
Let it Snow!!