Working with XML in Snowflake (Part II)

Introduction

In Part I of this series on working with XML data in Snowflake, I gave a little refresher on the basics of XML syntax, and covered the basics of semi-structured data in Snowflake. I showed how to create a table with a VARIANT column, and how to stage an XML file for ingestion. We then looked at two flavors of the COPY statement which loads the file into the Snowflake table, and showed the effect of changing the value of the strip_outer_element parameter.

In this installment, I’ll show some SQL statements you can use to get data out of the XML in a variant column and introduce the GET()and XMLGET() functions. We’re going to need some XML that’s a bit more complex than the AdventureWorks sample from part 1. I used a real-world sample from an Insurance Claims application, scrubbed and simplified. Trust me — the actual file is waaay more complex than this!

To get started, create a table with a single VARIANT column, and load the sample XML.

The GET() function

Unfortunately, Snowflake doesn’t support the simple dot notation syntax to reference XML in a variant column. You extract data with a combination of GET() and XMLGET() function calls. Let’s look at the GET() function first.

The GET() function is used to pluck a named value or attribute out of a variant or object column. GET() takes two arguments, a variant and a string expression. The expression argument can be one of three values:

  1. A single-quoted dollar sign: '$'. This retrieves the value of whatever the variant column represents. This can be a scalar value or another entire variant structure.
  2. A single-quoted @ sign: '@'. This retrieve the name of whatever the variant column represents. This will only return a scalar value.
  3. A node name preceded with an @ sign. This retrieves the value of the named attribute within the XML node.

Here’s a very simple example that retrieves the nodename of the outermost node in the XML. There’s only one, so this returns a single row value. In this query, xmldata references the entire XML document.

Notice that SoftwareVersion is an attribute of the <Estimate> node. We can get that value by passing @SoftwareVersion as the parameter.

Introducing XMLGET()

XMLGET() is a method that grabs an entire subnode from an XML document. The return value is another VARIANT column, so you can pipe that to a GET() method to retrieve the actual data values from that level. Here’s a quick example.

The XMLGET() functions return the named subnodes ( DBDate and TransactionType) as variants, and then the GET() function pulls the values out of those. DBDate and TransactionType are two of the child subnodes of Estimate .

There’s a shortcut syntax for the same exact SQL statement: append colon-dollar sign (in double quotes this time) after the XMLGET()call, like this :"$"

Nesting the XMLGET() calls

Life would be great if all the nodes in an XML document were 1st-level subnodes from the root. But in the real world, XML documents can be deeply nested and highly structured. Critical data can be buried several levels deep. The good news is, Snowflake can handle it! The secret is to “nest” the XMLGET() calls, and walk down the hierarchy step-by-step. If you’re familiar with XPath, it’s fairly similar.

Take a look at the <Policy> node. It is a subnode of <Administrative>, and has 4 subnodes of its own. If we want the data from those four subnodes as separate columns, we nest the XMLGET() calls.

Each XMLGET() call returns a new VARIANT object. XMLGET(xmldata, 'Administrative') returns the entire <Administrative> node, which is passed to the next call, which returns the <Policy> node. And the final calls get the value from the lowest level “leaf” nodes with the :”$" syntax. And each of these values can be individually cast to specific data types.

Now, I’m the first to admit — that’s a lot of typing. It gets far worse when the XML is more deeply nested, and there are LOTS of subnodes. You can easily lose track of the nested calls. Here’s one technique to alleviate the volume of SQL code: Common Table Expressions, or CTE’s. You can define a CTE that represents a specific subnode, and then just reference that CTE in the SELECT list. Here’s an example:

The CTE named POLICY represents the entire <Policy> node in the XML. We’ve named that object XML here, but that name is not really relevant — you can name it anything you want. The CTE and the variant name is referenced in the lower SELECT statement instead of all that repetitive syntax. Much cleaner!

Let’s add a third node into the equation here. We’ll grab some of the values out of the <ClaimInformation> node, which is just after the <Policy> node.

Look at the InspectionDate expression on line 20. It still uses the nested XMLGET() approach but starts walking the hierarchy at the ClaimInformation node. Think of the CTE as a “shortcut” to a commonly used parent node.

Repeating XML arrays

Look lower in the XML structure, and you’ll see the <Options> node. That is an array of <VehicleOption> nodes. In the next installment of this series, I introduce the LATERAL keyword and the FLATTEN() function, which we’ll use to pivot that array into a table structure.

Let it Snow!!

--

--