Any database can handle strongly-typed, or “structured” data. With structured data, every column is assigned a specific datatype (e.g., date, string, number, timestamp), and each column can only contain data that conforms to its specified datatype. This has been a key aspect of relational databases from, like, forever. But in today’s world, data comes in all shapes and sizes, including the “semi-structured” formats of JSON, XML, Avro, ORC, and Parquet. (We’ll leave the discussion of “unstructured” data to another post…)
Many organizations now depend on these “semi-structured” formats for critical business processes. The natural impedance mismatch between the “structured” world of relational databases and these “semi-structured” data formats has been a barrier to analyzing this data. It’s the proverbial “square peg in a round hole” problem. The solution to date has been to build brittle and complex transformation processes that “flatten” the nested semi-structured data into strongly-typed relational tables and columns.
One of the critical differentiators for Snowflake as a cloud data platform is its native support for semi-structured data, using our VARIANT datatype. A VARIANT column can store an entire semi-structured document, in its native shape, with zero transformation required. A simple dot-notation syntax allows SQL access to any value, anywhere in the document. Detailed analytics at scale against semi-structured data, without complex pre-load transformation routines, is now possible.
There are some syntactical differences between the various semi-structured data formats when working with them in Snowflake. This series of articles will focus specifically on XML. This first article starts with the basics; the next articles will work up to some very detailed examples.
The first step in working with XML is to understand how the entities and attributes are organized inside the document. Apologies if this is a bit remedial — if you’re already XML-aware, feel free to skip ahead!
An entity consists of an opening bracket `<IAmAnEntity>` and a corresponding closing bracket `</IAmAnEntity>` (note the forward slash just after the `<` in the closing bracket). Entity names are case-sensitive, and always come in pairs. It’s not valid XML if they don’t. You’ll sometimes hear “entities” called “nodes”. Same thing.
The value of an entity is all the stuff that appears between the opening and closing brackets. Sometimes, it’s a single string value, like this:
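A minimal example, with a made-up entity name:

```xml
<City>Seattle</City>
```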
Other times, it’s an entirely new set of entities.
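For instance, a hypothetical `<Customer>` entity whose value is itself a set of entities:

```xml
<Customer>
  <Name>Jane Doe</Name>
  <City>Seattle</City>
</Customer>
```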
The value of an entity has to be either a scalar value or a well-formed XML fragment itself. So this is NOT valid XML:
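Here the made-up tags overlap rather than nest, so the fragment inside `<Customer>` is not well-formed:

```xml
<Customer><Name>Jane Doe</Customer></Name>
```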
Attributes, when they are used, will appear inside the opening bracket of the entity, and will consist of a name, an equals sign, and a double-quoted string value, like this:
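A hypothetical entity with one attribute:

```xml
<Customer id="12345">Jane Doe</Customer>
```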
So a single entity can have both a value (either scalar or nested XML fragment), AND one or more attributes.
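Putting it all together (names are illustrative), an entity with two attributes and a nested XML fragment as its value:

```xml
<Customer id="12345" status="active">
  <Name>Jane Doe</Name>
  <City>Seattle</City>
</Customer>
```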
Loading XML data into Snowflake
The first step in doing analysis against XML data files is to get that data into Snowflake. There’s a little setup required, so let’s jump into that. Step 1: create a table with a VARIANT column. I’m assuming you have the SYSADMIN role, or at least a role that has CREATE DATABASE and CREATE TABLE privileges.
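A minimal setup sketch. The database name and the XMLDOC column name are illustrative assumptions; only the AdventureWorks table name is fixed by the rest of this post.

```sql
-- Assumes the SYSADMIN role; database and column names are illustrative.
USE ROLE SYSADMIN;

CREATE DATABASE IF NOT EXISTS ADVENTUREWORKS_DB;
USE DATABASE ADVENTUREWORKS_DB;

-- One VARIANT column holds each XML document (or subnode).
CREATE OR REPLACE TABLE ADVENTUREWORKS (
  XMLDOC VARIANT
);
```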
Step 2: get some XML. For the purposes of this blog, we’ll be working with the Microsoft AdventureWorks sample XML file. Copy that XML and save it as a file on your local filesystem. Call the file AdventureWorks.xml.
Step 3: stage the file into Snowflake. This will use the SnowSQL command-line utility. If you don’t have that installed, do that now. Open a command-line connection to your Snowflake account, and PUT the AdventureWorks.xml file into the AdventureWorks table stage.
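From your SnowSQL session, the PUT might look like this. The local file path is a placeholder for wherever you saved the file; `@%ADVENTUREWORKS` is the table stage that Snowflake creates automatically for the ADVENTUREWORKS table.

```sql
-- Upload the local XML file to the table stage.
PUT file:///tmp/AdventureWorks.xml @%ADVENTUREWORKS;

-- Confirm the file is staged:
LIST @%ADVENTUREWORKS;
```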
Step 4: load the staged XML file into the AdventureWorks table. This uses the COPY command, and you can either do this from your open SnowSQL session, or open the Snowflake UI. Before you do that, there’s a critical decision that must be made: does the outermost element of the XML contain important data or not? In most cases, like SOAP messages, the outermost node is just a “container”, and has no usable data of its own. The Snowflake parameter that controls this with XML data is strip_outer_element. When this is false, the entire XML document gets loaded into a single row. When it’s true, that outermost node is discarded, and each “subnode” is loaded into its own row of the table.
Let’s look at an example. The next two statements load the staged XML file into the table, keeping the outermost node intact. I’m running this in the Web UI, so that I can see the output of the SELECT statement.
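A sketch of those two statements, assuming the ADVENTUREWORKS table and its table stage (strip_outer_element defaults to false, but it’s spelled out here for clarity):

```sql
COPY INTO ADVENTUREWORKS
  FROM @%ADVENTUREWORKS
  FILE_FORMAT = (TYPE = XML STRIP_OUTER_ELEMENT = FALSE);

SELECT * FROM ADVENTUREWORKS;
```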
Look how the <Root> node is still there, and the entire document was loaded into a single row. That may or may not be what you’re looking for.
Now look what happens when we add strip_outer_element=TRUE.
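Something like this, truncating first so we reload the same staged file cleanly (table and stage names as before):

```sql
TRUNCATE TABLE ADVENTUREWORKS;  -- clear the earlier single-row load

COPY INTO ADVENTUREWORKS
  FROM @%ADVENTUREWORKS
  FILE_FORMAT = (TYPE = XML STRIP_OUTER_ELEMENT = TRUE);

SELECT * FROM ADVENTUREWORKS;
```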
This strips out the useless <Root> node, and loads each of its immediate child nodes as separate rows in the AdventureWorks table. We have one row for the <Customers> structure, and one row for the <Orders> structure. This will be much nicer to work with.
This post just covered the basics of getting XML data loaded into Snowflake. The next few blog posts will use this as the basis for the SQL statements against the AdventureWorks table, using the GET() and XMLGET() functions. Stay tuned.
Let it Snow!