Loading XML Files into Snowflake with Snowpark Python

What is the difference between XML elements, attributes, and values anyway?

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

8 min readSep 1, 2023

Over my entire career working with data platforms, I’ve probably loaded XML files more than any other kind, and I never enjoyed it. My good friend Bruno Borges loves XML and hates YAML, and you can see him poking the bear (I think he’s playing a character honestly) on his Twitter (or X) feed. I don’t think anyone reading this is interested in the history of XML, nor do I care to research it, so let’s just skip that. XML’s use as a configuration format has mostly been replaced by YAML, HCL, and TOML, and JSON and Parquet have superseded it as a data storage/transfer format. But legacy processes exist, and most of the 30 or so (if memory serves) cloud migrations that I’ve led or worked on involved some XML.

Am I saying that XML is like the Mafia? Yes… Yes, I am.

In my last two articles, I first described the Snowpark Python CLI I created for loading the TPC-DI benchmark dataset into Snowflake and then dove deep into fixed-width, multi-format file types.

Loading the TPC-DI Benchmark Dataset into Snowflake

Prebuilt Python Snowpark Loader with dbt Models

medium.com

Loading Fixed-Width, Multi-Format Files with Python Snowpark

I know these are less common than they used to be, but trust me: they’re out there.

medium.com

In this article, I want to focus on using Snowpark Python to load XML files into Snowflake.

The Customer Management File

There’s only one XML file in the TPC-DI benchmark, the CustomerMgmt.xml file, and it’s meant to represent a feed from a CRM system. It has attributes and nested elements, and only leaf nodes have discreet values. The outer-most element is called TPCDI:Actions which is really like a table name. It describes the records in the file, and as you’ll see when we load it, we strip it out completely so that Snowflake doesn’t think the entire file is a single record with nested objects. Each “record” in the file is of type TPCDI:Action and has one nested element called Customer which in turn has several nested elements like Name, Address, ContactInfo, etc. Below is an example of the CustomerMgmt.xml file with only a single record in it:

<?xml version="1.0" encoding="UTF-8"?>
<TPCDI:Actions xmlns:TPCDI="http://www.tpc.org/tpc-di">
 <TPCDI:Action ActionType="NEW" ActionTS="2007-07-07T02:56:25">
  <Customer C_ID="0" C_TAX_ID="923-54-6498" C_GNDR="F" C_TIER="3" C_DOB="1940-12-02">
   <Name>
    <C_L_NAME>Joannis</C_L_NAME>
    <C_F_NAME>Adara</C_F_NAME>
    <C_M_NAME/>
   </Name>
   <Address>
    <C_ADLINE1>4779 Weller Way</C_ADLINE1>
    <C_ADLINE2/>
    <C_ZIPCODE>92624</C_ZIPCODE>
    <C_CITY>Columbus</C_CITY>
    <C_STATE_PROV>Ontario</C_STATE_PROV>
    <C_CTRY>Canada</C_CTRY>
   </Address>
   <ContactInfo>
    <C_PRIM_EMAIL>Adara.Joannis@moose-mail.com</C_PRIM_EMAIL>
    <C_ALT_EMAIL>Adara.Joannis@gmx.com</C_ALT_EMAIL>
    <C_PHONE_1>
     <C_CTRY_CODE>1</C_CTRY_CODE>
     <C_AREA_CODE>872</C_AREA_CODE>
     <C_LOCAL>523-8928</C_LOCAL>
     <C_EXT/>
    </C_PHONE_1>
    <C_PHONE_2>
     <C_CTRY_CODE/>
     <C_AREA_CODE/>
     <C_LOCAL>492-3961</C_LOCAL>
     <C_EXT/>
    </C_PHONE_2>
    <C_PHONE_3>
     <C_CTRY_CODE/>
     <C_AREA_CODE/>
     <C_LOCAL/>
     <C_EXT/>
    </C_PHONE_3>
   </ContactInfo>
   <TaxInfo>
    <C_LCL_TX_ID>CA3</C_LCL_TX_ID>
    <C_NAT_TX_ID>YT3</C_NAT_TX_ID>
   </TaxInfo>
   <Account CA_ID="0" CA_TAX_ST="1">
    <CA_B_ID>17713</CA_B_ID>
    <CA_NAME>CJlmMuFyibKOmKLHIaTeWugvCgZdmcfpDsYb</CA_NAME>
   </Account>
  </Customer>
 </TPCDI:Action>
</TPCDI:Actions>

The hierarchical representation of a TPCDI:Action record, with @ signifying a node attribute as opposed to an element, is shown below:

|-- TPCDI:Action
    |-- @ActionType: string
    |-- @ActionTS: timestamp
    |-- Customer
        |-- @C_ID: number
        |-- @C_TAX_ID: string
        |-- @C_GNDR: string
        |-- @C_TIER: number
        |-- @C_DOB: date
        |-- Name
            |-- C_F_NAME: string
            |-- C_L_NAME: string
            |-- C_M_NAME: string
        |-- Address
            |-- C_ADLINE1: string
            |-- C_ADLINE2: string
            |-- C_CITY: string
            |-- C_CTRY: string
            |-- C_STATE_PROV: string
            |-- C_ZIPCODE: string
        |-- ContactInfo
            |-- C_ALT_EMAIL: string
            |-- C_PHONE_1
                |-- C_AREA_CODE: number
                |-- C_CTRY_CODE: number
                |-- C_EXT: long
                |-- C_LOCAL: string
            |-- C_PHONE_2
                |-- C_AREA_CODE: number
                |-- C_CTRY_CODE: number
                |-- C_EXT: number
                |-- C_LOCAL: string
            |-- C_PHONE_3
                |-- C_AREA_CODE: number
                |-- C_CTRY_CODE: number
                |-- C_EXT: number
                |-- C_LOCAL: string
        |-- TaxInfo
            |-- C_LCL_TX_ID: string
            |-- C_NAT_TX_ID: string
        |-- Account
            |-- CA_B_ID: number
            |-- CA_NAME: string
            |-- @CA_ID: number
            |-- @CA_TAX_ST: number

Why do XML designers use attributes as opposed to elements? Honestly, there is no hard and fast rule, and similar decisions must be made with NoSQL and document database design. Fortunately, Snowpark makes it easy to get these objects regardless of how they were originally modeled in the XML file.

Uploading XML Files to a Stage

As mentioned in the previous articles, I used a credentials.json file to store my credentials, and it looks like this:

{
    "account": "myaccount",
    "user": "myuser",
    "password": "mypassword",
    "role": "myrole",
    "warehouse": "stewart_dev",
    "database": "tpc_di",
    "schema": "digen"
}

The code samples below are simplified snippets from the tpcdi.py loader in the GitHub repository with abstractions removed so they are easier to follow. We first make our connection, then glob the CustomerMgmt.xml file, and then put it in the stage:

import sys, json
from snowflake.snowpark import Session
from snowflake.snowpark.types import *
from snowflake.snowpark.functions import *
from pathlib import Path

# Read the credentials.json file
with open("credentials.json") as jsonfile:
    credentials_dict = json.load(jsonfile)

# build the session
session = (
    Session
    .builder
    .configs(credentials_dict)
    .create()
)

# glob the file
pathlist = (
    Path('/Users/stewartbryson/dev/tpcdi-output/Batch1')
    .glob("CustomerMgmt.xml")
)
stage_path = "@tpcdi/Batch1/"

for file in pathlist:
    # put the file(s) in the stage
    put_result = (
        session 
        .file
        .put(
            str(file), 
            stage_path,
            parallel=4, 
            auto_compress=True,
            overwrite=False
        )
    )
    for result in put_result:
        print(f"File {result.source}: {result.status}")

File CustomerMgmt.xml: UPLOADED

Reading an XML File into a Snowpark DataFrame

Let’s use the xml file reader to load the CustomerMgmt.xml file into a DataFrame and show it:

# Read the XML file into a DataFrame and show it
df = (
    session
    .read
    .option('STRIP_OUTER_ELEMENT', True) # Strips TPCDI:Actions
    .xml(f"{stage_path}CustomerMgmt.xml")
    .show(1, 100)
)

Notice that we passed the STRIP_OUTER_ELEMENT option to remove the outer TPCDI:Actions node so that Snowflake doesn’t consider the entire file as one nested record. We’re also showing only a single row so that the output is agreeable:

-----------------------------------------------------------------------------------------
|"$1"                                                                                   |
-----------------------------------------------------------------------------------------
|<TPCDI:Action ActionTS="2007-07-07T02:56:25" ActionType="NEW">                         |
|  <Customer C_DOB="1940-12-02" C_GNDR="F" C_ID="0" C_TAX_ID="923-54-6498" C_TIER="3">  |
|    <Name>                                                                             |
|      <C_L_NAME>Joannis</C_L_NAME>                                                     |
|      <C_F_NAME>Adara</C_F_NAME>                                                       |
|      <C_M_NAME></C_M_NAME>                                                            |
|    </Name>                                                                            |
|    <Address>                                                                          |
|      <C_ADLINE1>4779 Weller Way</C_ADLINE1>                                           |
|      <C_ADLINE2></C_ADLINE2>                                                          |
|      <C_ZIPCODE>92624</C_ZIPCODE>                                                     |
|      <C_CITY>Columbus</C_CITY>                                                        |
|      <C_STATE_PROV>Ontario</C_STATE_PROV>                                             |
|      <C_CTRY>Canada</C_CTRY>                                                          |
|    </Address>                                                                         |
|    <ContactInfo>                                                                      |
|      <C_PRIM_EMAIL>Adara.Joannis@moose-mail.com</C_PRIM_EMAIL>                        |
|      <C_ALT_EMAIL>Adara.Joannis@gmx.com</C_ALT_EMAIL>                                 |
|      <C_PHONE_1>                                                                      |
|        <C_CTRY_CODE>1</C_CTRY_CODE>                                                   |
|        <C_AREA_CODE>872</C_AREA_CODE>                                                 |
|        <C_LOCAL>523-8928</C_LOCAL>                                                    |
|        <C_EXT></C_EXT>                                                                |
|      </C_PHONE_1>                                                                     |
|      <C_PHONE_2>                                                                      |
|        <C_CTRY_CODE></C_CTRY_CODE>                                                    |
|        <C_AREA_CODE></C_AREA_CODE>                                                    |
|        <C_LOCAL>492-3961</C_LOCAL>                                                    |
|        <C_EXT></C_EXT>                                                                |
|      </C_PHONE_2>                                                                     |
|      <C_PHONE_3>                                                                      |
|        <C_CTRY_CODE></C_CTRY_CODE>                                                    |
|        <C_AREA_CODE></C_AREA_CODE>                                                    |
|        <C_LOCAL></C_LOCAL>                                                            |
|        <C_EXT></C_EXT>                                                                |
|      </C_PHONE_3>                                                                     |
|    </ContactInfo>                                                                     |
|    <TaxInfo>                                                                          |
|      <C_LCL_TX_ID>CA3</C_LCL_TX_ID>                                                   |
|      <C_NAT_TX_ID>YT3</C_NAT_TX_ID>                                                   |
|    </TaxInfo>                                                                         |
|    <Account CA_ID="0" CA_TAX_ST="1">                                                  |
|      <CA_B_ID>17713</CA_B_ID>                                                         |
|      <CA_NAME>CJlmMuFyibKOmKLHIaTeWugvCgZdmcfpDsYb</CA_NAME>                          |
|    </Account>                                                                         |
|  </Customer>                                                                          |
|</TPCDI:Action>                                                                        |
-----------------------------------------------------------------------------------------

Whether we use the COPY INTO statement in SQL or the xml file reader in Snowpark, the approach is the same: the entire XML record is expressed as a variant column with the name $1. Unfortunately, Snowflake doesn’t support simple dot notation with an XML variant the way it does with JSON (which seems silly after all these years), so instead, we use the GET() function to grab information from an XML node. GET() takes two arguments:

The variant column that contains the XML node we want to GET() from, in our case, $1.
A string expression specifying the thing we want to extract from the variant.

As Paul Horan describes in his excellent article Working with XML in Snowflake (Part II), we have three options with the string expression:

A single-quoted dollar sign '$' retrieves the value of the node.
A single-quoted @ sign '@' retrieves the name of the node.
A single-quoted attribute name preceded with @ retrieves that attribute from the node, such as '@foo'.

Notice that GET() doesn’t allow us to grab nested XML nodes, only element names, element values, and attribute values: we instead have to use XMLGET() for that. To retrieve nested element values, we are forced to pair a GET() function with an XMLGET() function, which can be quite tedious.

The great thing about using Python instead of SQL in cases like this is we can define helper methods to encapsulate the nastiness and remove toil. Our first helper method extracts the value of a nested node but also adds datatype casting and an optional auto-aliasing feature:

# Simplifies retrieving XML elements
def get_xml_element(
        column:str,
        element:str,
        datatype:str,
        with_alias:bool = True
):
    new_element = (
        get(
            xmlget(
                col(column),
                lit(element),
            ),
            lit('$')
        )
        .cast(datatype)
    )

    # alias needs to be optional
    return (
        new_element.alias(element) if with_alias else new_element
    )

The ALIAS() is optional because we may want to use GET_XML_ELEMENT() inside of another function (example below), but ALIAS() can only be used as the outermost function. Our second helper method adds similar functionality for pulling attributes from a node:

# Simplifies retrieving XML attributes
def get_xml_attribute(
        column:str,
        attribute:str,
        datatype:str,
        with_alias:bool = True
):
    new_attribute = (
        get(
            col(column),
            lit(f"@{attribute}")
        )
        .cast(datatype)
    )

    # alias needs to be optional
    return (
        new_attribute.alias(attribute) if with_alias else new_attribute
    )

Our final helper method is rather narrow in its application: it’s helpful for constructing a phone number as a single string from multiple nested elements. Because we need to construct three different phone numbers this way, it was worth the abstraction. Apologies in advance for the poor code formatting, but I lost my will on this one:

# Constructs a phone number from multiple nested fields
def get_phone_number(
        phone_id:str,
        separator:str = '-'
):
    return (
        concat (
            get_xml_element(f"phone{phone_id}", 'C_CTRY_CODE', 'STRING', False),
            when(get_xml_element(f"phone{phone_id}", 'C_CTRY_CODE', 'STRING', False) == '', '')
            .otherwise(separator),
            get_xml_element(f"phone{phone_id}", 'C_AREA_CODE', 'STRING', False),
            when(get_xml_element(f"phone{phone_id}", 'C_AREA_CODE', 'STRING', False) == '', '')
            .otherwise(separator),
            get_xml_element(f"phone{phone_id}", 'C_LOCAL', 'STRING', False),
            when(get_xml_element(f"phone{phone_id}", 'C_EXT', 'STRING', False) == '', '')
            .otherwise(" ext: "),
            get_xml_element(f"phone{phone_id}", 'C_EXT', 'STRING', False)
        )
        .alias(f"c_phone_{phone_id}")
    )

Let’s start by flattening out a few of our nested elements and grabbing a few Action attributes in our first SELECT(). We’ll follow that up with a second SELECT() that mostly uses our helper methods to construct the final DataFrame. We then save that output as a table:

table_name = 'customer_mgmt'
df = (
    session
    .read
    .option('STRIP_OUTER_ELEMENT', True) # Strips the TPCDI:Actions node
    .xml(f"{stage_path}CustomerMgmt.xml")
    .select(
        # flatten out all of the nested elements
        xmlget(col('$1'), lit('Customer'), 0).alias('customer'),
        xmlget(col('customer'), lit('Name'), 0).alias('name'),
        xmlget(col('customer'), lit('Address'), 0).alias('address'),
        xmlget(col('customer'), lit('ContactInfo'), 0).alias('contact_info'),
        xmlget(col('contact_info'), lit('C_PHONE_1')).alias('phone1'),
        xmlget(col('contact_info'), lit('C_PHONE_2')).alias('phone2'),
        xmlget(col('contact_info'), lit('C_PHONE_3')).alias('phone3'),
        xmlget(col('customer'), lit('TaxInfo'), 0).alias('tax_info'),
        xmlget(col('customer'), lit('Account'), 0).alias('account'),
        # get the Action attributes
        get_xml_attribute('$1','ActionType','STRING'),
        get_xml_attribute('$1','ActionTS','STRING'),
    )
    .select(
        # Handling Action attributes
        to_timestamp(
            col('ActionTs'),
            lit('yyyy-mm-ddThh:mi:ss')
        ).alias('action_ts'),
        col('ActionType').alias('ACTION_TYPE'),
        # Get Customer Attributes
        get_xml_attribute('customer','C_ID','NUMBER'),
        get_xml_attribute('customer','C_TAX_ID','STRING'),
        get_xml_attribute('customer','C_GNDR','STRING'),
        # Had to disable auto-aliasing
        try_cast(
            get_xml_attribute('customer','C_TIER','STRING', False),
            'NUMBER'
        ).alias('c_tier'),
        get_xml_attribute('customer','C_DOB','DATE'),
        # Get Name elements
        get_xml_element('name','C_L_NAME','STRING'),
        get_xml_element('name','C_F_NAME','STRING'),
        get_xml_element('name','C_M_NAME','STRING'),
        # Get Address elements
        get_xml_element('address','C_ADLINE1','STRING'),
        get_xml_element('address', 'C_ADLINE2', 'STRING'),
        get_xml_element('address','C_ZIPCODE','STRING'),
        get_xml_element('address','C_CITY','STRING'),
        get_xml_element('address','C_STATE_PROV','STRING'),
        get_xml_element('address','C_CTRY','STRING'),
        # Get Contact Info elements
        get_xml_element('contact_info','C_PRIM_EMAIL','STRING'),
        get_xml_element('contact_info','C_ALT_EMAIL','STRING'),
        # Contruct phone numbers from multi-nested elements
        get_phone_number('1'),
        get_phone_number('2'),
        get_phone_number('3'),
        # Get TaxInfo elements
        get_xml_element('tax_info','C_LCL_TX_ID','STRING'),
        get_xml_element('tax_info','C_NAT_TX_ID','STRING'),
        # Get Account Attributes
        get_xml_attribute('account','CA_ID','STRING'),
        get_xml_attribute('account','CA_TAX_ST','NUMBER'),
        # Get Account elements
        get_xml_element('account','CA_B_ID','NUMBER'),
        get_xml_element('account','CA_NAME','STRING'),
    )
    .write
    .mode("overwrite")
    .save_as_table(table_name)
)

print(f"{table_name.upper()} table created.")

File CustomerMgmt.xml: SKIPPED
CUSTOMER_MGMT table created.

With our new table created, we can construct a new DataFrame and select a few of the interesting columns to demonstrate the result:

df = (
    session
    .table('customer_mgmt')
    .select(
        col('action_ts'),
        col('c_id'),
        col('c_tier'),
        col('c_phone_1')
    )
    .show()
)

---------------------------------------------------------------------
|"ACTION_TS"          |"C_ID"  |"C_TIER"  |"C_PHONE_1"              |
---------------------------------------------------------------------
|2007-07-07 02:56:25  |0       |3         |1-872-523-8928           |
|2007-07-07 03:08:17  |1       |3         |767-4707                 |
|2007-07-07 04:07:36  |2       |3         |420-757-3642 ext: 61998  |
|2007-07-07 05:13:12  |3       |3         |1-819-163-0774           |
|2007-07-07 06:19:28  |4       |NULL      |734-4072                 |
|2007-07-07 07:36:30  |5       |3         |667-588-0328             |
|2007-07-07 09:50:51  |6       |3         |1-475-246-3524           |
|2007-07-07 13:50:10  |7       |3         |466-5901                 |
|2007-07-07 16:06:00  |8       |NULL      |540-4805                 |
|2007-07-07 18:31:48  |9       |NULL      |230-135-8787             |
---------------------------------------------------------------------