Converting XML to JSON in Snowflake using External Functions
NOTE — this post is out of date with the release of Snowpark Python Functions. Please use this approach.
During a recent project with a customer involving migration from Oracle, I was asked if there was an easy way with Snowflake to convert XML data into JSON.
The answer? Out of the box — no, but we can lean on Snowflake External Functions and very easily (and quickly) implement a function in Python to do this using the xmltodict module. An alternative approach might be to write a stored procedure in Snowflake, however, this can get complicated if you have complex XML structures and can hit recursion limits.
I’m a huge fan of External Functions; the ability to be able to execute Python code (or any language in Lambdas, Azures Functions, or GCP) adds a whole new level of extensibility to Snowflakes ELT capabilities, such as the simple example you see here through to Machine Learning inference coupled with tasks and streams.
Snowflake has a lot of great documentation around the configuring of External Functions for AWS, Azure, and now GCP, so I won’t get into that detail too much however this article should provide enough context and instruction on how to get this working.
Note — this demo was built on AWS, however, there should be no reason why it cannot easily be deployed to the other cloud providers.
Python Code
In the working demo I have, my python code resides in an AWS Lambda Function, and it’s so simple; beyond the templated structure of the code (i.e. extracting the event payload, and elements within that) we do the conversion in 1 simple line of code highlighted in bold below — the rest is pretty much part of any code you would see in AWS Lambda Function used with a Snowflake External Function.
Snowflake DDL
Once the Lambda Function is implemented, and the other necessary components are configured (API Gateway, IAM etc) we can create our API Integration in Snowflake — again, this is well documented however I’ll add an example below of doing this for AWS:
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE API INTEGRATION INT_XMLTOJSON_API
API_PROVIDER = AWS_API_GATEWAY
API_AWS_ROLE_ARN = 'arn:aws:iam::000000000000:role/snowflake-api-role'
ENABLED = TRUE
API_ALLOWED_PREFIXES = ('https://zzzzzzzz.execute-api.us-west-2.amazonaws.com');
Once this is created, you can create your external function:
CREATE OR REPLACE EXTERNAL FUNCTION CONVERT_XML_TO_JSON(XML_STR STRING)
RETURNS VARIANT
API_INTEGRATION = INT_XMLTOJSON_API
AS 'https://zzzzzzzz.execute-api.us-west-2.amazonaws.com/xmltojson/';
Side note — I would not use ACCOUNTADMIN to create the External Function, instead grant usage of the API Integration to a lower role in the RBAC structure and use this to create the External Function.
Example
Using our newly created External Function, we can now make calls like so:
SELECT CONVERT_XML_TO_JSON(
'<items>
<item id="0001" type="donut">
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter id="1001">Regular</batter>
<batter id="1002">Chocolate</batter>
<batter id="1003">Blueberry</batter>
</batters>
<topping id="5001">None</topping>
<topping id="5002">Glazed</topping>
<topping id="5005">Sugar</topping>
<topping id="5006">Sprinkles</topping>
<topping id="5003">Chocolate</topping>
<topping id="5004">Maple</topping>
</item>
</items>');
and get this in return:
{
"items": {
"item": {
"@id": "0001",
"@type": "donut",
"batters": {
"batter": [
{
"#text": "Regular",
"@id": "1001"
},
{
"#text": "Chocolate",
"@id": "1002"
},
{
"#text": "Blueberry",
"@id": "1003"
}
]
},
"name": "Cake",
"ppu": "0.55",
"topping": [
{
"#text": "None",
"@id": "5001"
},
{
"#text": "Glazed",
"@id": "5002"
},
{
"#text": "Sugar",
"@id": "5005"
},
{
"#text": "Sprinkles",
"@id": "5006"
},
{
"#text": "Chocolate",
"@id": "5003"
},
{
"#text": "Maple",
"@id": "5004"
}
]
}
}
}