Parsing embedded HTML tables in Snowflake Using a Java User Defined Table Function

In this article, I will demonstrate how we can extend the text parsing capabilities of the Snowflake Data Cloud using a Java User Defined Table Function (UDTF). One of my customers recently asked how I would extract the data from embedded HTML tables in a VARCHAR column. Since I have been a Java programmer, my first thought was to try creating a Java User Defined Function. I found that a Java UDTF can provide a very elegant solution to this problem by returning the number of rows from the embedded tables.

Sample Test Data

I first came up with a set of scenarios that I could use to deal with different edge cases:

Java Libraries

There are two Java libraries that I used together to solve the problem

Once you have downloaded the two jar files to a temporary folder on your computer, you can create an internal stage in Snowflake and upload the two libraries to the stage with statements like the following.

CREATE OR REPLACE STAGE JAVA_LIB;PUT 'file:///tmp/javalib/*.jar' @JAVA_LIB AUTO_COMPRESS=FALSE;

Source Code of the Java UDTF

The following code is executed directly in Snowflake and defines two classes. First is the OutputRow class that maps Java data types to SQL data types. Second is the ParseHTMLTables class that implements the two required methods, getOutputClass() and process(). The process() method must return a Stream object which we can provide from an ArrayList with the rows from our HTML tables. I recommend reviewing the Snowflake documentation on the requirements for writing UDTF.

How to use the UDTF in a query

This is the best part of using a UDTF. With a LATERAL join, we can pass the HTML column in as a parameter and Snowflake will return as many rows as the UDTF will return. Because we can’t be certain of the number of fields, or whether there were any headers to use for tags, our UDTF returned its parsed data in a VARIANT column called TABLE_ROW that can be queried by tag or column position.

--

--

Dan Flippo
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Sr Solutions Architect @Snowflake; SME: dbt, Kafka, Oracle BI & DB, StreamSets, Vertica. Views & opinions are my own and do not represent those of my employer.