Parsing embedded HTML tables in Snowflake Using a Java User Defined Table Function
In this article, I will demonstrate how we can extend the text parsing capabilities of the Snowflake Data Cloud using a Java User Defined Table Function (UDTF). One of my customers recently asked how I would extract the data from embedded HTML tables in a VARCHAR column. Since I have been a Java programmer, my first thought was to try creating a Java User Defined Function. I found that a Java UDTF can provide a very elegant solution to this problem by returning the number of rows from the embedded tables.
Sample Test Data
I first came up with a set of scenarios that I could use to deal with different edge cases:
There are two Java libraries that I used together to solve the problem
- org.jsoup: This library can parse nearly any HTML text and build a document object model similar to a web browser that you can query using CSS selectors. Click here to download the jar from the Maven repository.
- org.json: This library allows you to assemble data in Java objects and output the data as valid JSON text. Click here to download the jar from the Maven repository.
Once you have downloaded the two jar files to a temporary folder on your computer, you can create an internal stage in Snowflake and upload the two libraries to the stage with statements like the following.
CREATE OR REPLACE STAGE JAVA_LIB;PUT 'file:///tmp/javalib/*.jar' @JAVA_LIB AUTO_COMPRESS=FALSE;
Source Code of the Java UDTF
The following code is executed directly in Snowflake and defines two classes. First is the
OutputRow class that maps Java data types to SQL data types. Second is the
ParseHTMLTables class that implements the two required methods,
process() method must return a
Stream object which we can provide from an
ArrayList with the rows from our HTML tables. I recommend reviewing the Snowflake documentation on the requirements for writing UDTF.
How to use the UDTF in a query
This is the best part of using a UDTF. With a
LATERAL join, we can pass the HTML column in as a parameter and Snowflake will return as many rows as the UDTF will return. Because we can’t be certain of the number of fields, or whether there were any headers to use for tags, our UDTF returned its parsed data in a
VARIANT column called
TABLE_ROW that can be queried by tag or column position.