Parsing embedded HTML tables in Snowflake Using a Java User Defined Table Function
In this article, I will demonstrate how we can extend the text parsing capabilities of the Snowflake Data Cloud using a Java User Defined Table Function (UDTF). One of my customers recently asked how I would extract the data from embedded HTML tables in a VARCHAR column. Since I have been a Java programmer, my first thought was to try creating a Java User Defined Function. I found that a Java UDTF can provide a very elegant solution to this problem by returning the number of rows from the embedded tables.
Sample Test Data
I first came up with a set of scenarios that I could use to deal with different edge cases:
Java Libraries
There are two Java libraries that I used together to solve the problem
- org.jsoup: This library can parse nearly any HTML text and build a document object model similar to a web browser that you can query using CSS selectors. Click here to download the jar from the Maven repository.
- org.json: This library allows you to assemble data in Java objects and output the data as valid JSON text. Click here to download the jar from the Maven repository.
Once you have downloaded the two jar files to a temporary folder on your computer, you can create an internal stage in Snowflake and upload the two libraries to the stage with statements like the following.
CREATE OR REPLACE STAGE JAVA_LIB;PUT 'file:///tmp/javalib/*.jar' @JAVA_LIB AUTO_COMPRESS=FALSE;
Source Code of the Java UDTF
The following code is executed directly in Snowflake and defines two classes. First is the OutputRow
class that maps Java data types to SQL data types. Second is the ParseHTMLTables
class that implements the two required methods, getOutputClass()
and process()
. The process()
method must return a Stream
object which we can provide from an ArrayList
with the rows from our HTML tables. I recommend reviewing the Snowflake documentation on the requirements for writing UDTF.
How to use the UDTF in a query
This is the best part of using a UDTF. With a LATERAL
join, we can pass the HTML column in as a parameter and Snowflake will return as many rows as the UDTF will return. Because we can’t be certain of the number of fields, or whether there were any headers to use for tags, our UDTF returned its parsed data in a VARIANT
column called TABLE_ROW
that can be queried by tag or column position.