Snowflake
Published in

Snowflake

Parsing embedded HTML tables in Snowflake Using a Java User Defined Table Function

In this article, I will demonstrate how we can extend the text parsing capabilities of the Snowflake Data Cloud using a Java User Defined Table Function (UDTF). One of my customers recently asked how I would extract the data from embedded HTML tables in a VARCHAR column. Since I have been a Java programmer, my first thought was to try creating a Java User Defined Function. I found that a Java UDTF can provide a very elegant solution to this problem by returning the number of rows from the embedded tables.

Sample Test Data

I first came up with a set of scenarios that I could use to deal with different edge cases:

Java Libraries

There are two Java libraries that I used together to solve the problem

Once you have downloaded the two jar files to a temporary folder on your computer, you can create an internal stage in Snowflake and upload the two libraries to the stage with statements like the following.

CREATE OR REPLACE STAGE JAVA_LIB;PUT 'file:///tmp/javalib/*.jar' @JAVA_LIB AUTO_COMPRESS=FALSE;

Source Code of the Java UDTF

The following code is executed directly in Snowflake and defines two classes. First is the OutputRow class that maps Java data types to SQL data types. Second is the ParseHTMLTables class that implements the two required methods, getOutputClass() and process(). The process() method must return a Stream object which we can provide from an ArrayList with the rows from our HTML tables. I recommend reviewing the Snowflake documentation on the requirements for writing UDTF.

How to use the UDTF in a query

This is the best part of using a UDTF. With a LATERAL join, we can pass the HTML column in as a parameter and Snowflake will return as many rows as the UDTF will return. Because we can’t be certain of the number of fields, or whether there were any headers to use for tags, our UDTF returned its parsed data in a VARIANT column called TABLE_ROW that can be queried by tag or column position.

--

--

--

Snowflake articles from engineers using Snowflake to power their data.

Recommended from Medium

■ Solbody Platform

Kotlin pearls 3: It’s an Object… It’s a function… It’s an Invokable

Java For Mac Os X

Big Data & QA — A Concise Overview

Static and Dynamic Memory Allocation

Sigslot — A simple signalling mechanism for Panel

【缉魂 緝魂 完整版本】 【TW-ZH】 完整版觀看電影在線小鴨 2021 完整的電影

UI Automation for mortals: simple yet elegant Page Objects using Java and Selenide

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Dan Flippo

Dan Flippo

Sr Solutions Architect @Snowflake; SME: dbt, Kafka, Oracle BI & DB, StreamSets, Vertica. Views & opinions are my own and do not represent those of my employer.

More from Medium

Snowflake UDFs for Hash Validation, Part 2: Java

How to Reverse Engineer a Snowflake Database Schema

Snowflake Backups To Amazon S3 Using Terraform

Snowflake Data ingestion from Azure using Snowpipe