Parsing Complex Excel Files Using Snowflake’s New Java UDF Feature

A simple walkthrough of Snowflake’s new Java UDF feature

Venkat Sekar
Hashmap, an NTT DATA Company
7 min read · Jul 21, 2021


In many of my Snowflake engagements, clients typically look to side-load data from Excel files. These files are either maintained by their internal business users or sourced from external data providers. In most cases, these Excel files are parsed and converted to CSV or JSON files and then hosted in Snowflake as external tables.

Recently Snowflake previewed the Java UDF feature, so I started exploring its capabilities. In this post, I’ll share my findings and a prototype implementation that leverages this new feature to parse Excel files (or any other files) directly into Snowflake.

Key Takeaways

  • The Snowflake Java UDF can use third-party libraries, as long as the functionality stays within Snowflake's restrictions
  • Other than specific imported files, no network or file operations are allowed
  • Custom and complex Excel or PDF files can be directly parsed as part of select statements
  • How to keep the Excel data file separate from the packaged jar, so that changes to the data are picked up
  • A working prototype of the Java UDF implementation

Java UDF

Third-party library usage

Initially, when I came across the Snowflake Java UDF feature, I was not impressed, since we can already write procedural logic, loops and conditionals, inside JavaScript UDFs.

To ensure that my understanding was correct, I started reviewing the documentation and realized that a Java UDF allows you to package third-party libraries (XML parsers, Excel parsers, etc.) along with your code. This is different from the JavaScript approach, where you are limited solely to what Snowflake provides.

Reading files

As detailed in Snowflake's documentation, a Java UDF cannot make network calls or read files from the local file system. The reasoning is security: these restrictions close off potential threats and loopholes.

But if you read between the lines, the documentation notes that you can read files as long as they are declared in the IMPORTS clause of the CREATE FUNCTION statement. An example of this is detailed in the documentation's "Java UDF Cookbook" section, and sketched below.
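In rough terms, the cookbook pattern looks like this. The com.snowflake.import_directory system property comes from Snowflake's documentation; the helper class and method names here are just illustrative:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ImportedFileReader {
    // Snowflake copies the files listed in a UDF's IMPORTS clause into a
    // sandbox directory and exposes that directory's path via a system
    // property (per the documentation, the value ends with a path separator).
    static InputStream openImportedFile(String fileName) throws IOException {
        String importDir = System.getProperty("com.snowflake.import_directory");
        return new FileInputStream(importDir + fileName);
    }
}
```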

Questions & Ideation

All of the above got me thinking…

  • Can I host the UDF jar in an external stage, i.e. an S3 bucket?
  • If we declare the Excel data file, also hosted in S3, as an import, can the Java UDF still read and process the file?
  • How do we keep the UDF package (jar) separate from the Excel data file?
  • If a user adds or updates a record in the Excel file and then uploads the changed file to S3, will the UDF observe the changes?

The answer to all of these questions is yes. To fully understand the different factors in play, though, a detailed walkthrough should help provide some context.

Implementation prototype

The prototype code is called snowexceljudf and is available publicly in the Hashmap GitLab account.

Dataset

We use the open "Motor-vehicle-sales" dataset to demonstrate parsing a complex Excel file. Within the workbook, the data is stored in a sheet named "Data".

Execution

The call to the UDF returns the records as an array of JSON records.

The array can then be flattened into columns using LATERAL FLATTEN, as shown below.
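A minimal sketch of that query, assuming a UDF named parse_motor_vehicle_sales and illustrative column names (the actual columns depend on the workbook):

```sql
-- The UDF returns an ARRAY whose elements are JSON strings; LATERAL
-- FLATTEN emits one row per element, which we parse and cast into
-- typed columns. If the elements were already VARIANT objects, the
-- PARSE_JSON calls would be unnecessary.
SELECT
    PARSE_JSON(f.value::STRING):year::NUMBER  AS year,
    PARSE_JSON(f.value::STRING):sales::NUMBER AS sales
FROM (SELECT parse_motor_vehicle_sales('Motor-vehicle-sales.xlsx') AS recs) t,
     LATERAL FLATTEN(input => t.recs) f;
```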

Implementation class

Apache POI is a popular and widely used Java library for parsing XLSX files and other document formats. We can use this library to read the Excel data file and return the data as a list of JSON records. The logic for this is implemented in the Java class "com.hashmap.blog.snowflake.MotorVehicleParser"; a simplified sketch of the approach follows below.
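Here is a minimal, hypothetical sketch of such a handler, not the exact class from the repo: the method name, column positions, and JSON fields are all illustrative, and the String[] return type is what Snowflake exposes to SQL as an ARRAY:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelParserSketch {

    // Snowflake invokes the public method named in the UDF's HANDLER.
    public String[] parse(String fileName) throws IOException {
        // Files declared in IMPORTS are copied into this sandbox directory.
        String importDir = System.getProperty("com.snowflake.import_directory");
        List<String> records = new ArrayList<>();

        try (InputStream in = new FileInputStream(importDir + fileName);
             XSSFWorkbook workbook = new XSSFWorkbook(in)) {
            // The dataset lives in the sheet named "Data".
            Sheet sheet = workbook.getSheet("Data");
            for (Row row : sheet) {
                if (row.getRowNum() == 0) continue;  // skip the header row
                // Emit one JSON record per row; real cell handling depends
                // on the workbook's actual layout.
                records.add(String.format(
                        "{\"year\": %d, \"sales\": %.2f}",
                        (long) row.getCell(0).getNumericCellValue(),
                        row.getCell(1).getNumericCellValue()));
            }
        }
        return records.toArray(new String[0]);
    }
}
```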

Following along with Snowflake’s Java UDF documentation, we can use the Maven Assembly Plugin to package the compiled class and the various dependencies into a “fat” jar. In this scenario, the size of the jar came out to be 40MB.

Hosting

We host the packaged jar and the Excel data file in an S3 bucket, in separate folders. The S3 bucket is then declared as an external stage in Snowflake, which makes both the data and the packaged code addressable from the CREATE FUNCTION statement, as the sketch below shows.
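A sketch of that setup, with hypothetical bucket, stage, and folder names (the handler method name is also an assumption; only the class name comes from the repo):

```sql
-- Hypothetical layout: the jar lives under code/, the workbook under data/.
CREATE OR REPLACE STAGE excel_udf_stage
  URL = 's3://my-bucket/snowexceljudf/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Both the jar and the data file must be declared in IMPORTS; the UDF
-- can only read files that are listed here.
CREATE OR REPLACE FUNCTION parse_motor_vehicle_sales(file_name VARCHAR)
  RETURNS ARRAY
  LANGUAGE JAVA
  IMPORTS = ('@excel_udf_stage/code/snowexceljudf.jar',
             '@excel_udf_stage/data/Motor-vehicle-sales.xlsx')
  HANDLER = 'com.hashmap.blog.snowflake.MotorVehicleParser.parse';
```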

Observations & Thoughts

Package artifact size

For this prototype, the jar came out to around 40MB. When the UDF is called, Snowflake copies the artifact from the stage to the virtual machine that executes it. The larger the jar, the longer the UDF takes to initialize, so be aware of this cold-start cost.

Record count

In this scenario, there are around 2,500 records; a much larger record count could cause issues, since the UDF returns the entire result as a single array. Counts in this range have been typical with the clients I have worked with in the past, but I recommend being mindful of this limit.

Changes or updates to the Excel data file

If the Excel file is changed, the UDF observes the changed records, as long as the updated file is uploaded under the same original file name. Although the file name can be passed as a parameter, that does not mean files can be read from any stage: the UDF is constrained to the list of files declared in the IMPORTS clause of the CREATE FUNCTION statement.

Read and Store

If the Excel file does not change often, it is more performant to parse it once and store the result in a transient table, and have downstream data processing refer to that table instead, as sketched below.
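For example, a one-time materialization could look like this (same hypothetical names as above):

```sql
-- Parse the workbook once and let downstream queries hit the table,
-- instead of re-running the UDF (and re-parsing the file) every time.
CREATE OR REPLACE TRANSIENT TABLE motor_vehicle_sales AS
SELECT
    PARSE_JSON(f.value::STRING):year::NUMBER  AS year,
    PARSE_JSON(f.value::STRING):sales::NUMBER AS sales
FROM (SELECT parse_motor_vehicle_sales('Motor-vehicle-sales.xlsx') AS recs) t,
     LATERAL FLATTEN(input => t.recs) f;
```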

Development experience

In the past, developing JavaScript UDFs was a simple but cumbersome task: there was no good IDE support, and a SELECT/DML operation was always needed to test the code. Not being able to import libraries meant that common modules could not be developed outside Snowflake, resulting in copies upon copies of the same code across multiple JavaScript UDFs.

With Snowflake's Java UDF, I can develop locally using an IDE of my choice. I was able to build the package using Maven, and there is potential to run unit tests with JUnit or another framework. Personally, I am a huge fan of code review, documentation, etc., so this really excites me.

Other Java UDF usage scenarios

Based on the above, I believe the following usage patterns can be adopted using the Snowflake Java UDF.

  • Parsing PDF files or other custom file formats that contain data
  • Simple predictive ML models could be packaged into the jar
  • Complex mathematical calculations that do not fit inside a JavaScript stored procedure

Closing Thoughts

This started out as a simple experiment, but it turned into several potential use cases with real-world value. I hope this walkthrough has given you something to think about. Either way, the new Java UDF feature released by Snowflake is here to stay and will hopefully expand.

Ready to Accelerate Your Digital Transformation?

At Hashmap an NTT DATA Company, we work with our clients to build better, together. We are partnering across industries to solve the toughest data challenges — we can help you shorten time-to-value!

We offer a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our data and cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Additional Resources

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
