Working with large JSON files in Snowflake (part 2 — sort of…)

Introduction

Back in April of 2021, I wrote a blog post that outlined a technique for dealing with JSON files that exceed Snowflake’s internal limit of 16MB (compressed). It presented a way to use the open source jq utility to pre-process the JSON files, and split them into smaller chunks that Snowflake could ingest into a VARIANT column. It’s not particularly elegant, but it got the job done. Using that workflow, I was able to get Snowflake to ingest JSON files in excess of 200MB!

Since that article was published (a lifetime ago in cloud years…), Snowflake has released two key features into General Availability: Java User-Defined Table Functions (UDTFs) and Directory Tables. My colleagues at Snowflake were already developing sample applications that combined these two features, such as using Java UDFs to parse the text out of PDF documents and extract health metadata from DICOM images. Those examples started me thinking: why not use Java to pull apart the JSON document in real time? Let’s see how that would work.

Getting Started

The first step is to get a really large JSON file. I downloaded one from the OpenFDA project. This dataset is produced by the FDA’s Adverse Event Reporting System for pharmaceuticals; I grabbed file #1 from Q4 2020. The file is a 129MB zip archive, and over 607MB when uncompressed. If you look at the structure of that file, it has a basic “header” node, and then an array named [results] that contains 12,000 individual reports. None of those reports exceeds the 16MB boundary, but the file in aggregate is too large to be ingested directly. What I want to achieve is to ingest each “row” of the [results] array as JSON into a table with a single VARIANT column. The strip_outer_array parameter to the COPY statement won’t help here either, because the [results] array isn’t the outermost layer.

Step 1: Create an internal stage (I called mine fda_stage). Make sure to include the new directory argument.
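A minimal version of that DDL looks something like this (default internal-stage settings assumed):

    create or replace stage fda_stage
      directory = (enable = true);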

Step 2: Unzip the file archive from OpenFDA, use SnowSQL to PUT the file into the internal stage named fda_stage, and refresh the directory table.
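From a SnowSQL session, that looks roughly like the following; the local path and file name here are placeholders for wherever you unzipped the OpenFDA download:

    -- auto_compress = false keeps the relative path in the directory table free of a .gz suffix
    put file:///tmp/openfda/drug-event-0001-of-0029.json @fda_stage auto_compress = false;
    alter stage fda_stage refresh;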

You can test the directory table with a simple SQL query: select * from directory(@fda_stage);

Step 3: Create a table with a variant column.
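Mine looks like this; ADVERSE_EVENT is the table we’ll load in Step 9, and the column name is arbitrary:

    create or replace table adverse_event (src variant);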

Step 4: Create an internal stage that will hold the compiled Java code.
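Step 6 below PUTs the compiled code into a stage I’m calling @jars:

    create or replace stage jars;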

Roll Up Your Sleeves

Step 5: Using your Java IDE of choice, define these two Java classes. The first defines the shape of the output row that will be emitted, and the second contains the SplitJSON() handler. (You can obviously change the folder structure and the package naming conventions to match your requirements.)

This function takes two arguments: the first is the JSON file to parse, and the second is the name of the array node that you want to return. In our example, the parent node is called [results], so you’ll see that hard-coded as the second argument.
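For reference, here is a rough sketch of what those two classes can look like. This is not the exact code from the project: the package name, the use of Jackson for streaming through the document, and the SnowflakeFile helper for opening the staged file are all assumptions on my part. The overall shape, though (an output-row class plus a handler exposing getOutputClass() and process()), follows the documented Java UDTF handler contract.

    package com.example.udtf;   // hypothetical package name

    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.snowflake.snowpark_java.types.SnowflakeFile;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Stream;

    // Shape of the rows emitted by the UDTF: one column holding one element
    // of the requested array, serialized back out as a JSON string.
    class OutputRow {
        public String json_piece;
        public OutputRow(String jsonPiece) { this.json_piece = jsonPiece; }
    }

    public class SplitJSON {
        // Tells Snowflake which class describes the output columns.
        public static Class<?> getOutputClass() {
            return OutputRow.class;
        }

        // fileUrl:  a stage file URL (e.g. from build_stage_file_url())
        // nodeName: the top-level array to split, e.g. "results"
        public Stream<OutputRow> process(String fileUrl, String nodeName) throws IOException {
            // Open the staged file; the 'false' relaxes the scoped-URL requirement
            // so a plain stage file URL works here (again, an assumption on my part).
            InputStream in = SnowflakeFile.newInstance(fileUrl, false).getInputStream();

            ObjectMapper mapper = new ObjectMapper();
            JsonParser parser = mapper.getFactory().createParser(in);
            List<OutputRow> rows = new ArrayList<>();

            // Stream through the document until we reach the named array,
            // then read each element as its own small JSON tree.
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.FIELD_NAME
                        && nodeName.equals(parser.getCurrentName())
                        && parser.nextToken() == JsonToken.START_ARRAY) {
                    while (parser.nextToken() == JsonToken.START_OBJECT) {
                        JsonNode element = mapper.readTree(parser);
                        rows.add(new OutputRow(element.toString()));
                    }
                    break;
                }
            }
            parser.close();
            return rows.stream();
        }
    }

If you use Jackson like this, remember that its jars have to be bundled into the .jar you build in Step 6 (or uploaded separately and listed as additional imports in Step 7).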

Step 6: Export the Java project as a .JAR file, and use SnowSQL to PUT the file into the @jars stage.
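Again from SnowSQL, something like this (the .jar name is whatever your IDE produced):

    put file:///tmp/openfda/splitjson.jar @jars auto_compress = false overwrite = true;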

Step 7: Define the Java UDTF that invokes the handler method in the .jar file.
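Assuming the jar and handler class names from the sketch above, the DDL looks something like this. I’m returning the JSON as a VARCHAR column and letting SQL turn it into a VARIANT later with parse_json():

    create or replace function split_json(file_url varchar, node_name varchar)
      returns table (json_piece varchar)
      language java
      imports = ('@jars/splitjson.jar')
      handler = 'com.example.udtf.SplitJSON';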

Step 8: Test the method with a simple SELECT statement. In this test, we’re hard-coding the relative path and filename of a single JSON file that we want to parse. This is just a test — we’ll get to the good stuff in Step 9…
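With the hypothetical names above, the test query looks like this; the relative path is whatever shows up in your directory table:

    select json_piece
      from table(split_json(
             build_stage_file_url(@fda_stage, 'drug-event-0001-of-0029.json'),
             'results'))
     limit 20;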

As you can see, the function has returned a tabular resultset containing one row for each entry in the [results] array from the JSON. This is just the first 20 records due to the LIMIT clause in the SQL, but even this wouldn’t have been possible without Java functions, because we’re instantiating the entire JSON document in the JVM sandbox’s memory rather than trying to construct a single VARIANT out of it.

Step 9: Pass the entire directory table to the UDTF, and insert the data directly into the ADVERSE_EVENT table. This uses what looks like a two-“table” join in the SQL syntax, which you might think would cause a Cartesian product. But in this case Snowflake correlates the two, passing each row from the directory table to the Java UDTF.

This passes the relative_path value from the directory table as the fileName parameter to the build_stage_file_url() function. That generates a fully qualified file URL that the Java UDTF can use as its input. This processes the entire contents of the stage in one statement, and thanks to Snowflake’s parallelism, every node in the assigned warehouse will be chewing through JSON files concurrently.
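Put together, with the hypothetical names from the earlier steps, the statement looks roughly like this:

    insert into adverse_event
    select parse_json(f.json_piece)
      from directory(@fda_stage) d,
           table(split_json(build_stage_file_url(@fda_stage, d.relative_path), 'results')) f;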

This took about 40 seconds to process a single 607MB JSON file: it extracted 12,000 nodes and inserted them into a table as VARIANT rows. That’s pretty cool, IMO!

Conclusions, Caveats, and Kudos

A) This code can only ask for one “array” node at a time, so if you need multiple nodes out of a single JSON document, you have to call the function once for each node. One possibility would be to create a Table Stream on the directory table for each requested node, then create Snowflake Tasks that process their assigned Table Streams and invoke this UDTF. That would totally automate the workflow.

B) The node you’re asking for has to be an immediate child of the outermost root node. I haven’t experimented with any deeper-level array nodes, or arrays within arrays.

C) This is not supported code. The usual warnings for “running stuff that Paul wrote” apply...

Lastly, I want to thank my friend and colleague David Proksch for his help with the Java part, because I am a truly TERRIBLE Java developer.

Let it Snow!!
