Working with large JSON files in Snowflake — Part III

Introduction

Back in April of 2021, I was confronted with a customer issue. They were building a Security Data Lake with Snowflake and one of their data sources was a very large file in JSON format. While Snowflake can natively ingest semi-structured formats (JSON, XML, Parquet, etc.), the max size of a VARIANT column is 16MB compressed. Even though the individual array elements were well under that limit, they were so numerous that the files were significantly larger than the 16MB ingestion limit.

My first suggestion was to pre-process the JSON files and split them into smaller chunks with the jq command-line utility, and I blogged about that here. That worked, but I knew it wasn’t the optimal solution because it required you to guess at an appropriate “chunk size” before running the batch command. If you guessed wrong and the chunks were still too big, you had to re-split them with a smaller chunk value.

Then Snowflake introduced two new features: Java User-Defined Table Functions and Directory Tables on external stages. It occurred to me that I could solve this problem by combining them. I wrote a Java UDTF that used an open-source JSON parser to instantiate the file as a JSON object, then returned the array I wanted as a table of varchar strings. Since none of the array elements were larger than 16MB, they could be loaded directly into a table as a VARIANT. That also worked great: I tested it with JSON files in excess of 700MB, and I blogged about that here.

About two months ago, the customer came back to me: my function had started crashing and was blowing out the JVM heap. They were using it to parse a JSON file that was over 1.6GB… I saw the issue right away: my code instantiated the entire JSON structure as an in-memory JsonObject, and that’s simply too memory-intensive with files of that size. I decided to try rewriting the function with a streaming JSON parser.

Streaming Parsers

Streaming parsers use far less memory because they treat a JSON document as a sequence of tokens traversed in depth-first order. Since only one token is in play at any moment, the memory overhead stays minimal regardless of file size. After a thorough review of the available options, I chose the Gson parser library. It supports both streaming and object parsing, and I ended up using both in the new version of the function.

Here’s an example of the “token” concept with a streaming parser. Given a JSON snippet, the parser emits a token (BEGIN_OBJECT, NAME, STRING, END_ARRAY, and so on) as each significant character is parsed; the short sketch below prints the full token stream for a small sample document.

The code that processes these tokens is essentially a big WHILE loop, walking through the document from the first BEGIN_OBJECT token until the final END_DOCUMENT token is encountered. You write the processing code for each type of token into a SWITCH statement.
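
To make that loop concrete, here’s a tiny standalone Java sketch (the sample document and the println calls are purely illustrative) that walks a JSON string with Gson’s JsonReader and prints every token it encounters:

import java.io.StringReader;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

public class TokenWalk {
    public static void main(String[] args) throws Exception {
        // A small sample document; any JSON will do
        String json = "{\"source\":\"demo\",\"Records\":[{\"id\":1},{\"id\":2}]}";

        JsonReader reader = new JsonReader(new StringReader(json));
        while (true) {
            JsonToken token = reader.peek();
            System.out.println(token);                   // BEGIN_OBJECT, NAME, STRING, ...
            if (token == JsonToken.END_DOCUMENT) break;
            switch (token) {
                case BEGIN_OBJECT: reader.beginObject(); break;
                case END_OBJECT:   reader.endObject();   break;
                case BEGIN_ARRAY:  reader.beginArray();  break;
                case END_ARRAY:    reader.endArray();    break;
                case NAME:         reader.nextName();    break;
                default:           reader.skipValue();   break;  // STRING, NUMBER, BOOLEAN, NULL
            }
        }
        reader.close();
        // Token sequence for this sample: BEGIN_OBJECT, NAME, STRING, NAME, BEGIN_ARRAY,
        // BEGIN_OBJECT, NAME, NUMBER, END_OBJECT, BEGIN_OBJECT, NAME, NUMBER, END_OBJECT,
        // END_ARRAY, END_OBJECT, END_DOCUMENT
    }
}

Every token is consumed with the matching reader call, and anything we don’t care about gets skipValue(); that’s what keeps the memory footprint flat no matter how large the document is.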

Implementing the parseJSON function

Step 1: Get the Gson parser JAR file

Go to https://mavenlibs.com/jar/file/com.google.code.gson/gson and download gson-2.9.0.jar. (This code was tested with 2.9.0, but check for any newer stable releases…)

Step 2: Upload the Gson .jar file into your internal stage

I prefer to use a dedicated internal stage for my Java .jar files, which I’ve named jars. If you don’t already have an internal stage for this purpose, create it now. Remember that stage definitions live within the context of a database. Since this function may be needed across other databases, it may be smart to create a utils database specifically to contain all these types of assets.

To upload the jar file into the stage, connect with the SnowSQL utility and set the context to the correct database and schema. Use the PUT command to upload the file into the stage, making sure not to compress it:

PUT file://gson-2.9.0.jar @JARS auto_compress=false;

Step 3: Create the Java UDTF

This uses the “in-line” style for defining non-SQL functions, which declares the function and compiles the Java code in one step rather than requiring a separate Java IDE and a pre-compiled .jar. Either method works well, but I find this one simpler.

Run this code in a Snowflake worksheet:

Some key points of interest in this version of the function (a condensed sketch of its overall structure follows these notes):

Line 2: the prior version of this function returned an array of strings. I encountered situations where the strings exceeded the max length of a varchar field, so I changed it to return an array of Snowflake variants.

Lines 4 & 7: if your stage isn’t called jars or you’ve decided to upload a different version of the Gson parser, you’ll have to change these lines accordingly.

Line 5: this version also uses the Variant datatype as defined in the Snowpark Java classes. We add Snowpark to a function with the PACKAGES clause, not the IMPORTS clause.

Line 45: I found a more expedient way to determine whether the JSON file is gzipped.

Line 61: Added a LogMsg() function that can help with debugging. Any messages written here are included in the resultset. Set the boolean variable logging on line 72 to TRUE to enable this.

Line 85: this is the first of two WHILE loops in the function. This loop skips through every token it encounters until it finds the name you’ve requested AND it is also the start of an array. If your JSON happens to have two arrays that are named exactly the same and you want the second one, this code won’t help you.

Line 113: this is the start of the second WHILE loop. This uses a SWITCH statement to handle the different token types. When a BEGIN_OBJECT is encountered, Gson can grab the entire object and pop that as a Snowflake Variant onto the results array. This loop continues until the final END_ARRAY token is encountered.
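
To tie those notes together, here’s a condensed, illustrative sketch of an in-line Java UDTF built along the same lines. It is not the full function (no logging, minimal error handling), and the handler name and the gzip “magic byte” check shown here are simplifications for clarity, but it shows the overall shape: the IMPORTS and PACKAGES clauses, the two WHILE loops, and the Variant output column.

CREATE or REPLACE function parseJSON(jsonFilePath varchar, arrayName varchar)
RETURNS TABLE (datum variant)
LANGUAGE JAVA
IMPORTS = ('@jars/gson-2.9.0.jar')
PACKAGES = ('com.snowflake:snowpark:latest')
HANDLER = 'ParseJsonHandler'
AS
$$
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;
import com.snowflake.snowpark_java.types.SnowflakeFile;
import com.snowflake.snowpark_java.types.Variant;

// One output row per array element, surfaced as a single VARIANT column named "datum"
class OutputRow {
    public Variant datum;
    public OutputRow(Variant datum) { this.datum = datum; }
}

class ParseJsonHandler {
    public static Class getOutputClass() { return OutputRow.class; }

    public Stream<OutputRow> process(String jsonFilePath, String arrayName) {
        List<OutputRow> rows = new ArrayList<>();
        try {
            // Open the staged file via its scoped URL
            InputStream raw = SnowflakeFile.newInstance(jsonFilePath).getInputStream();

            // One simple gzip check (not necessarily the exact one in the full function):
            // peek at the first two bytes for the gzip magic number 0x1f 0x8b
            PushbackInputStream pb = new PushbackInputStream(raw, 2);
            byte[] sig = new byte[2];
            int n = pb.read(sig);
            if (n > 0) pb.unread(sig, 0, n);
            boolean gzipped = n == 2 && (sig[0] & 0xff) == 0x1f && (sig[1] & 0xff) == 0x8b;
            InputStream in = gzipped ? new GZIPInputStream(pb) : pb;

            JsonReader reader = new JsonReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            Gson gson = new Gson();

            // Loop 1: skip every token until we find the requested name at the start of an array
            boolean found = false;
            while (!found && reader.peek() != JsonToken.END_DOCUMENT) {
                switch (reader.peek()) {
                    case BEGIN_OBJECT: reader.beginObject(); break;
                    case END_OBJECT:   reader.endObject();   break;
                    case NAME:
                        if (reader.nextName().equals(arrayName) && reader.peek() == JsonToken.BEGIN_ARRAY) {
                            reader.beginArray();
                            found = true;
                        }
                        break;
                    default: reader.skipValue(); break;   // scalars and arrays we don't want
                }
            }

            // Loop 2: stream each element (assumed to be a JSON object) out as one Variant row
            while (found && reader.hasNext()) {
                JsonObject element = gson.fromJson(reader, JsonObject.class);
                rows.add(new OutputRow(new Variant(element.toString())));
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return rows.stream();
    }
}
$$;

Returning a Stream of output rows is what lets Snowflake treat the function as a table: each array element comes back as one row with a single VARIANT column named datum, which is exactly what the queries in the next steps select.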

Step 5: Create a Directory Table for your JSON files.

This can be either an internal or external stage.

CREATE or REPLACE stage dataStage
DIRECTORY = (ENABLE = TRUE REFRESH_ON_CREATE = TRUE);

Note: for external S3 stages, you can also add AUTO_REFRESH = TRUE

Step 6: Double-check the Directory Table metadata.

Run this SQL to see if the stage metadata has refreshed.

select build_scoped_file_url(@dataStage, relative_path) fileName 
from directory(@dataStage) ;

If there are no results, refresh the metadata manually with this statement:

alter stage dataStage refresh;

Step 7: Test the function call

Start by testing against a single file in the stage with this SQL statement, replacing ‘path/and/filename.ext’ with one of your filenames, and ‘Records’ with the name of the desired array from your JSON:

with F (fileName) as (
select build_scoped_file_url(@dataStage, relative_path) fileName
from directory(@dataStage)
where relative_path = 'path/and/filename.ext'
)
select datum
from F,
table(parseJSON(F.fileName, 'Records'))
;

Step 8: Insert variant values directly into a Snowflake table

To route the resultset into a table with a variant column, use the following sample SQL (again, replacing ‘Records’ with the array name from your file):

CREATE or REPLACE table T1 (V variant);

INSERT into T1 (V)
with F (fileName) as (
select build_scoped_file_url(@dataStage, relative_path) fileName
from directory(@dataStage)
)
SELECT datum
from F,
table(parseJSON(F.fileName, 'Records'))
;

Conclusion

This solution presents a workaround to a specific JSON handling issue in Snowflake — the processing of JSON files that exceed the 16MB (compressed) limitation. But it’s far from a complete solution. For example, the requested array must be a 1st-level object in the JSON file. This code can’t process “nested” arrays. Each individual array element must still be under the 16MB limitation. This code is presented “as-is” and ongoing support is neither promised nor implied.

All that being said, I’d love to know if this helps you! Drop me a note in the comments and let me know.

Let it Snow!
