Using SQL to Parse a Large JSON Array in Snowflake

JSON is a great data transfer format, and one that is extremely easy to use in Snowflake. From time to time, we get questions from customers about dealing with JSON files that are ‘too large’ for Snowflake to handle. Usually, it turns out that it isn’t actually the file that’s too big, but some element of the file. As the documentation for the VARIANT data type notes, the maximum size of a VARIANT object is 16MB. This is the same as the maximum size of a VARCHAR.

As I mentioned, the issue isn’t the total file size, but the size of the largest element. In the case of the problem a customer ran into recently, the entire JSON file consists of a single element called ‘Report_Entry’, which holds an array of all 22,000 active employees of the company. It looks something like this:
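
(A sketch for illustration: the field names here are invented, and the real file has many more fields per employee, but the overall shape is the same.)

{
"Report_Entry": [{
  "employee_id": "12345",
  "first_name": "Jane",
  "last_name": "Doe",
  "department": "Engineering"
},
{
  "employee_id": "12346",
  "first_name": "John",
  "last_name": "Smith",
  "department": "Sales"
}]
}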

Since the Report_Entry element can’t be over 16MB, and there are about 22k employees, that gives us about 750 characters per employee if we want the array to fit into a single element. This particular source file is running more like 1,000 characters per employee, which makes the element too big and, therefore, the file can’t be processed directly.
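
(For reference: 16 MB is 16,777,216 bytes, and 16,777,216 ÷ 22,000 ≈ 762, so roughly 750 characters per record is the ceiling.)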

We were discussing using Python either to convert this file into something Snowflake could handle more easily or to parse it outright, but since the format is extremely straightforward, I realized there’s an easy way to do it directly in SQL.

Here’s how that works:

  1. Create a file format that will let us load the data as raw text, one row per line
  2. Stage the data into a table, with row numbers
  3. Dump the element name, array opener, and array closer, leaving just the employee records
  4. Identify the first and last row for each employee record
  5. Break the existing name/value pairs into separate strings
  6. Aggregate the entire result back into JSON objects in a VARIANT column of a table

At that point, the data can be queried with ordinary Snowflake SQL syntax.

Note: some of the snippets below make use of column or other object names that are only included in another snippet. The full SQL code is included near the end of the article.

1. Create the File Format

Snowflake’s FILE FORMAT object allows defining lots of parameters about the structure of a data file. In this particular case, we’re going to use the standard CSV format as the basis, but we’re going to generate only a single column by not using a field delimiter:

CREATE OR REPLACE FILE FORMAT RAW_TEXT_ROWS
TYPE = CSV
FIELD_DELIMITER = 'NONE'
REPLACE_INVALID_CHARACTERS = TRUE;

When we use this file format, then, every line of the data file will be turned into a single row in the target table.
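
If you want to sanity-check the format before loading anything, you can query the staged file with it directly (this assumes the file is already sitting in the @test/Active stage path used in the next step):

SELECT $1
FROM @test/Active (FILE_FORMAT => 'RAW_TEXT_ROWS')
LIMIT 5;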

2. Stage the Data to a Table

In this snippet, I am creating a very simple target table called JSON_ROW_INPUT, which will hold the data and line number for each line in the file. Then, I run the COPY INTO command. I staged the file to an internal stage called test and the beginning of the file name happens to be Active, as in ‘Active Employees’, so the query simply references that location and path: @test/Active.

CREATE OR REPLACE TABLE JSON_ROW_INPUT 
(
ROW_DATA VARCHAR,
ROW_NUM INTEGER
);

COPY INTO JSON_ROW_INPUT
FROM (SELECT $1, METADATA$FILE_ROW_NUMBER
FROM @test/Active)
FILE_FORMAT = (FORMAT_NAME ='RAW_TEXT_ROWS');

As you can see, I included the format name of the FILE FORMAT object, but I also included METADATA$FILE_ROW_NUMBER; this ensures that we can determine the exact order in which each line appeared in the original file. That will be important when we’re assembling rows of this table into objects.
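
As a quick sanity check (not something the rest of the process depends on), you can confirm that the lines landed in file order:

SELECT ROW_NUM, ROW_DATA
FROM JSON_ROW_INPUT
ORDER BY ROW_NUM
LIMIT 5;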

3. Dump Unnecessary Rows

The raw JSON came in a very convenient format: each line contains exactly one token. It’s this fact that allows the rest of this to work so easily. For example, there are only two lines of JSON that aren’t needed: the "Report_Entry": [{ line near the top that identifies the single element, and the }] line near the bottom that completes the array. If we remove those two, we are left with a complete list of valid employee objects, because the outermost brackets that surround the rest of the content take over as the opening bracket of the first employee record and the closing bracket of the last one. We can accomplish this very simply, with a WHERE clause:

WHERE ROW_DATA NOT IN ('"Report_Entry": [{', '}]')

4. Identify the Rows for each Employee

Since we know that the rows were ordered in the original file, and the only rows present are actual employee records and the dividing brackets, we can filter out all the non-bracket rows, and do a self join on the table to match up the brackets.

SELECT OB.ROW_NUM OB_ROW, MIN(CB.ROW_NUM) CB_ROW
FROM DATA OB
INNER JOIN DATA CB ON OB.ROW_NUM < CB.ROW_NUM
WHERE OB.K = '{' AND CB.K = '}'
GROUP BY OB.ROW_NUM

In this join, OB represents the opening bracket in each JSON object, while CB represents the closing bracket. Naturally, the first bracket in the entire file is on the first line and the join will link it to all of the many closing brackets later in the file (with a later row number), but the MIN aggregation ensures that we only get the closing bracket that follows most closely after each opening bracket. This lets us know the starting and ending points of each employee record.

I initially thought I’d use a windowing function like ROW_NUMBER, but realized this was simpler, and probably faster.

5. Break the N/V Pairs

Every row of the file that contains actual data has a name and a value, surrounded by double quotes and separated by colon-space (: ). To turn these into strings, we can use the SPLIT function, but we also need to get rid of some possible blank space and extra quotation marks:

TRIM(SPLIT(TRIM(ROW_DATA, ' ,'), ': ')[0]::VARCHAR, '"') K,
TRIM(SPLIT(TRIM(ROW_DATA, ' ,'), ': ')[1]::VARCHAR, '"') V

This takes the original row data and trims off any leading or trailing spaces and commas; then it splits the row into an array containing as many values as necessary, based on the presence of the : separator. In this case, that means we will always get either one or two values: if there’s no separator, we get a single value, { or }, and if there is one, we get the name (aka ‘key’) and the value. Finally, for each string, it trims off any double quotation marks.

It is worth noting that I’m working with a very clean data set that doesn’t have any extraneous occurrences of : . In many other data sets, it may be necessary to build a slightly more complicated expression to account for the possibility that either a key or a value might actually have : in it.
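
For instance, if a value (though not a key) could contain the separator, one rough sketch would be to treat everything before the first ': ' as the key and everything after it as the value. This is only an illustration of the idea, not something this file needed:

TRIM(SPLIT_PART(TRIM(ROW_DATA, ' ,'), ': ', 1), '"') K,
TRIM(SUBSTR(TRIM(ROW_DATA, ' ,'), POSITION(': ', TRIM(ROW_DATA, ' ,')) + 2), '"') V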

6a. Aggregate it all back into JSON

Snowflake supports numerous SQL functions for manipulating objects in JSON format, such as OBJECT_CONSTRUCT and OBJECT_AGG. Given pairs of values, OBJECT_CONSTRUCT will convert them into a single object of name/value pairs. For example:

OBJECT_CONSTRUCT('Key1', 'Value1', 'Key2', 'Value2') will produce:

{
"Key1": "Value1",
"Key2": "Value2"
}

OBJECT_AGG does something very similar, but based on a set of rows. If it were just run on a table containing the above values, it would return the exact same JSON output as above. In our case, though, every row of the table can be associated with an opening bracket’s row number, so we can group on that and we will get back a single row per employee.
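
For example, running it against a small inline table of key/value rows (purely an illustration) produces the same object shown above:

SELECT OBJECT_AGG(K, V::VARIANT)
FROM (VALUES ('Key1', 'Value1'), ('Key2', 'Value2')) T (K, V);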

SELECT OBJECT_AGG(K, V::VARIANT) EMPLOYEE
FROM DATA
INNER JOIN OB_AND_CB ON ROW_NUM BETWEEN OB_ROW AND CB_ROW
WHERE DATA.K NOT IN ('{', '}')
GROUP BY OB_ROW
ORDER BY OB_ROW

The OBJECT_AGG function requires that the first argument (the key) be a string, which is how the SPLIT already left it, and that the second argument (the value) be a VARIANT, so I’m converting the V column to VARIANT on the fly. The INNER JOIN uses the row numbers for the opening bracket (OB) and closing bracket (CB) to gather up all the records for a given employee, and we filter out the actual bracket rows at this point.

The GROUP BY ensures that all the key/value pairs for a single employee are aggregated into one object. The ORDER BY isn’t strictly necessary, but I put it in so that the records come out in the same order as they appear in the file.

6b. Write the converted data into a table

Putting all of those pieces together, we get the following query, wrapped in a CREATE OR REPLACE TABLE command to persist the data. Here is the full SQL:

CREATE OR REPLACE TABLE EMPLOYEE_JSON AS
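-- DATA: each line of the file (minus the Report_Entry wrapper lines), split into key (K) and value (V)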
WITH DATA AS (
SELECT ROW_NUM,
TRIM(SPLIT(TRIM(ROW_DATA, ' ,'), ': ')[0]::VARCHAR, '"') K,
TRIM(SPLIT(TRIM(ROW_DATA, ' ,'), ': ')[1]::VARCHAR, '"') V
FROM JSON_ROW_INPUT
WHERE ROW_DATA NOT IN ('"Report_Entry": [{', '}]')
),
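-- OB_AND_CB: pair each opening-bracket row with the nearest closing-bracket row that follows it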
OB_AND_CB AS (
SELECT OB.ROW_NUM OB_ROW, MIN(CB.ROW_NUM) CB_ROW
FROM DATA OB
INNER JOIN DATA CB ON OB.ROW_NUM < CB.ROW_NUM
WHERE OB.K = '{' AND CB.K = '}'
GROUP BY OB.ROW_NUM
)
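-- Aggregate each employee's key/value pairs into a single JSON object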
SELECT OBJECT_AGG(K, V::VARIANT) EMPLOYEE
FROM DATA
INNER JOIN OB_AND_CB ON ROW_NUM BETWEEN OB_ROW AND CB_ROW
WHERE DATA.K NOT IN ('{', '}')
GROUP BY OB_ROW
ORDER BY OB_ROW
;
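
With the table in place, each employee is a single VARIANT object, so the usual colon-path syntax applies. The field names below are placeholders; the real names are whatever keys appear in the source file:

SELECT EMPLOYEE:employee_id::VARCHAR AS EMPLOYEE_ID,
       EMPLOYEE:last_name::VARCHAR AS LAST_NAME
FROM EMPLOYEE_JSON
LIMIT 10;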

The original data file consisted of 651,220 lines. The parsing was only this easy because the file was already well formatted, but JSON often is. Even if it weren’t formatted quite so well, it probably wouldn’t be very hard to apply some simple preprocessing to make this work, such as making sure the brackets are on lines by themselves.

Even more important, though, is that this particular JSON isn’t very complicated. Just as I was able to keep the key-value split very simple because I knew there were no extra occurrences of : in my sample, I was able to use the brackets as delimiters because there were no child objects involved. If any of the elements had included nested arrays or objects, I’d have needed to do something more complicated to get it to work, possibly including using a procedural language instead of a single SQL query.

Conclusion

My customer’s original comment was “this not-very-big JSON file is too big to handle in Snowflake” (not a direct quote). Often, the answer to ‘too big’ is to have some process shred the file; CSV files, for example, can often just be broken into chunks at arbitrary line breaks. My colleague Paul Horan gives a great example of that and describes how to use jq to split up JSON files; he has also demonstrated doing the same thing with Java, as well as in a streaming mode. In some cases, though, it isn’t necessary to break out a procedural language just because the JSON is big.

Dylan Kaufman
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Sales Engineer at Snowflake; biker, father, reader, programmer, long-time analytics consultant