Get Correctly Formatted Deep-Nested JSON Files At Scale Directly From (Azure) SQL Server
Some third-party applications ask for JSON files as input when importing new data. An example is Splunk, a software platform used to search, analyze and visualize machine-generated data gathered from websites, applications, sensors, devices, etc. If the JSON format is mandatory for sharing information and the data you need to analyze is stored in a database, you have to transform your data from a tabular format into JSON, following a given JSON schema. If you hand this task to developers, their first idea is usually to build an application (in C#, Java or any other programming language) that connects to the source database, loads the data using an ORM (with possible performance issues due to the inability to write optimal SQL code), transforms it using the proper libraries, and then exports the output to a JSON file. But if your data is stored in a Microsoft SQL Server database, you are lucky: starting from the 2016 version on-premises and on Azure SQL Database, you can manage JSON data directly in your RDBMS.
The following is an ER diagram of the tables we’ll use in our queries:
SQL Server JSON Capabilities
Starting from SQL Server 2016 on-premises and in Azure SQL Database, you can use built-in functions and operators to do the following things with JSON text:
- Parse JSON text and read or modify values
- Transform arrays of JSON objects into table format
- Run any T-SQL query on the converted JSON objects
- Format the results of T-SQL queries in JSON format
This article focuses on the last point, i.e. how to properly shape the result set of a complex query in order to get correctly formatted nested JSON as output.
Basically, you have two methods to generate JSON results using the FOR JSON clause: FOR JSON AUTO and FOR JSON PATH.
FOR JSON AUTO Limitations
As you can read in the official docs, “when you specify the AUTO option, the format of the JSON output is automatically determined based on the order of columns in the SELECT list and their source tables. You can’t change this format”.
Order of SELECT Columns
Let’s try this. Suppose you want to share a few details about a couple of orders with an external application. First of all, we need to gather the data in a tabular format:
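The query could look like the following sketch. The schema is an assumption on my part (a WideWorldImporters-style layout, with hypothetical `Sales.Orders` and `Sales.OrderLines` tables joined on `OrderID`; only the order code 15342 appears in the text, the second code is made up for illustration). Adapt the names to your own database:

```sql
-- Hypothetical schema: Sales.Orders (order headers) and
-- Sales.OrderLines (order lines), joined on OrderID.
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate,
       ol.[Description],
       ol.Quantity
FROM Sales.Orders AS o
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
WHERE o.CustomerPurchaseOrderNumber IN ('15342', '15346');
```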
Here is the result:
As you can see, the order with code 15342 has two lines. Now let’s transform this result set into JSON text, first by adding “FOR JSON AUTO” at the end of the previous query and then by swapping the SELECT columns, running the following two queries:
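A sketch of the two queries, using the same hypothetical `Sales.Orders`/`Sales.OrderLines` tables assumed earlier (not the original figures):

```sql
-- Query 1: order columns come first in the SELECT list,
-- so Orders is the outer (level 0) section and OrderLines is nested.
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate,
       ol.[Description],
       ol.Quantity
FROM Sales.Orders AS o
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
FOR JSON AUTO;

-- Query 2: order-line columns come first,
-- so OrderLines becomes the outer section and Orders is nested.
SELECT ol.[Description],
       ol.Quantity,
       o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate
FROM Sales.Orders AS o
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
FOR JSON AUTO;
```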
You’ll get the following different results:
Actually, if you use SQL Server Management Studio as a client, the JSON result isn’t formatted as shown in the previous code: you’ll get one single line of JSON text. You can click on it and save the file with the .json extension. I use the JSTool plugin for Notepad++ to format it, but you can use whatever formatting tool you want (Visual Studio Code, online services like JSON Formatter, etc.). You can also open the JSON file in a browser to get a good visualization of it. For example, in Firefox you’ll have the following UI:
If instead you’re using Azure Data Studio, after clicking on the JSON result it will be automatically formatted in a new tab, as long as it is not too large; otherwise a single line will be shown and the formatting option will be disabled.
Getting back to our example, the nesting order of orders and details swaps according to the order of the columns (and indirectly of the joined source tables) used in your SELECT statement, in this way:
- Each table corresponds to a nesting section. The first section to be nested (level 0) is the one corresponding to the table of the first column in the SELECT list. All the other columns of that table/section present in the SELECT statement are then grouped together, even if they are not consecutive to the first column.
- The first column in the SELECT list that does not belong to the first table/section determines the second nested section (level 1).
- And so on.
Multiple Nested Sections at Different Levels
Moreover, FOR JSON AUTO doesn’t allow you to nest multiple sections in your JSON at arbitrary levels of your choice. For example, suppose you want to get order details like the following, with the salesPerson and details sections nested at the same level (the first one):
If you try a query like the following one:
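A sketch of such a query, with the same hypothetical schema as before plus an assumed `Application.People` table playing the sales person role:

```sql
-- The sales person columns come right after the order columns,
-- so AUTO makes salesPerson the level-1 section and nests the
-- order lines (level 2) inside it.
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate,
       sp.FullName,
       sp.PhoneNumber,
       ol.[Description],
       ol.Quantity
FROM Sales.Orders AS o
INNER JOIN Application.People AS sp
    ON sp.PersonID = o.SalespersonPersonID
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
FOR JSON AUTO;
```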
you’ll get the following wrong result, where the details section (level 2) is nested inside the salesPerson one (level 1):
With AUTO there is no way to tell the SQL engine to move the details section up one level.
FOR JSON PATH Flexibility
The docs suggest that “to maintain full control over the output of the FOR JSON clause, specify the PATH option”. This is the only way to get arrays of JSON objects nested as you want. You can format nested results by using dot-separated column names or by using nested queries. Let’s look at some examples.
CASE 1: Simple Order Details
Let’s try using only dot-separated column names to get just order headers and order details:
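A sketch with dot-separated aliases (again on the hypothetical `Sales.Orders`/`Sales.OrderLines` tables assumed earlier):

```sql
-- Dot-separated aliases produce nested objects, but still emit
-- one JSON object per result-set row.
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       ol.[Description] AS [details.description],
       ol.Quantity AS [details.quantity]
FROM Sales.Orders AS o
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
FOR JSON PATH;
```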
Here is the result:
As you can see, the details section is nested inside the main one, but the result consists of three order objects, one for each order line, and that is not what we want. Looking at the desired JSON result of fig. 3, the details section has to be an array of lines. So the number of order objects has to be just two, one of them containing a nested details section with two lines. You can get this result using nested queries:
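A possible nested-query version, assuming the same hypothetical tables as before:

```sql
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       -- The correlated sub-query returns a whole JSON array of
       -- lines for each order, so each order appears only once.
       (SELECT ol.[Description] AS [description],
               ol.Quantity AS [quantity]
        FROM Sales.OrderLines AS ol
        WHERE ol.OrderID = o.OrderID
        FOR JSON PATH) AS [details]
FROM Sales.Orders AS o
FOR JSON PATH;
```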
Now the result is the desired one:
CASE 2: Order and Sales Person Details
Now let’s add the Sales Person details to the previous query. The Sales Person section has to be at the same nesting level as the Details one. Both sections have to be arrays, so nested queries are the way to format your JSON correctly:
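A sketch of this query, with the sales person again taken from an assumed `Application.People` table whose `CustomFields` column holds JSON text:

```sql
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       -- Both sub-queries sit at the same level of the SELECT list,
       -- so salesPerson and details end up as sibling sections.
       (SELECT sp.FullName AS [name],
               sp.PhoneNumber AS [phone],
               sp.CustomFields AS [customFields]
        FROM Application.People AS sp
        WHERE sp.PersonID = o.SalespersonPersonID
        FOR JSON PATH) AS [salesPerson],
       (SELECT ol.[Description] AS [description],
               ol.Quantity AS [quantity]
        FROM Sales.OrderLines AS ol
        WHERE ol.OrderID = o.OrderID
        FOR JSON PATH) AS [details]
FROM Sales.Orders AS o
FOR JSON PATH;
```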
Here is the result:
The result looks quite good, except for the customFields value. It is itself JSON text, but it isn’t interpreted as JSON. We need something to “parse” that text and transform it into full-fledged JSON.
CASE 3: Order and Sales Person Details Validating and Appending JSON text values
There are a bunch of SQL Server built-in functions to validate, query and change JSON data, as shown in the docs. But there is nothing specific about how to validate and append a JSON text contained in a column. After playing a little bit with the aforementioned functions, I obtained the desired result.
So the query used in case 2 becomes the following:
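One way to do it (a sketch on the same hypothetical schema as before) combines ISJSON and JSON_QUERY:

```sql
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       (SELECT sp.FullName AS [name],
               sp.PhoneNumber AS [phone],
               -- ISJSON guards against malformed text (the value
               -- falls back to NULL); JSON_QUERY tells FOR JSON the
               -- string is already JSON, so it is embedded as-is
               -- instead of being escaped as a plain string.
               JSON_QUERY(CASE WHEN ISJSON(sp.CustomFields) > 0
                               THEN sp.CustomFields END) AS [customFields]
        FROM Application.People AS sp
        WHERE sp.PersonID = o.SalespersonPersonID
        FOR JSON PATH) AS [salesPerson],
       (SELECT ol.[Description] AS [description],
               ol.Quantity AS [quantity]
        FROM Sales.OrderLines AS ol
        WHERE ol.OrderID = o.OrderID
        FOR JSON PATH) AS [details]
FROM Sales.Orders AS o
FOR JSON PATH;
```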
Now the result is the expected one:
Performance of Nested Queries
The nested queries we used previously are also called correlated sub-queries, and they often lead to poor performance when tables with a non-trivial number of rows are involved. This is due to the nested loop operators these sub-queries introduce in the query plan:
The nested loop join uses the top input as the outer input table and the bottom one as the inner input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.
Sometimes you have to deal with deeply nested JSON files derived from joining huge tables. In that case, using correlated sub-queries in one shot to get all the nested sections will lead to a query with very poor performance.
Get All Your Columns First And Then Use Nested Queries
A good strategy to improve the query performance is the following:
1) Get all the columns you need, avoiding nested queries and using only simple joins, taking care to also keep in the SELECT statement all the JOIN columns used in each JOIN clause. You’ll benefit from any existing indexes and get this result set faster. Then persist it in a temp table.
2) Query the temp table multiple times, once for each nested section, using nested queries to format your JSON result.
An example is worth a thousand words. Consider the following query:
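A sketch of such a query, extending the hypothetical schema used so far with assumed `Warehouse.StockItems` and `Purchasing.Suppliers` tables, with every section obtained through a correlated sub-query directly on the source tables:

```sql
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       (SELECT sp.FullName AS [name],
               sp.PhoneNumber AS [phone]
        FROM Application.People AS sp
        WHERE sp.PersonID = o.SalespersonPersonID
        FOR JSON PATH) AS [salesPerson],
       (SELECT ol.[Description] AS [description],
               ol.Quantity AS [quantity],
               -- details -> stockItem -> supplier: three nesting levels,
               -- each one a further correlated sub-query.
               (SELECT si.StockItemName AS [name],
                       (SELECT s.SupplierName AS [name]
                        FROM Purchasing.Suppliers AS s
                        WHERE s.SupplierID = si.SupplierID
                        FOR JSON PATH) AS [supplier]
                FROM Warehouse.StockItems AS si
                WHERE si.StockItemID = ol.StockItemID
                FOR JSON PATH) AS [stockItem]
        FROM Sales.OrderLines AS ol
        WHERE ol.OrderID = o.OrderID
        FOR JSON PATH) AS [details]
FROM Sales.Orders AS o
FOR JSON PATH;
```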
That is not the case here, but if the source tables had been bigger, the previous query would have taken a long time to complete. In that case, get all the columns you need in tabular form using simple joins (including all the join columns of the involved tables), and then persist the result set in a temporary table:
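For example, on the hypothetical schema assumed in this post, the temp table could be built like this:

```sql
-- All the columns needed by every nested section, plus every join
-- column, gathered once with simple joins and persisted.
SELECT o.OrderID,                      -- join column, kept on purpose
       o.CustomerPurchaseOrderNumber,
       o.OrderDate,
       sp.FullName,
       sp.PhoneNumber,
       sp.CustomFields,
       ol.OrderLineID,
       ol.[Description],
       ol.Quantity,
       ol.StockItemID,                 -- join column, kept on purpose
       si.StockItemName,
       si.SupplierID,                  -- join column, kept on purpose
       s.SupplierName
INTO #Orders
FROM Sales.Orders AS o
INNER JOIN Application.People AS sp
    ON sp.PersonID = o.SalespersonPersonID
INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
INNER JOIN Warehouse.StockItems AS si
    ON si.StockItemID = ol.StockItemID
INNER JOIN Purchasing.Suppliers AS s
    ON s.SupplierID = si.SupplierID;
```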
Given the #Orders table, you can rewrite, for example, the nested section salesPerson in this way:
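Assuming the hypothetical `#Orders` columns sketched above, the salesPerson sub-query could be rewritten against the temp table as a fragment to be placed in the outer SELECT list:

```sql
-- salesPerson section rebuilt from the temp table: GROUP BY
-- collapses the order-line granularity of #Orders back to one
-- row per sales person.
(SELECT t.FullName    AS [name],
        t.PhoneNumber AS [phone]
 FROM #Orders AS t
 WHERE t.OrderID = o.OrderID   -- correlation to the outer order
 GROUP BY t.FullName, t.PhoneNumber
 FOR JSON PATH) AS [salesPerson]
```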
You have to use GROUP BY since the #Orders table has a finer granularity than the salesPerson entity. You may be tempted to use DISTINCT to eliminate duplicates, but in our case it would be a disaster from a performance point of view.
DISTINCT collects all of the rows and then tosses out duplicates. GROUP BY can filter out the duplicate rows before performing any of that work.
Each nested section has a WHERE clause relating it to its outer section. With GROUP BY, the WHERE clause is resolved before the duplicates are removed, improving the performance of the query. For more information, read this blog post by Aaron Bertrand.
Since the details section contains two nested sections (details → stockItem → supplier), it’s recommended to persist all the details in another temp table to get better performance, in this way:
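A sketch of this second temp table, built from the hypothetical `#Orders` table assumed above:

```sql
-- One row per order line, with the stockItem (and its nested
-- supplier) already serialized as JSON text in a single column.
SELECT t.OrderID,
       t.OrderLineID,
       t.[Description],
       t.Quantity,
       (SELECT t.StockItemName AS [name],
               -- JSON_QUERY keeps the inner supplier array embedded
               -- as JSON instead of an escaped string.
               JSON_QUERY((SELECT t.SupplierName AS [name]
                           FOR JSON PATH)) AS [supplier]
        FOR JSON PATH) AS stockItemJson
INTO #OrderDetails
FROM #Orders AS t;
```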
Notice that the whole stockItem section is “compressed” into one column (stockItemJson) as JSON text. So you can easily expand it using JSON_QUERY, as explained before.
At this point you can get the same result as the query in fig. 5 with the following one:
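Putting the pieces together, a sketch of the final query over the two hypothetical temp tables could be:

```sql
SELECT o.CustomerPurchaseOrderNumber AS [code],
       o.OrderDate AS [date],
       (SELECT t.FullName AS [name],
               t.PhoneNumber AS [phone],
               JSON_QUERY(CASE WHEN ISJSON(t.CustomFields) > 0
                               THEN t.CustomFields END) AS [customFields]
        FROM #Orders AS t
        WHERE t.OrderID = o.OrderID
        GROUP BY t.FullName, t.PhoneNumber, t.CustomFields
        FOR JSON PATH) AS [salesPerson],
       (SELECT d.[Description] AS [description],
               d.Quantity AS [quantity],
               -- expand the pre-serialized stockItem JSON text
               JSON_QUERY(d.stockItemJson) AS [stockItem]
        FROM #OrderDetails AS d
        WHERE d.OrderID = o.OrderID
        FOR JSON PATH) AS [details]
FROM #Orders AS o
GROUP BY o.OrderID, o.CustomerPurchaseOrderNumber, o.OrderDate
FOR JSON PATH;
```

Note that the outer query also uses GROUP BY (not DISTINCT) to collapse #Orders back to one row per order, for the performance reasons discussed above.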
Performance Gains On a Production Environment
I’ve recently adopted this strategy for a customer who needed to extract JSON files from their data. One of the JSON files had these requirements:
- The base result set in tabular form is generated by the joining of 26 tables
- The top 3 biggest tables have 543 million, 105 million and 90 million rows respectively
- The file has to be generated every 5 minutes, including only the most recently arrived entities (from 3K to 5K entities)
- The deepest nested level of the resulting JSON is the 6th one
The execution of the “original” query, made up of correlated sub-queries based directly on the source tables, was still running after 10 minutes.
After applying the above-mentioned strategy, I got the JSON file (using an Integration Services dtsx package; I could have used the BCP utility, but it had a bug with JSON results when I wrote this post) in just 1 minute and 30 seconds!
If your main data is persisted in an (Azure) SQL Server database (a version ≥ 2016 is needed for the on-premises database) and you need to extract your data in a JSON format, you don’t need to develop an external application to do that. (Azure) SQL Server already has built-in functions to manipulate JSON data.
Since flexible formatting of JSON text requires the use of correlated sub-queries, it’s quite easy to write a query that performs badly when executed. So a strategy for properly refactoring this “original” query has been explained in this blog post.