Working with JSON and nested arrays in Microsoft Fabric

Jacob Rønnow Jensen
5 min read · Dec 29, 2023

As Data Engineers, we receive more and more semi-structured data as input for our data warehouses, much of it in the form of JSON: a text-based data format for storing key-value pairs and arrays of information. The reason is that JSON is a convenient format for transmitting data within and between applications, and it is also often used for storing data in document databases, a popular choice for operational workloads that require a flexible schema for fast, iterative development.

It is possible to store semi-structured data like JSON in a Lakehouse in Microsoft Fabric, as well as to create shortcuts from Fabric directly to outside sources. However, JSON is not a good format for analytics beyond the individual documents, and in my opinion it only makes sense to do analytics on the raw JSON in very specific scenarios: for example, if it is a requirement of a downstream process, if true real-time processing is of the essence, or if the amounts of data are very small and query performance is not an issue.

Having worked in a setup where data from millions of JSON documents had to be made available for analysis on a daily basis, I have learned that, from an analytical point of view, it makes far more sense to move documents from the "Transactional Store" of the source system (optimized for reads and writes to a single document) to an "Analytical Store" (optimized for analytical queries), where the key-value pairs in the JSON are de-serialized and stored as columnar data, as illustrated below.

From row store to column store

Example: basic JSON de-serialized
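
A minimal PySpark sketch of that de-serialization step, as it could look in a Fabric notebook (the path and table name are hypothetical, and the schema is inferred from the documents rather than declared up front):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw JSON documents into a DataFrame. Spark infers the schema,
# so the key-value pairs in the documents become ordinary columns.
raw_docs = spark.read.json("Files/landing/documents/*.json")

# Persist the de-serialized documents as a columnar Delta table in the
# Lakehouse, one row per source document.
raw_docs.write.format("delta").mode("append").saveAsTable("bronze_documents")
```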

As an Analytical Store, Fabric has the advantage over a relational database that data is stored "schema-on-read" in Delta Parquet and can seamlessly handle new key-value pairs in the source JSON. At the same time, the data is available for efficient queries with Spark, SQL, and Power BI, without having to move data around to avoid locks or having to maintain schemas, indexes, partitions, or data distributions.
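
Continuing the sketch above, Delta Lake's mergeSchema option is one way this flexibility plays out in practice: new keys in the source JSON simply become new columns instead of breaking the load (the path and table name are still hypothetical).

```python
# A later batch of documents may contain keys that earlier documents did not have.
new_docs = spark.read.json("Files/landing/documents_batch2/*.json")

# mergeSchema adds any new columns to the Delta table; existing rows get nulls
# for those columns, and downstream queries keep working.
(new_docs.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("bronze_documents"))
```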

Having access to the raw data from each individual document as one row in a Lakehouse makes it easy to validate the data in the bronze layer against the source (following the terminology of the Medallion Lakehouse Architecture). In addition, the Spark engine also addresses the small-file problem by generating files with a default size of 128 MB and dynamically optimizing partitions.
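
Should small files still accumulate over time (for example from frequent incremental loads), the table can also be compacted explicitly. A minimal sketch, assuming the hypothetical bronze table from above:

```python
# Compact small Parquet files in the Delta table into fewer, larger files.
spark.sql("OPTIMIZE bronze_documents")

# Optionally clean up files no longer referenced by the table
# (the default retention period applies).
spark.sql("VACUUM bronze_documents")
```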

There are several paths for ingesting JSON into a Lakehouse in Fabric. In addition to the ones outlined in the article "Fabric decision guide — copy activity, dataflow, or Spark", there can also be circumstances where there is a need to react to streaming data (as with log analytics) and where Real-Time Analytics might be a better choice. This scenario is described in the article "Ingest, filter, and transform real-time events and send them in Delta Lake format to Microsoft Fabric Lakehouse".

https://learn.microsoft.com/en-us/fabric/real-time-analytics/overview

Working with arrays in JSON

As mentioned at the beginning of this article, in addition to key-value pairs, JSON can also contain arrays. To complicate things even further, arrays can themselves contain arrays, a structure often referred to as "nested arrays".

Example: JSON with nested array

If the data in the arrays is homogeneous and structured by nature, I have found that, for most analytical use cases, it makes sense to iteratively de-serialize the arrays and persist them in their own tables with a foreign key relationship to the ID of the original document, as sketched below. This also applies to nested arrays, which should get their own table with a relationship to the parent array as well as to the ID of the root element (for maintenance and data quality purposes). Based on the typical use of the information in the arrays, the related tables can be more or less normalized.

Iteratively de-serializing nested arrays in JSON
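
To make the pattern concrete, here is a minimal PySpark sketch based on a simplified, hypothetical order document with a nested array (document structure, column names, and table names are all made up for illustration). Each level of nesting is exploded into its own table, carrying the root document ID along as a foreign key:

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# A simplified document: an order with line items, each holding an array of discounts.
doc = json.dumps({
    "orderId": 1001,
    "customer": "Contoso",
    "lines": [
        {"lineNo": 1, "sku": "A-100", "qty": 2,
         "discounts": [{"code": "WINTER", "pct": 10}]},
        {"lineNo": 2, "sku": "B-200", "qty": 1, "discounts": []},
    ],
})
docs = spark.read.json(spark.sparkContext.parallelize([doc]))

# Root level: one row per document.
orders = docs.select("orderId", "customer")

# First array level: one row per order line, keeping orderId as a foreign key.
lines = (docs
    .select("orderId", explode("lines").alias("line"))
    .select("orderId", "line.lineNo", "line.sku", "line.qty"))

# Nested array level: one row per discount, keeping both orderId (root)
# and lineNo (parent array) as foreign keys.
discounts = (docs
    .select("orderId", explode("lines").alias("line"))
    .select("orderId", col("line.lineNo").alias("lineNo"),
            explode("line.discounts").alias("discount"))
    .select("orderId", "lineNo", "discount.code", "discount.pct"))

# Persist each level as its own Delta table in the Lakehouse.
orders.write.format("delta").mode("append").saveAsTable("silver_orders")
lines.write.format("delta").mode("append").saveAsTable("silver_order_lines")
discounts.write.format("delta").mode("append").saveAsTable("silver_order_discounts")
```

In a real pipeline the documents would of course come from files or tables in the Lakehouse rather than an inline string, and the writes to the related tables are exactly where the transactional considerations below come into play.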

Although the Lakehouse in Fabric has transactional support, the Data Warehouse artifact is the only one that supports multi-table transactions, meaning that if one step in a transaction fails, previous writes to other tables/Parquet files in the same transaction will be rolled back, not just the write to the table/file that failed. This may be an argument for having a Lakehouse as the "bronze" layer and a Synapse Data Warehouse as the "silver" layer in this scenario.

However, this is where the real beauty of Fabric, OneLake and Shortcuts kicks in.

Even though the bronze layer is a Lakehouse and the silver layer is a Data Warehouse, users with access can query across them seamlessly, and shortcuts can be set up from the silver to the bronze layer where no additional transformation (and thus no data movement) is needed.

If there is a need to handle (nested) arrays with full transactional support, the Data Warehouse artifact, with its full DQL, DML, and DDL T-SQL support (JSON_VALUE, JSON_QUERY, OPENJSON, …), can be used, all while still operating in a schema-on-read paradigm where new patterns in the JSON will not break our data processing downstream.

Lakehouse and Data Warehouse in a Medallion Architecture

The flexibility and performance of OneLake, Delta Parquet, and shortcuts, together with the option to seamlessly choose the data engineering tool that best fits the task at hand, have the potential to eliminate a lot of the problems we have traditionally had when doing analysis on large amounts of data stored in JSON documents.

There is still some setup to do and there are decisions to be made at the engineering level, but with Fabric it has definitely become a lot simpler to get from data to information.
