Flattening Nested Data (JSON/XML) Using Apache Spark

Saikrishna Pujari
2 min read · Jun 21, 2020


Introduction:

The one thing we can all agree on is that working with semi-structured data like JSON/XML in Spark is not easy, because it is not SQL friendly.

The problem lies in the nested schema with complex data types, which makes it difficult to apply SQL queries without relying on built-in functions such as the Spark SQL JSON functions.

In most cases, it’s better to flatten the nested structure for either transformations or analysis using SQL.

The approach in this article uses Spark's ability to infer the schema from files at load time; this inferred schema is then used to programmatically flatten the complex types.
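As a minimal sketch of this loading step (the file path, the app name, and the multiLine option are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and input path, for illustration only.
val spark = SparkSession.builder()
  .appName("NestedDataFlattener")
  .master("local[*]")
  .getOrCreate()

// multiLine = true lets Spark parse JSON documents that span multiple lines;
// the schema (including ArrayType and StructType columns) is inferred at load time.
val rawDf = spark.read
  .option("multiLine", "true")
  .json("data/input.json")

rawDf.printSchema()
```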

Code snippets and Explanation:

Implementation steps:

  1. Load the JSON/XML into a Spark DataFrame.
  2. Loop until the nested-element flag is set to false.
  3. Loop through the schema fields and set the flag to true whenever an ArrayType or StructType is found.
  4. For ArrayType columns, explode them; for StructType columns, pull the inner fields out into separate columns (see the sketch after this list).
  5. The loop exits once all the levels are flattened out.
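The core of this loop might look roughly like the sketch below. This is a simplified illustration, not the code from the repository; the function name flattenDataFrame, the use of explode_outer, and the parentName_childName aliasing convention are my own assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Iterative strategy: keep rewriting the DataFrame until no complex columns remain.
def flattenDataFrame(df: DataFrame): DataFrame = {
  var flatDf = df
  var nestedFound = true

  while (nestedFound) {
    // Look for the first ArrayType or StructType column in the current schema.
    flatDf.schema.fields.find(f =>
      f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType]
    ) match {
      case None =>
        nestedFound = false // all levels are flattened out; the loop exits
      case Some(field) =>
        field.dataType match {
          // ArrayType: explode_outer keeps rows whose array is null or empty.
          case _: ArrayType =>
            flatDf = flatDf.withColumn(field.name, explode_outer(col(field.name)))
          // StructType: promote each inner field to a top-level column.
          case st: StructType =>
            val otherCols = flatDf.columns.filter(_ != field.name).map(col)
            val innerCols = st.fieldNames.map(n =>
              col(s"${field.name}.$n").alias(s"${field.name}_$n"))
            flatDf = flatDf.select(otherCols ++ innerCols: _*)
          case _ => nestedFound = false // unreachable given the find() predicate
        }
    }
  }
  flatDf
}
```

Calling flattenDataFrame on the loaded DataFrame repeats these explode/select passes until printSchema shows only primitive columns.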

Note: If we want to restore the nested structure after the transformations or analysis, we can group by the columns outside the ArrayType and combine columns back into a StructType, essentially reversing the flattening process.
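As a rough illustration of that reversal (the column names id, order_item, and order_qty and the output name orders are hypothetical, and flatDf stands for the flattened DataFrame from the sketch above):

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Group by the columns that sat outside the ArrayType and rebuild the array of structs.
val renested = flatDf
  .groupBy(col("id"))
  .agg(collect_list(struct(col("order_item"), col("order_qty"))).alias("orders"))
```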

Input-JSON:

Output:

Output-Count:

For a single input JSON record, flattening produced 36 output records. This multiplication is expected: each explode emits one output row per array element, so nested arrays multiply the record count.

GitHub URL:

I have provided driver classes for both the Batch and Stream scenarios, but the streaming one has not been fully tested.

Flatten Strategy: Two schema iteration strategies are provided: 1. Iterative, 2. Recursive.
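The recursive flavour is not reproduced from the repository here; a rough sketch of the idea, handling one complex column per call and then recursing on the result, could look like this:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursive strategy: flatten the first complex column found, then recurse.
def flattenRecursively(df: DataFrame): DataFrame =
  df.schema.fields.find(f =>
    f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType]
  ) match {
    case None => df // base case: no ArrayType or StructType left, schema is flat
    case Some(field) =>
      field.dataType match {
        case _: ArrayType =>
          flattenRecursively(df.withColumn(field.name, explode_outer(col(field.name))))
        case st: StructType =>
          val others = df.columns.filter(_ != field.name).map(col)
          val inner  = st.fieldNames.map(n =>
            col(s"${field.name}.$n").alias(s"${field.name}_$n"))
          flattenRecursively(df.select(others ++ inner: _*))
        case _ => df // unreachable: the find() predicate only matches complex types
      }
  }
```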

https://github.com/saikrishnapujari/Spark-Nested-Data-Parser
