Flattening Nested Data (JSON/XML) Using Apache Spark

Saikrishna Pujari
2 min read · Jun 21, 2020


Introduction:

The one thing we can all agree on is that working with semi-structured data like JSON/XML in Spark is not easy, because it is not SQL friendly.

The problem lies in the nested schema with complex data types, which makes it difficult to apply SQL queries without relying on built-in functions such as the Spark SQL JSON functions.

In most cases, it’s better to flatten the nested structure for either transformations or analysis using SQL.

The approach in this article uses Spark's ability to infer the schema from files at load time; this inferred schema is then used to programmatically flatten the complex types.
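As a minimal sketch of this loading step (the file path, the app name, and the multiLine option are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and input path, for illustration only.
val spark = SparkSession.builder()
  .appName("NestedDataFlattener")
  .master("local[*]")
  .getOrCreate()

// multiLine = true lets Spark parse JSON documents that span multiple lines;
// the schema (including ArrayType and StructType columns) is inferred at load time.
val rawDf = spark.read
  .option("multiLine", "true")
  .json("data/input.json")

rawDf.printSchema()
```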

Code snippets and Explanation:

Implementation steps:

  1. Load the JSON/XML into a Spark DataFrame.
  2. Loop until the nested-element flag is set to false.
  3. Loop through the schema fields and set the flag to true whenever an ArrayType or StructType is found.
  4. For ArrayType columns, explode them; for StructType columns, pull the inner fields out into separate columns (see the sketch after this list).
  5. The loop exits once all the levels are flattened out.
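The core of this loop might look roughly like the sketch below. This is a simplified illustration, not the code from the repository; the function name flattenDataFrame, the use of explode_outer, and the parentName_childName aliasing convention are my own assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Iterative strategy: keep rewriting the DataFrame until no complex columns remain.
def flattenDataFrame(df: DataFrame): DataFrame = {
  var flatDf = df
  var nestedFound = true

  while (nestedFound) {
    // Look for the first ArrayType or StructType column in the current schema.
    flatDf.schema.fields.find(f =>
      f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType]
    ) match {
      case None =>
        nestedFound = false // all levels are flattened out; the loop exits
      case Some(field) =>
        field.dataType match {
          // ArrayType: explode_outer keeps rows whose array is null or empty.
          case _: ArrayType =>
            flatDf = flatDf.withColumn(field.name, explode_outer(col(field.name)))
          // StructType: promote each inner field to a top-level column.
          case st: StructType =>
            val otherCols = flatDf.columns.filter(_ != field.name).map(col)
            val innerCols = st.fieldNames.map(n =>
              col(s"${field.name}.$n").alias(s"${field.name}_$n"))
            flatDf = flatDf.select(otherCols ++ innerCols: _*)
          case _ => nestedFound = false // unreachable given the find() predicate
        }
    }
  }
  flatDf
}
```

Calling flattenDataFrame on the loaded DataFrame repeats these explode/select passes until printSchema shows only primitive columns.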

Note: If we want to restore the nested structure after the transformations or analysis, we can group by the columns outside the ArrayType and combine columns back into a StructType, essentially reversing the flattening process.
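As a rough illustration of that reversal (the column names id, order_item, and order_qty and the output name orders are hypothetical, and flatDf stands for the flattened DataFrame from the sketch above):

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Group by the columns that sat outside the ArrayType and rebuild the array of structs.
val renested = flatDf
  .groupBy(col("id"))
  .agg(collect_list(struct(col("order_item"), col("order_qty"))).alias("orders"))
```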

Input-JSON:

Output:

Output-Count:

For a single input JSON record, flattening produced 36 output records. This multiplication is expected: each explode emits one output row per array element, so nested arrays multiply the record count.

GitHub URL:

I have provided driver classes for both the Batch and Stream scenarios, but the streaming one has not been fully tested.

Flatten Strategy: Two schema iteration strategies are provided: 1. Iterative, 2. Recursive.
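The recursive flavour is not reproduced from the repository here; a rough sketch of the idea, handling one complex column per call and then recursing on the result, could look like this:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursive strategy: flatten the first complex column found, then recurse.
def flattenRecursively(df: DataFrame): DataFrame =
  df.schema.fields.find(f =>
    f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType]
  ) match {
    case None => df // base case: no ArrayType or StructType left, schema is flat
    case Some(field) =>
      field.dataType match {
        case _: ArrayType =>
          flattenRecursively(df.withColumn(field.name, explode_outer(col(field.name))))
        case st: StructType =>
          val others = df.columns.filter(_ != field.name).map(col)
          val inner  = st.fieldNames.map(n =>
            col(s"${field.name}.$n").alias(s"${field.name}_$n"))
          flattenRecursively(df.select(others ++ inner: _*))
        case _ => df // unreachable: the find() predicate only matches complex types
      }
  }
```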

https://github.com/saikrishnapujari/Spark-Nested-Data-Parser
