How to Flatten JSON Files Dynamically Using Apache PySpark (Python)

Thomas Thomas
3 min read · Feb 5, 2022

Several file types are available when we look at the use case of ingesting data from different sources: Parquet, Avro, ORC, CSV, JSON, XML, and so on. JSON is the most commonly used file type when designing a data service or a data-streaming integration pattern for data engineering work.

A JSON file stores simple data structures and objects in JavaScript Object Notation (JSON), a standard data interchange format. JSON files are lightweight, text-based, human-readable, and can be edited with any text editor.

Everyone in the data engineering world has had to flatten JSON using Spark at least once. I have had multiple past use cases that required flattening 100+ JSON feeds from different data sources. Rather than writing code to flatten each JSON individually, I started looking for a generic function that could flatten any JSON with any level of nesting, which would save a lot of time. That search motivated me to write this article.

Spark provides many functions for handling JSON. The two most commonly used are explode and explode_outer.
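A minimal, hypothetical example of the difference: explode drops rows whose array is null or empty, while explode_outer keeps them and emits a null.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, explode_outer

spark = SparkSession.builder.appName("explode-demo").getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "letters"])

df.select("id", explode("letters")).show()        # row 2 disappears
df.select("id", explode_outer("letters")).show()  # row 2 survives with a null
```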

All source code in this article is written in Python on Spark (PySpark 3.1.2). This article is code-oriented and largely self-explanatory, but I will provide some explanation as we flow through the code. Now, let's open our favorite IDE and start coding…

Step 1: Download a sample nested JSON file for the flattening logic.
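The downloadable sample did not survive the export, but any small nested document works. Here is a hypothetical one, defined inline so the rest of the walkthrough is self-contained:

```python
# A small, hypothetical nested document standing in for the article's sample:
# one struct field (address) and one array-of-structs field (phones).
sample_doc = {
    "id": 1,
    "name": "John",
    "address": {"city": "Chicago", "zip": "60601"},
    "phones": [
        {"type": "home", "number": "111-222-3333"},
        {"type": "work", "number": "444-555-6666"},
    ],
}
```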

Step 2: Create a new Python file, flatjson.py, and write the Python function for flattening JSON.
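The original code listing is not reproduced here, so below is a minimal sketch of a generic flatten function. It repeatedly promotes struct fields to top-level columns (joined with "->", the separator this article uses) and explodes arrays with explode_outer, looping until no nested columns remain. The per-level *1, *2 markers of the original program are omitted from this sketch for brevity.

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten_json(df: DataFrame, sep: str = "->") -> DataFrame:
    """Flatten every StructType and ArrayType column, however deeply nested."""
    # Columns that are still nested.
    complex_fields = {f.name: f.dataType for f in df.schema.fields
                      if isinstance(f.dataType, (StructType, ArrayType))}
    while complex_fields:
        name, dtype = next(iter(complex_fields.items()))
        if isinstance(dtype, StructType):
            # Promote each struct field to a top-level "parent->child" column.
            expanded = [col(f"`{name}`.`{f.name}`").alias(f"{name}{sep}{f.name}")
                        for f in dtype.fields]
            df = df.select("*", *expanded).drop(name)
        else:  # ArrayType
            # explode_outer keeps rows whose array is null or empty.
            df = df.withColumn(name, explode_outer(col(f"`{name}`")))
        # Re-scan: the pass above may have surfaced new nested columns.
        complex_fields = {f.name: f.dataType for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))}
    return df
```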

Step 3: Initialize the Spark session.
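A standard session bootstrap is enough here (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FlattenJson")
         .getOrCreate())
```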

Step 4: Create a new Spark DataFrame from the sample JSON.
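One way to do this, assuming the inline sample_doc from Step 1, is to serialize it and let Spark infer the nested schema:

```python
import json

# Parse a single JSON record from an in-memory string.
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(sample_doc)]))
df.printSchema()
df.show(truncate=False)
```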

The printSchema and show calls above display the inferred nested schema and the raw rows of the DataFrame.

Step 5: Flatten the JSON in the Spark DataFrame using the above function.
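Assuming the flatten_json sketch from Step 2:

```python
flat_df = flatten_json(df)
flat_df.printSchema()
flat_df.show(truncate=False)
# For the sample document, the resulting columns include: id, name,
# address->city, address->zip, phones->number, and phones->type.
```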

When you execute the program, you will get a flattened Spark DataFrame: one column per leaf field and one row per exploded array element.

The program marks each level of the JSON with suffixes such as *1 and *2, and "->" indicates a child node of a parent node.

I have tested the program with many complex JSON documents, and so far every one has flattened successfully. However, the program is still in beta, so please test thoroughly against your own use case before using this code in an actual production environment.

Make sure you use appropriate Spark partitioning when you deal with large datasets; exploding arrays can multiply the row count considerably.
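For example (the partition count below is purely illustrative):

```python
# explode multiplies rows, so repartition large inputs before flattening.
flat_df = flatten_json(df.repartition(200))
```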

“Happy Coding…”

