Working with JSON Lines (JSONL) & multiline JSON in Apache Spark
A few days back I was working with multiline JSON (i.e. plain JSON) on Spark 2.1, and I ran into a very peculiar issue with single-line JSON (aka JSONL, or JSON Lines) vs multiline JSON files.
JSON Lines vs JSON
Consider an example; our JSON looks like the one below.
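For illustration, a JSON file with three hypothetical rows:

```json
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"},
  {"id": 3, "name": "Carol"}
]
```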
Here we can see that we have 3 rows, all enclosed inside a JSON array.
Now if we represent the same data as JSON Lines, it would look something like the example below.
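For example, three hypothetical records in JSON Lines form:

```json
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Carol"}
```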
So, compared to plain JSON, a JSONL file clearly lacks the following:
- Square brackets representing an array; in a JSONL file every line holds one record, delimited by a newline.
- Commas; JSONL files do not require a comma after every record.
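The difference can be sketched with Python's standard json module (the sample records here are hypothetical):

```python
import json

# Plain JSON: the whole file is one document (here, an array of records).
json_text = '[{"id": 1}, {"id": 2}, {"id": 3}]'
records_from_json = json.loads(json_text)

# JSON Lines: one complete JSON document per line, no array, no commas.
jsonl_text = '{"id": 1}\n{"id": 2}\n{"id": 3}'
records_from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]

# Same data, different layout on disk.
assert records_from_json == records_from_jsonl
```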
Reading JSON & JSONL files
- By default, Spark assumes JSON files contain JSON Lines (JSONL format) rather than multiline JSON.
- The behaviour when working with JSON files differs between Spark versions prior to 2.2 and Spark 2.2 onwards.
Working with Single Line Records prior to Apache Spark 2.2
- When working with JSON files (both JSONL and JSON), if the whole record is present on a single line, then we can simply read it using
spark.read.json(<path>)
- When we read this data, we get the results as expected, in the form of a DataFrame.
Working with Multi Line Records prior to Apache Spark 2.2
- If we have data spread across multiple lines, as below,
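For illustration, a file whose records span several lines (hypothetical data):

```json
[
  {
    "id": 1,
    "name": "Alice"
  },
  {
    "id": 2,
    "name": "Bob"
  }
]
```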
- If we try to simply read it, we will get _corrupt_record entries instead of parsed rows.
Solution
- To be able to read multiline JSON records prior to Spark 2.2, we have to use sc.wholeTextFiles(), which gives us an RDD that we can then convert to a DataFrame.
Working with Multi Line Records post Apache Spark 2.2
- With Apache Spark 2.2+, it becomes very easy to work with multiline JSON files; we just need to add the option
multiline='true'
- Suppose we have the following data,
- Now we can simply read it using spark.read.json() with the option multiline='true'.
Thanks for reading! Please comment with any queries or corrections.
#happy_coding #codebrace