Working with JSON Lines (JSONL) & multiline JSON in Apache Spark
A few days back I was working with multiline JSON (i.e. plain JSON) on Spark 2.1, and I ran into a very peculiar issue with single-line JSON (aka JSONL, or JSON Lines) vs multiline JSON files.
JSON Lines vs JSON
Consider an example; our JSON looks like the one below.
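For illustration, a JSON file with three hypothetical rows:

```json
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"},
  {"id": 3, "name": "Carol"}
]
```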
Here we can see that we have 3 rows, all enclosed inside a JSON array.
Now if we represent the same data as JSON Lines, it would look something like the example below.
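For example, three hypothetical records in JSON Lines form:

```json
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Carol"}
```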
So, compared to plain JSON, a JSONL file clearly lacks the following:
- Square brackets representing an array; in a JSONL file every line holds one record, delimited by a newline.
- Commas; JSONL files do not require a comma after every record.
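The difference can be sketched with Python's standard json module (the sample records here are hypothetical):

```python
import json

# Plain JSON: the whole file is one document (here, an array of records).
json_text = '[{"id": 1}, {"id": 2}, {"id": 3}]'
records_from_json = json.loads(json_text)

# JSON Lines: one complete JSON document per line, no array, no commas.
jsonl_text = '{"id": 1}\n{"id": 2}\n{"id": 3}'
records_from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]

# Same data, different layout on disk.
assert records_from_json == records_from_jsonl
```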
Reading JSON & JSONL files
- By default, Spark assumes JSON files contain JSON Lines (JSONL format) rather than multiline JSON.
- The behaviour when working with JSON files differs between Spark versions prior to 2.2 and Spark 2.2 onwards.
Working with Single Line Records prior to Apache Spark 2.2
- When working with JSON files (both JSONL and JSON), if the whole record is present on a single line, then we can simply read it using
spark.read.json(<path>)
- When we read this data, we get the results as expected, in the form of a DataFrame.
Working with Multi Line Records prior to Apache Spark 2.2
- If we have data spread across multiple lines, as below,
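For illustration, a file whose records span several lines (hypothetical data):

```json
[
  {
    "id": 1,
    "name": "Alice"
  },
  {
    "id": 2,
    "name": "Bob"
  }
]
```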
- If we try to simply read it, we will get _corrupt_record entries instead of parsed rows.
Solution
- To be able to read multiline JSON records prior to Spark 2.2, we have to use sc.wholeTextFiles(), which gives us an RDD that we can then convert to a DataFrame.
Working with Multi Line Records post Apache Spark 2.2
- With Apache Spark 2.2+, it becomes very easy to work with multiline JSON files; we just need to add the option
multiline='true'
- Suppose we have the following data,
- Now we can simply read it using spark.read.json() with the option multiline='true'.
Thanks for reading! Please comment with any queries or corrections.
#happy_coding #codebrace