Cheng Lian
1 min read · Nov 14, 2016


It has to be done this way because Spark processes data in a distributed manner, which means a JSON input file may be split into multiple parts that get processed by multiple tasks.
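For instance, with newline-delimited input, each task only needs to scan for newlines within its split to find complete records. A minimal sketch of reading such a file (the file path and record contents are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JsonLinesExample")
  .getOrCreate()

// people.json contains one complete JSON record per line, e.g.:
//   {"name": "Alice", "age": 30}
//   {"name": "Bob", "age": 25}
// Spark may split this file across several tasks, and each task can
// recover complete records from its portion just by looking for newlines.
val df = spark.read.json("path/to/people.json")
df.show()
```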

During this process, we’d like to ensure that every single JSON record is fully processed by one and only one task. This requires us to determine the boundaries between adjacent JSON records so that we don’t accidentally truncate a record. A single newline is one of the simplest record boundaries, requiring the least parsing effort.
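To make the one-record-per-element contract concrete, here is a sketch using the RDD[String] variant of the JSON reader (available in Spark at the time; the sample records are invented). It reuses the `spark` session from the sketch above:

```scala
// Each string is one complete JSON record, exactly as it would appear
// on a single line of an input file.
val records = spark.sparkContext.parallelize(Seq(
  """{"name": "Alice", "age": 30}""",
  """{"name": "Bob", "age": 25}"""
))

// DataFrameReader.json(RDD[String]) parses each element independently,
// which is why a record must never span multiple elements/lines.
val df = spark.read.json(records)
df.printSchema()
```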

In contrast, to allow multi-line JSON records, we’d have to parse every single JSON record just to determine the boundaries (simple RegEx tricks don’t help here, since JSON syntax is recursive rather than regular). This may dramatically impact runtime efficiency.
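If the input really is pretty-printed across multiple lines, one possible workaround (assuming each file holds exactly one JSON document, and accepting the loss of intra-file parallelism) is to read whole files first and only then parse them:

```scala
// wholeTextFiles yields (path, fileContent) pairs, with each file read
// in full by a single task, so no record is ever split at an arbitrary
// byte offset. The directory path here is hypothetical.
val wholeDocs = spark.sparkContext
  .wholeTextFiles("path/to/multiline-json")
  .map { case (_, content) => content }

val df = spark.read.json(wholeDocs)
```

The trade-off is exactly the one described above: since record boundaries can’t be found cheaply, each file must be treated as an indivisible unit.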
