JSON Lines format: Why JSONL is better than regular JSON for web scraping

CSV and JSON formats introduction

id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2

[
{"id":1,"father":"Mark","mother":"Charlotte","children":1},
{"id":2,"father":"John","mother":"Ann","children":3},
{"id":3,"father":"Bob","mother":"Monika","children":2}
]

[
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]},
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]},
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
]

In order to insert or read a record from a JSON array you have to parse the whole file, which is far from ideal.

Since every entry in a JSON Lines file is valid JSON, each line can be parsed/unmarshaled as a standalone JSON document. For example, you can seek within the file or split a 10 GB file into smaller files without parsing the entire thing.
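As a rough illustration of that point, the sketch below splits a JSON Lines file into smaller files by copying raw lines, and only parses a single line at the very end. The helper name `split_jsonl`, the file names, and the chunk size are assumptions made for this example.

```python
import itertools
import json
import os
import tempfile


def split_jsonl(path, lines_per_chunk, out_dir):
    """Split a JSON Lines file into smaller files without parsing any JSON."""
    chunk_paths = []
    with open(path, "r", encoding="utf-8") as src:
        for index in itertools.count():
            # Take the next batch of raw lines; no json.loads is needed here.
            batch = list(itertools.islice(src, lines_per_chunk))
            if not batch:
                break
            chunk_path = os.path.join(out_dir, f"chunk_{index}.jsonl")
            with open(chunk_path, "w", encoding="utf-8") as dst:
                dst.writelines(batch)
            chunk_paths.append(chunk_path)
    return chunk_paths


records = [
    {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
    {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]},
    {"id": 3, "father": "Bob", "mother": "Monika", "children": ["Jerry", "Karol"]},
]

with tempfile.TemporaryDirectory() as tmp:
    # Write a small sample file standing in for a much larger one.
    source = os.path.join(tmp, "families.jsonl")
    with open(source, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    chunks = split_jsonl(source, lines_per_chunk=2, out_dir=tmp)

    # Count the lines in each chunk, again without parsing them.
    chunk_sizes = []
    for chunk in chunks:
        with open(chunk, "r", encoding="utf-8") as f:
            chunk_sizes.append(sum(1 for _ in f))

    # Any single line is a standalone JSON document.
    with open(chunks[-1], "r", encoding="utf-8") as f:
        last_record = json.loads(f.readline())

print(chunk_sizes)
print(last_record["father"])
```

The same splitting cannot be done on a JSON array by cutting at line boundaries, because the pieces would no longer be valid JSON on their own.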

1. There is no need to read the whole file into memory before parsing it.
2. You can add records by simply appending lines to the file. If the entire file were a JSON array, you would have to parse it, add the new element, and then serialize the whole array back to JSON.
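The difference between the two append paths can be sketched as follows. The function names `append_jsonl` and `append_json_array` and the file names are hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one record to a JSON Lines file; existing content is untouched."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def append_json_array(path, record):
    """Append one record to a JSON array file by rewriting the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # parses every record already stored
    data.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)  # serializes every record again


first = {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]}
second = {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]}

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "families.jsonl")
    array_path = os.path.join(tmp, "families.json")

    # Start both files with one record.
    append_jsonl(jsonl_path, first)
    with open(array_path, "w", encoding="utf-8") as f:
        json.dump([first], f)

    # Add a second record to each file.
    append_jsonl(jsonl_path, second)
    append_json_array(array_path, second)

    with open(jsonl_path, "r", encoding="utf-8") as f:
        jsonl_records = [json.loads(line) for line in f]
    with open(array_path, "r", encoding="utf-8") as f:
        array_records = json.load(f)

print(jsonl_records == array_records)
```

Both files end up holding the same records, but the cost of the array version grows with the size of the file, while the JSON Lines version only writes the new line.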

So it is not practical to keep a multi-gigabyte dataset as a single JSON array. Because Dataflow kit users need to store and parse large volumes of data, we’ve implemented export to the JSONL format.

JSON Lines vs. JSON

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

JSON Lines vs. JSON text sequences

<RS>{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}<LF>
<RS>{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}<LF>
<RS>{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}<LF>
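In the example above, `<RS>` stands for the ASCII Record Separator control character (0x1E) and `<LF>` for a line feed, as defined for JSON text sequences in RFC 7464. A minimal sketch of reading such a sequence and re-emitting it as JSON Lines could look like this; the helper names `parse_json_seq` and `to_json_lines` are assumptions for illustration.

```python
import json

RS = "\x1e"  # ASCII Record Separator, written as <RS> in the example above


def parse_json_seq(text):
    """Parse a JSON text sequence: each record is prefixed with RS and ends with LF."""
    records = []
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if chunk:  # the piece before the first RS is empty
            records.append(json.loads(chunk))
    return records


def to_json_lines(text):
    """Convert a JSON text sequence to JSON Lines (one compact record per line)."""
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n"
        for record in parse_json_seq(text)
    )


sequence = (
    RS + '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    + RS + '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    + RS + '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

seq_records = parse_json_seq(sequence)
jsonl_output = to_json_lines(sequence)
print(jsonl_output, end="")
```

The only structural difference from JSON Lines is the leading separator character, which a plain line-by-line JSON parser would reject.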

JSON Lines vs. Concatenated JSON

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
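Because concatenated JSON has no delimiter between records, a reader cannot simply split on newlines; it has to let the parser report where each value ends. One possible sketch uses `json.JSONDecoder.raw_decode` from the Python standard library; the generator name `iter_concatenated_json` is an assumption for this example.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


concatenated = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}'
)

concatenated_ids = [record["id"] for record in iter_concatenated_json(concatenated)]
print(concatenated_ids)

# Whitespace and newlines between or inside values are tolerated too.
spaced = '{\n  "id": 1\n}\n{\n  "id": 2\n}\n'
spaced_ids = [record["id"] for record in iter_concatenated_json(spaced)]
print(spaced_ids)
```

Unlike JSON Lines, this approach needs a real JSON parser just to find record boundaries, so the input cannot be split or counted with ordinary line-oriented tools.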

Pretty printed JSON formats

{
  "id": 1,
  "father": "Mark",
  "mother": "Charlotte",
  "children": [
    "Tom"
  ]
}
{
  "id": 2,
  "father": "John",
  "mother": "Ann",
  "children": [
    "Jessika",
    "Antony",
    "Jack"
  ]
}
{
  "id": 3,
  "father": "Bob",
  "mother": "Monika",
  "children": [
    "Jerry",
    "Karol"
  ]
}

Conclusion

The complete JSON Lines file as a whole is technically no longer valid JSON, because it contains multiple JSON texts.
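A quick way to see this is to hand the whole file to a standard JSON parser. In the sketch below, Python's `json.loads` rejects the complete text but accepts every individual line; the variable names are illustrative only.

```python
import json

jsonl_text = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

# Parsing the whole text fails: there is extra data after the first value.
try:
    json.loads(jsonl_text)
    whole_file_is_valid_json = True
except json.JSONDecodeError:
    whole_file_is_valid_json = False

# Parsing line by line works: each line is a complete JSON text.
line_records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(whole_file_is_valid_json)
print(len(line_records))
```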

Because every line is a separate entry, a JSON Lines file is streamable: to get N records, you only need to read N lines.
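As a small sketch of that streaming behavior, the generator below reads only as many lines as the caller asks for. It assumes every line holds exactly one record; the name `read_records` is hypothetical, and `io.StringIO` stands in for a real file or network stream.

```python
import io
import itertools
import json


def read_records(lines, limit):
    """Yield at most `limit` records, consuming one line per record."""
    for line in itertools.islice(lines, limit):
        yield json.loads(line)


stream = io.StringIO(
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

first_two = list(read_records(stream, 2))
print([record["id"] for record in first_two])

# The third line has not been read yet and is still available in the stream.
remaining_line = stream.readline()
print(json.loads(remaining_line)["id"])
```

With a real file opened via `open(path)`, the same function can be used unchanged, since file objects also iterate line by line.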