JSON Lines format: Why JSONL is better than regular JSON for web scraping

Dmitry Narizhnykh
Dec 26, 2018 · 4 min read

CSV and JSON formats introduction

Comma Separated Values (CSV) is a common data exchange format, widely used to represent sets of records that share an identical list of fields.

id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2

JSON can represent the same records as an array of objects:

[
{"id":1,"father":"Mark","mother":"Charlotte","children":1},
{"id":2,"father":"John","mother":"Ann","children":3},
{"id":3,"father":"Bob","mother":"Monika","children":2}
]

Unlike CSV, JSON can also hold nested structures, such as a list of children's names for each family:

[
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]},
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]},
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
]

In order to insert or read a record from a JSON array you have to parse the whole file, which is far from ideal.

Since every entry in JSON Lines is valid JSON, each line can be parsed/unmarshaled as a standalone JSON document. For example, you can seek within the file or split a 10 GB file into smaller files without parsing the entire thing.
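
Because a record never spans more than one line, splitting only requires counting newlines. Here is a minimal Python sketch of that idea. The function name split_jsonl, the chunk file naming and the tiny demo file are assumptions made for illustration, not part of any particular tool.

```python
import os
import tempfile


def split_jsonl(path, lines_per_chunk, out_dir):
    """Split a JSON Lines file into chunk files without parsing any JSON."""
    chunk_paths = []
    out = None
    with open(path, "rb") as src:
        for index, line in enumerate(src):
            # Start a new chunk file every `lines_per_chunk` lines.
            if index % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                name = f"chunk_{len(chunk_paths):04d}.jsonl"  # hypothetical naming
                chunk_paths.append(os.path.join(out_dir, name))
                out = open(chunk_paths[-1], "wb")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


# Demo on a tiny temporary file; a large file is processed the same way,
# one line at a time, so memory use stays small.
demo_dir = tempfile.mkdtemp()
source_path = os.path.join(demo_dir, "families.jsonl")
with open(source_path, "w", encoding="utf-8", newline="\n") as f:
    f.write('{"id":1,"father":"Mark"}\n')
    f.write('{"id":2,"father":"John"}\n')
    f.write('{"id":3,"father":"Bob"}\n')

chunks = split_jsonl(source_path, lines_per_chunk=2, out_dir=demo_dir)

# Count the lines that ended up in each chunk.
chunk_sizes = []
for chunk_path in chunks:
    with open(chunk_path, "rb") as f:
        chunk_sizes.append(sum(1 for _ in f))
print(chunk_sizes)  # [2, 1]
```

Note that the sketch copies bytes and never calls a JSON parser, which is exactly what a single JSON array would not allow.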

1. No need to read the whole file into memory before parsing it.
2. You can easily add further records by simply appending lines to the file. If the entire file were a JSON array, you would have to parse it, add the new record, and then convert it back to JSON.
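
The difference between the two approaches can be sketched in Python as follows. The helper names append_record and append_to_json_array, as well as the temporary file paths, are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def append_record(path, record):
    """JSON Lines: write one more line; existing lines are never read."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def append_to_json_array(path, record):
    """JSON array: parse the whole file, add the record, write it all back."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    records.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


demo_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(demo_dir, "families.jsonl")
array_path = os.path.join(demo_dir, "families.json")

# The array file must already contain valid JSON before it can be extended.
with open(array_path, "w", encoding="utf-8") as f:
    json.dump([], f)

families = [
    {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
    {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]},
]
for family in families:
    append_record(jsonl_path, family)
    append_to_json_array(array_path, family)

with open(jsonl_path, encoding="utf-8") as f:
    stored_lines = f.read().splitlines()
with open(array_path, encoding="utf-8") as f:
    stored_array = json.load(f)

print(len(stored_lines), len(stored_array))  # 2 2
```

Both files end up with the same records, but the cost of append_to_json_array grows with the size of the file, while append_record stays constant.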

So it is not practical to keep a multi-gigabyte file as a single JSON array. Since Dataflow kit users need to store and parse big volumes of data, we’ve implemented export to JSONL format.

JSON Lines vs. JSON

Exactly the same list of families expressed in JSON Lines format looks like this:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
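
Converting between the two representations takes only a few lines. The following Python sketch writes a list of records as JSON Lines and reads it back. The function names to_json_lines and from_json_lines are made up for this example.

```python
import json

families = [
    {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
    {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]},
    {"id": 3, "father": "Bob", "mother": "Monika", "children": ["Jerry", "Karol"]},
]


def to_json_lines(records):
    """Serialize each record as compact JSON followed by a newline."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


def from_json_lines(text):
    """Parse every non-empty line as a standalone JSON document."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


jsonl_text = to_json_lines(families)
print(jsonl_text, end="")

# Round trip: parsing the lines gives back the original records.
restored = from_json_lines(jsonl_text)
print(restored == families)  # True
```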

JSON Lines vs. JSON text sequences

Let’s compare the JSON text sequence format (RFC 7464) and its associated media type “application/json-seq” with NDJSON (newline-delimited JSON, essentially the same format as JSON Lines). A JSON text sequence consists of any number of JSON texts, all encoded in UTF-8, each prefixed by an ASCII Record Separator (0x1E), and each ending with an ASCII Line Feed character (0x0A).

<RS>{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}<LF>
<RS>{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}<LF>
<RS>{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}<LF>
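
To make the framing concrete, here is a simplified Python sketch that writes and reads such a sequence. It assumes well-formed input and does no recovery from truncated entries, which a production parser would need to handle. The function names are hypothetical.

```python
import json

RS = "\x1e"  # ASCII Record Separator
LF = "\n"    # ASCII Line Feed


def encode_json_seq(records):
    """Prefix every JSON text with RS and terminate it with LF."""
    return "".join(RS + json.dumps(r) + LF for r in records)


def decode_json_seq(text):
    """Split on RS and parse each non-empty piece as one JSON text."""
    records = []
    # A raw RS cannot occur inside valid JSON (control characters must be
    # escaped in strings), so splitting on it is safe.
    for piece in text.split(RS):
        piece = piece.strip()
        if piece:
            records.append(json.loads(piece))
    return records


families = [
    {"id": 1, "father": "Mark", "mother": "Charlotte", "children": ["Tom"]},
    {"id": 2, "father": "John", "mother": "Ann", "children": ["Jessika", "Antony", "Jack"]},
]

encoded = encode_json_seq(families)
decoded = decode_json_seq(encoded)
print(decoded == families)  # True
```

Compared with JSON Lines, the reader has to deal with an invisible control character, which most text editors and command line tools do not display well.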

JSON Lines vs. Concatenated JSON

Another alternative to JSON Lines is concatenated JSON. In this format the JSON texts are not separated from each other at all.

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}
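
With no separator, the only way to find where one record ends is to actually parse it. In Python this can be sketched with json.JSONDecoder.raw_decode, which returns the parsed value together with the position where parsing stopped. The function name decode_concatenated is an assumption for this example.

```python
import json


def decode_concatenated(text):
    """Parse back-to-back JSON texts that have no separator between them."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # Returns the decoded value and the index just past its end.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


concatenated = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}'
)

records = decode_concatenated(concatenated)
print([r["id"] for r in records])  # [1, 2, 3]
```

Unlike JSON Lines, this stream cannot be split or counted with line-oriented tools, because record boundaries are only known after parsing.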

Pretty printed JSON formats

If you have large nested structures then reading the JSON Lines text directly isn’t recommended. Use the jq tool to make viewing large structures easier:

{
  "id": 1,
  "father": "Mark",
  "mother": "Charlotte",
  "children": [
    "Tom"
  ]
}
{
  "id": 2,
  "father": "John",
  "mother": "Ann",
  "children": [
    "Jessika",
    "Antony",
    "Jack"
  ]
}
{
  "id": 3,
  "father": "Bob",
  "mother": "Monika",
  "children": [
    "Jerry",
    "Karol"
  ]
}
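
If jq is not at hand, a similar view can be produced with a few lines of Python. This is only a rough stand-in, assuming one record per line. The function name pretty_records is hypothetical.

```python
import json

jsonl_text = (
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
)


def pretty_records(text, indent=2):
    """Yield each JSON Lines record re-serialized with indentation."""
    for line in text.split("\n"):
        if line.strip():
            yield json.dumps(json.loads(line), indent=indent)


pretty = list(pretty_records(jsonl_text))
print("\n".join(pretty))
```

The indented output is meant for reading only. It is no longer JSON Lines, since each record now spans several lines.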

Conclusion

The complete JSON Lines file as a whole is technically no longer valid JSON, because it contains multiple JSON texts.

The fact that every new line means a separate entry makes a JSON Lines file streamable. You can read just as many lines as you need and get exactly that number of records.
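
As a small Python illustration of this streaming property, the sketch below takes the first two records from a stream and leaves the rest unread. An in-memory io.StringIO stands in for a real file here, and the generator name iter_records is made up for the example.

```python
import io
import itertools
import json


def iter_records(lines):
    """Lazily parse one record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open file or a network stream.
stream = io.StringIO(
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}\n'
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}\n'
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}\n'
)

# Read exactly two lines to get exactly two records.
first_two = list(itertools.islice(iter_records(stream), 2))
print([r["id"] for r in first_two])  # [1, 2]

# The third line has not been consumed yet.
next_line = stream.readline()
print(json.loads(next_line)["id"])  # 3
```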

HackerNoon.com

Written by Dmitry Narizhnykh, https://dataflowkit.com
