The JSON Lines format is better than JSON when storing a list of objects in a file

Yiqun Rong
6 min read · Sep 6, 2023

Imagine that you are working on reporting of user data that comes from various data sources, such as Elasticsearch and DynamoDB, and you want to export that data to a file, either locally or in an S3 bucket, so that you can generate a report from it.

What is the best data format for saving the data to a local file or uploading it to an S3 bucket? JSON may be the first format that comes to mind. A JSON object array looks like this:

[
  {
    "id": 1,
    "first_name": "Arvy",
    "last_name": "Derisley",
    "email": "aderisley0@4shared.com",
    "gender": "Male",
    "ip_address": "192.49.240.128"
  },
  {
    "id": 2,
    "first_name": "Stanley",
    "last_name": "Arnefield",
    "email": "sarnefield1@flickr.com",
    "gender": "Male",
    "ip_address": "138.131.169.201"
  }
]

The other option is the JSON Lines format (JSONL). Each line represents a complete JSON object terminated by the newline character \n. It looks like this:

{"id": 1, "first_name": "Arvy", "last_name": "Derisley", "email": "aderisley0@4shared.com", "gender": "Male", "ip_address": "192.49.240.128"}
{"id": 2, "first_name": "Stanley", "last_name": "Arnefield", "email": "sarnefield1@flickr.com", "gender": "Male", "ip_address": "138.131.169.201"}

There are three advantages to using the JSON Lines format over the JSON format when storing a list of objects in a file.

Saving/loading an array of objects to/from a JSON Lines file can reduce memory and CPU usage compared with a JSON file

Before you can save an array of objects to a JSON file, you need to call JSON.stringify() to serialise the entire JavaScript array. This is very slow and uses a lot of memory when the array is huge. Loading the objects back from a JSON file performs just as badly, because JSON.parse() has to read and parse the whole file at once.

const fs = require('fs/promises');

const people = [
  {
    "id": 1,
    "first_name": "Arvy",
    "last_name": "Derisley",
    "email": "aderisley0@4shared.com",
    "gender": "Male",
    "ip_address": "192.49.240.128"
  },
  {
    "id": 2,
    "first_name": "Stanley",
    "last_name": "Arnefield",
    "email": "sarnefield1@flickr.com",
    "gender": "Male",
    "ip_address": "138.131.169.201"
  }
];

async function saveFile(items, fileName) {
  try {
    await fs.writeFile(fileName, JSON.stringify(items)); // slow: serialises the whole array at once
  } catch (err) {
    console.log(err);
  }
}

async function readFile(fileName) {
  try {
    const content = await fs.readFile(fileName, { encoding: 'utf8' });
    return JSON.parse(content); // slow: parses the whole file at once
  } catch (err) {
    console.log(err);
  }
}

(async () => {
  // save people to file
  await saveFile(people, "people.json");
  console.log("people saved");
  // get people from file
  const loaded_people = await readFile("people.json");
  console.log("people loaded", loaded_people);
})();

What can we do with a JSON Lines file instead? Rather than stringifying one huge array, we can serialise and write individual objects line by line. The Node.js Stream API is very useful here.

const { createWriteStream, createReadStream } = require('fs');
const { Readable, Transform, pipeline } = require("stream");
const { promisify } = require("util");
const pipelinePromise = promisify(pipeline);
const split2 = require("split2");

const people = [
  {
    "id": 1,
    "first_name": "Arvy",
    "last_name": "Derisley",
    "email": "aderisley0@4shared.com",
    "gender": "Male",
    "ip_address": "192.49.240.128"
  },
  {
    "id": 2,
    "first_name": "Stanley",
    "last_name": "Arnefield",
    "email": "sarnefield1@flickr.com",
    "gender": "Male",
    "ip_address": "138.131.169.201"
  }
];

async function saveItemsToFile(items, fileName) {
  try {
    const writable = createWriteStream(fileName);
    // the write stream can be replaced by an S3 upload stream
    const transformToText = new Transform({
      objectMode: true,
      transform(item, encoding, callback) {
        this.push(JSON.stringify(item)); // stringify each individual object
        this.push("\n"); // delimited by the newline character
        callback();
      },
    });
    await pipelinePromise(Readable.from(items), transformToText, writable);
  } catch (err) {
    console.log(err);
  }
}

const getItemsFromFileStream = (fileName) => {
  const readable = createReadStream(fileName);
  const lineToObjectTransform = new Transform({
    objectMode: true,
    transform(chunk, encoding, callback) {
      const item = JSON.parse(chunk.toString());
      this.push(item);
      callback();
    },
  });

  // returns a stream of objects while reading the file content chunk by chunk
  return readable.pipe(split2()).pipe(lineToObjectTransform);
};

const getItemsFromFile = async (fileName) => {
  const objects = [];
  const stream = getItemsFromFileStream(fileName);
  for await (const object of stream) {
    objects.push(object);
  }
  return objects;
};

(async () => {
  // save people to file
  await saveItemsToFile(people, "people.jsonl");
  console.log("people saved");
  // get people from file
  const loaded_people = await getItemsFromFile("people.jsonl");
  console.log("people loaded", loaded_people);
})();

The most important feature of JSON Lines is that it allows you to fetch and process objects from the file chunk by chunk, without reading the whole file first. This is achieved with Node.js streams and the split2 library, which re-chunks the stream so that each chunk is exactly one line; we can then parse each line as an individual object. The function getItemsFromFileStream() in the example can serve as the input source for any downstream data processing.
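
For instance, a downstream consumer can aggregate over the records as they stream in, without ever materialising the whole file in memory. Here is a minimal sketch that reuses the getItemsFromFileStream() defined above; the filter condition is just an illustration.

// count matching records while streaming, without loading the whole file
const countMales = async (fileName) => {
  let count = 0;
  for await (const person of getItemsFromFileStream(fileName)) {
    if (person.gender === "Male") count += 1; // each object is handled as soon as its line is parsed
  }
  return count;
};

(async () => {
  console.log("males:", await countMales("people.jsonl"));
})();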

A JSON Lines file can be compressed and still be read chunk by chunk

Data files can be very large, and data transfer can be expensive, especially when you use services from a cloud provider such as AWS. In addition, I/O is usually the bottleneck in a distributed system. For example, reading a huge data file from an S3 bucket and processing it in a Lambda function can be slow and expensive.

If the data is saved in JSON Lines format, the data stream can be compressed on the way out, reducing the amount of data transferred, and the resulting compressed JSON Lines file can still be read chunk by chunk. This is achieved with the zlib library and the Node.js Stream API.

const { createWriteStream, createReadStream } = require('fs');
const { Readable, Transform, pipeline } = require("stream");
const { promisify } = require("util");
const pipelinePromise = promisify(pipeline);
const split2 = require("split2");
const { createGzip, createGunzip } = require('zlib');

const people = [
  {
    "id": 1,
    "first_name": "Arvy",
    "last_name": "Derisley",
    "email": "aderisley0@4shared.com",
    "gender": "Male",
    "ip_address": "192.49.240.128"
  },
  {
    "id": 2,
    "first_name": "Stanley",
    "last_name": "Arnefield",
    "email": "sarnefield1@flickr.com",
    "gender": "Male",
    "ip_address": "138.131.169.201"
  }
];

async function saveItemsToFile(items, fileName) {
  try {
    const writable = createWriteStream(fileName);
    const transformToText = new Transform({
      objectMode: true,
      transform(item, encoding, callback) {
        this.push(JSON.stringify(item)); // stringify each individual object
        this.push("\n"); // delimited by the newline character
        callback();
      },
    });
    await pipelinePromise(
      Readable.from(items),
      transformToText,
      createGzip(), // compress the text stream before writing
      writable
    );
  } catch (err) {
    console.log(err);
  }
}

const getItemsFromFileStream = (fileName) => {
  const readable = createReadStream(fileName);
  const lineToObjectTransform = new Transform({
    objectMode: true,
    transform(chunk, encoding, callback) {
      const item = JSON.parse(chunk.toString());
      this.push(item);
      callback();
    },
  });

  // returns a stream of objects while reading the file content chunk by chunk
  return readable
    .pipe(createGunzip()) // decompress the stream before splitting it into lines
    .pipe(split2())
    .pipe(lineToObjectTransform);
};

const getItemsFromFile = async (fileName) => {
  const objects = [];
  const stream = getItemsFromFileStream(fileName);
  for await (const object of stream) {
    objects.push(object);
  }
  return objects;
};

(async () => {
  // save people to file
  await saveItemsToFile(people, "people.jsonl.gzip");
  console.log("people saved");
  // get people from file
  const loaded_people = await getItemsFromFile("people.jsonl.gzip");
  console.log("people loaded", loaded_people);
})();

The typical gzip compression ratio for a JSONL file is around 2:1 or 3:1, which means you can save at least half of the data transfer cost and reduce loading time in your application.
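
If you want to check the ratio for your own data, a quick sketch is to compare the sizes of the files produced by the two examples above. (Note that the tiny two-record sample is dominated by gzip header overhead, so measure a realistically sized file; actual ratios depend on your data.)

const { statSync } = require('fs');

// compare the plain JSONL file with its gzipped counterpart
const raw = statSync('people.jsonl').size;
const compressed = statSync('people.jsonl.gzip').size;
console.log(`compression ratio ~ ${(raw / compressed).toFixed(1)}:1`);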

Fun fact: you can even join two compressed JSON Lines files, either through the Node.js API or with the Linux command cat people1.jsonl.gz people2.jsonl.gz > people.jsonl.gz. The resulting gzip file can be decompressed and parsed with the same Node.js Stream approach shown above.
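
Here is a minimal sketch of the Node.js variant, assuming two hypothetical input files people1.jsonl.gz and people2.jsonl.gz already exist. It simply appends the raw bytes of one file after the other; gzip allows multiple members in a single file, and createGunzip() decompresses them as one stream.

const { createReadStream, createWriteStream } = require('fs');

// append the bytes of one file to an already-open write stream
const appendTo = (file, writable) =>
  new Promise((resolve, reject) => {
    const readable = createReadStream(file);
    readable.on('error', reject);
    readable.on('end', resolve);
    readable.pipe(writable, { end: false }); // keep the destination open for the next file
  });

const concatGzipFiles = async (inputFiles, outputFile) => {
  const writable = createWriteStream(outputFile);
  for (const file of inputFiles) {
    await appendTo(file, writable);
  }
  writable.end();
};

(async () => {
  await concatGzipFiles(["people1.jsonl.gz", "people2.jsonl.gz"], "people.jsonl.gz");
  console.log("files joined");
})();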

The JSON Lines format is widely used by many AWS services

Here are some AWS services and features that can accept or work with JSON Lines:

  1. Amazon S3 Select & Amazon Glacier Select: These services allow you to run SQL-style queries on your data without having to load it into a database. You can use S3 Select or Glacier Select to query JSONL data stored in S3 or Glacier (see the sketch after this list).
  2. Amazon Elastic MapReduce (EMR): When processing large datasets using distributed frameworks like Apache Spark or Hadoop on EMR, JSONL can be a preferred format since each line can be processed independently.
  3. AWS Glue: AWS Glue is a fully managed ETL service that prepares and loads data for analysis. Glue can read data stored in JSONL format and transform or load it as necessary.
  4. Amazon Kinesis Firehose: When sending streaming data to destinations like Amazon S3, Elasticsearch, or Redshift, Kinesis Firehose can accept and work with JSONL-formatted data.
  5. Amazon Athena: Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can define tables in Athena where the underlying data is in JSONL format.
  6. AWS Batch: When processing batch workloads, if your application or script expects data in JSONL format, you can input such files for processing.
  7. Amazon SageMaker Ground Truth: For certain types of labeling jobs, such as object detection, you might provide input datasets in JSONL format.
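
As an illustration of point 1, here is a minimal sketch of querying a gzipped JSONL file with S3 Select, assuming the AWS SDK for JavaScript v3; the bucket name, key and region are hypothetical placeholders.

const { S3Client, SelectObjectContentCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" }); // assumed region

(async () => {
  const { Payload } = await s3.send(new SelectObjectContentCommand({
    Bucket: "my-report-bucket", // hypothetical bucket
    Key: "people.jsonl.gz",     // hypothetical key
    ExpressionType: "SQL",
    Expression: "SELECT s.email FROM S3Object s WHERE s.gender = 'Male'",
    InputSerialization: { JSON: { Type: "LINES" }, CompressionType: "GZIP" },
    OutputSerialization: { JSON: { RecordDelimiter: "\n" } },
  }));
  // the response payload is itself a stream of events carrying JSONL records
  for await (const event of Payload) {
    if (event.Records) {
      process.stdout.write(Buffer.from(event.Records.Payload).toString());
    }
  }
})();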

Conclusion

In conclusion, the JSON Lines format is very versatile for saving arrays with huge numbers of objects. Using it together with Node.js streams is a very good option if performance really matters to your application.
