Process a delimited file

Mayank C
Tech Tonic


Oftentimes there is a need to process a delimited file: comma-separated, tab-separated, line-separated, and so on. The file could be small or large, so the processing must be memory efficient. Regardless of the file’s size, there should be no memory spike.

In this article, we’ll go over the efficient processing of a delimited file. We’ll consider two common types of delimited files: comma-separated and line-separated. The example files look like this:

The CSV file contains a single line of comma-separated values, while the LS file contains one value per line:

$ cat readingsCsv.txt 
8094,4953,6102,6704,5486,7302,7017,4616,3734,7537
$ cat readingsLS.txt
7465
1883
9529
3343
9429
3506
9141
3136
1601
8666

We’ll also do a round of testing on large files (100M and 500M). The output of the processing is the average (or mean) of all the readings (numbers) present in the file.

Let’s get started.

Basics

To efficiently process a comma- or line-separated file, we need to process the file in chunks. This ensures that there is no memory spike. If the entire file is loaded into memory, Deno’s memory usage can grow into gigabytes, depending entirely on the file size.
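For contrast, here is what the naive approach looks like (a hypothetical sketch, not something we’ll use): the whole file lands in memory at once, and the split materializes a second full copy as an array of strings.

// Naive approach: loads the entire file into memory at once
const text = await Deno.readTextFile("readingsCsv.txt");
// ...and then holds a second full copy as an array of strings
const values = text.split(",");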

There are three simple steps in processing a delimited file:

  • Get a Deno.Reader for the file
  • Use std’s async generator function readStringDelim to read the file in chunks
  • Process the chunks

The steps are the same whether the file is comma separated or line separated. Let’s go over the steps in detail.

Step 1

To get a Deno.Reader for a file, simply open it using the Deno.open API:

const reader = await Deno.open("file_path");

The Deno.open API returns a Deno.File object that implements the Deno.Reader interface.
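If it helps to see the interface in action, a Deno.Reader exposes a single read method that fills a buffer and resolves to the number of bytes read (or null at EOF). A minimal sketch, using a hypothetical file path:

const file = await Deno.open("file_path");
const buf = new Uint8Array(16);
// Deno.Reader contract: read(p) resolves to number | null (null means EOF)
const bytesRead = await file.read(buf);
console.log(bytesRead, "bytes read");
file.close();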

Step 2

Once the file is opened, the data can be read in chunks using the readStringDelim API from the io module of Deno’s standard library. The API takes two inputs:

  • Deno.Reader
  • delimiter

The API returns an async iterable that can be iterated over to read the file in chunks. Each chunk contains the data up to the next occurrence of the delimiter.

import { readStringDelim } from "https://deno.land/std/io/buffer.ts";

const reader = await Deno.open("file_path");
const dataSource = readStringDelim(reader, ",");
for await (const s of dataSource) {
  // Process chunk
}

For big files, reading chunk by chunk is slower than reading everything at once, but it won’t cause any memory spike.
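To verify the no-spike claim on your own machine, one option (assuming your Deno version provides Deno.memoryUsage()) is to sample the resident set size every so often while iterating. A minimal sketch, with an assumed large test file:

import { readStringDelim } from "https://deno.land/std/io/buffer.ts";

const reader = await Deno.open("readings100M.txt"); // assumed large test file
let chunks = 0;
for await (const _s of readStringDelim(reader, ",")) {
  if (++chunks % 1_000_000 === 0) {
    // rss is the resident set size in bytes; it should stay flat
    console.log(`chunks: ${chunks}, rss: ${Deno.memoryUsage().rss}`);
  }
}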

Step 3

A chunk is whatever data is found till the next occurrence of the delimiter. As we want to calculate the average, we need to:

  • convert the chunk (a string) to a number
  • add the reading to a running total & increment the reading count by 1

Once all chunks have been processed, total/count would be the average.

let total = 0, count = 0;
for await (const s of dataSource) {
  total += Number(s);
  ++count;
}
const avg = total / count;
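One caveat: Number("") evaluates to 0, so a trailing delimiter or a blank line would add a bogus reading and skew the count. A slightly more defensive variant (our sketch, not required for the happy path) skips such chunks:

let total = 0, count = 0;
for await (const s of dataSource) {
  const trimmed = s.trim();
  if (trimmed === "") continue; // skip blanks caused by trailing delimiters
  const n = Number(trimmed);
  if (Number.isNaN(n)) continue; // skip anything that isn't a number
  total += n;
  ++count;
}
const avg = total / count;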

Complete Code

Now that we’ve gone through the steps, let’s write a small utility program that takes a file path and a delimiter as input, and produces the average as output. Here is the complete code of such an application:

For a line-separated file, we need to pass \\n instead of \n as the second command-line argument (the shell would otherwise swallow the backslash)

import { readStringDelim } from "https://deno.land/std/io/buffer.ts";

if (Deno.args.length !== 2) {
  console.error("Inputs required: <file> <delimiter>");
  Deno.exit(1);
}
const reader = await Deno.open(Deno.args[0]);
// Map the literal "\n" typed on the command line to a real newline
const delim = Deno.args[1] === "\\n" ? "\n" : Deno.args[1];
let count = 0, total = 0;
const dataSource = readStringDelim(reader, delim);
for await (const s of dataSource) {
  if (!s) break; // an empty chunk marks a trailing delimiter / end of data
  total += Number(s);
  ++count;
}
const avg = total / count;
console.log("Average of", count, "readings is", avg);
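The intro also mentioned tab-separated files; the escape mapping in the program above could be extended to cover them (a hypothetical tweak, not part of the utility):

const escapes: Record<string, string> = { "\\n": "\n", "\\t": "\t" };
const delim = escapes[Deno.args[1]] ?? Deno.args[1];

With this in place, passing \\t on the command line selects the tab character.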

First, let’s do a round of testing on small files (10 readings in a file):

$ cat readingsCsv.txt 
1868,7406,6331,9439,2512,8144,2351,9128,3840,1691,
$ cat readingsLS.txt
7465
1883
9529
3343
9429
3506
9141
3136
1601
8666
$ deno run --allow-read=./ app.ts ./readingsLS.txt \\n
Average of 10 readings is 5769.9
$ deno run --allow-read=./ app.ts ./readingsCsv.txt ,
Average of 10 readings is 5271

The application works perfectly.
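If you’d like to reproduce the upcoming large-file tests, here is one way to generate a comparable comma-separated file (a sketch with assumed file name and size; each reading is four digits plus a comma, i.e., 5 bytes, which matches the reading counts reported below):

// generate.ts: writes ~100M of random 4-digit readings, comma separated
const file = await Deno.open("readings100M.txt", {
  write: true,
  create: true,
  truncate: true,
});
const encoder = new TextEncoder();
const targetBytes = 100 * 1024 * 1024;
let written = 0;
while (written < targetBytes) {
  // Batch 10,000 readings per write call to keep syscall overhead low;
  // the last batch slightly overshoots the target, which is fine for test data
  const batch = Array.from(
    { length: 10_000 },
    () => String(1000 + Math.floor(Math.random() * 9000)),
  ).join(",") + ",";
  written += await file.write(encoder.encode(batch));
}
file.close();

Run it with: deno run --allow-write=./ generate.ts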

Now, let’s do a quick test on the large files:

$ du -kh readings100M.txt 
100M readings100M.txt
$ deno run --allow-read=./ app.ts ./readings100M.txt ,
Average of 20971520 readings is 5503.079414463044

In the above example, Deno processed ~21 million readings from a file of approx. 100M in size (each reading is four digits plus a delimiter, i.e., 5 bytes, so 100 × 1024 × 1024 / 5 = 20,971,520 readings). The maximum memory usage was ~34M.

Lastly, we’ll do a quick test on a 500M file:

$ du -kh readings500M.txt 
500M readings500M.txt
$ deno run --allow-read=./ app.ts ./readings500M.txt ,
Average of 104857600 readings is 5503.069314060212

In the above example, Deno processed ~105 million readings from a file of approx. 500M in size. The maximum memory usage was ~40M.

As we can see, memory usage stayed almost the same regardless of the file size.
