fs.readFile vs streams to read text files in node.js
The de facto standard way of reading text files in node.js is to use fs.readFile.
For quick prototypes, or if we’re just dealing with small text files, this approach is completely fine. However, if there is a chance that other people might use our code, or that it may at some point be run on text files of varying sizes, we may as well optimize our code and consider using streams instead.
Streams let us process a file chunk by chunk, so only a small part of it is held in memory at any one time. They also allow us to pick and choose which parts of the file we want to process and leave the rest of the data unread, which is where the speed-up comes from when we only need part of a file.
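As one illustration of reading only part of a file, fs.createReadStream accepts `start` and `end` byte offsets (both inclusive). This is a minimal sketch: the helper name `readByteRange` and the sample file are made up for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: read only bytes start..end (both inclusive) of a file.
function readByteRange(filePath, start, end) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    fs.createReadStream(filePath, { start, end })
      .on('data', (chunk) => chunks.push(chunk)) // chunks are Buffers here
      .on('error', reject)
      .on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
  });
}

// Made-up sample file in the temp directory, for this sketch only.
const rangePath = path.join(os.tmpdir(), 'range-example.csv');
fs.writeFileSync(rangePath, 'id,name,score\n1,ann,90\n');

// Ask for the first 13 bytes only, which is exactly the header line.
readByteRange(rangePath, 0, 12).then((text) => console.log(text));
```

The rest of the file is never requested, however large it is.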
I was interested in the trade-off between the ease of use of fs.readFile and the faster run time of streams. At what file size does it make sense to refactor away from readFile and use streams instead?
I devised a simple test: create dummy CSVs of various sizes and compare the time taken to extract the header of each CSV using 2 methods, one based on fs.readFile and one based on streams.
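The post does not show how the timings were taken, so the following is only one possible way to measure them. The helper name `timeIt` is hypothetical; it relies on `process.hrtime.bigint()`, which returns a nanosecond timestamp.

```javascript
// Hypothetical helper: time a single async call and report elapsed milliseconds.
async function timeIt(label, fn) {
  const start = process.hrtime.bigint(); // nanoseconds, as a BigInt
  const result = await fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${elapsedMs.toFixed(3)} ms`);
  return { result, elapsedMs };
}

// Usage with a stand-in task; in the real test, fn would read a CSV header.
timeIt('sleep 20ms', () => new Promise((resolve) => setTimeout(resolve, 20)));
```

A single run can be noisy, so repeating each measurement a few times and comparing the results is a sensible precaution.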
The first step is to use node.js to create some CSVs of various lengths.
node makedata.js will create 4 files with varying numbers of rows.
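The original makedata.js is not reproduced here, so this is a rough sketch of what such a script could look like. The column names, row counts, and output directory are assumptions; the row counts are kept small so the sketch runs quickly, and they would need to be larger (with wider rows) to reach the file sizes discussed in this post.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const HEADER = 'id,name,value'; // assumed columns, not taken from the post

// Hypothetical generator: one header line followed by `rows` data lines.
function makeCsv(filePath, rows) {
  const lines = [HEADER];
  for (let i = 0; i < rows; i++) {
    lines.push(`${i},name${i},${i * 2}`);
  }
  fs.writeFileSync(filePath, lines.join('\n') + '\n');
  return filePath;
}

// Four files with placeholder row counts, written to a fresh temp directory.
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'csv-data-'));
const files = [100, 1000, 10000, 100000].map((rows) =>
  makeCsv(path.join(outDir, `data-${rows}.csv`), rows)
);
console.log(files.join('\n'));
```

Building the whole file as one string is fine at these sizes; for much larger files, writing through fs.createWriteStream would keep memory use down.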
The next step is to run our 2 methods to read the headers of the CSV files.
read1 uses fs.readFile to read in the entire contents of each CSV and then returns the first line (the header).
The next method, read2, uses streams to read in data from the CSV file and then returns the header.
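The original read.js listing is not available here, so the following is a sketch of what a readFile-based header reader might look like, not the exact code that was benchmarked. Only the name `read1` comes from the post; the Promise wrapper and the sample file are assumptions.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch of a readFile-based header reader (named read1 after the post).
function read1(filePath) {
  return new Promise((resolve, reject) => {
    fs.readFile(filePath, 'utf8', (err, text) => {
      if (err) return reject(err);
      // The whole file is in memory by now; we keep only the first line.
      resolve(text.split(/\r?\n/, 1)[0]);
    });
  });
}

// Made-up sample file in the temp directory, for this sketch only.
const read1Sample = path.join(os.tmpdir(), 'read1-example.csv');
fs.writeFileSync(read1Sample, 'id,name,value\n1,a,2\n2,b,4\n');

read1(read1Sample).then((header) => console.log(header));
```

Note that the cost of this approach grows with the file, because every byte is read before the first line is extracted.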
The default setting in node.js for fs.createReadStream is to read the file in 64KB chunks. As soon as we have read the first chunk, we take the header from that string and then immediately destroy the stream so that it does not read any more chunks. This keeps performance fast no matter how big the file is.
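The idea of reading a chunk, taking the header, and destroying the stream can be sketched as follows. This is an illustration rather than the original read.js: only the name `read2` comes from the post, and the sketch also keeps reading if the header happens to be longer than one chunk, which is an addition of mine.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch of a stream-based header reader (named read2 after the post).
function read2(filePath) {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(filePath, { encoding: 'utf8' });
    let buffered = '';

    stream.on('data', (chunk) => {
      buffered += chunk;
      const newline = buffered.indexOf('\n');
      if (newline !== -1) {
        stream.destroy(); // stop here: no further chunks are read
        resolve(buffered.slice(0, newline).replace(/\r$/, ''));
      }
    });
    // A file without any newline: its whole content is treated as the header.
    stream.on('end', () => resolve(buffered));
    stream.on('error', reject);
  });
}

// Made-up sample file (about 300KB), so the data spans several chunks.
const read2Sample = path.join(os.tmpdir(), 'read2-example.csv');
fs.writeFileSync(read2Sample, 'id,name,value\n' + '1,a,2\n'.repeat(50000));

read2(read2Sample).then((header) => console.log(header)); // prints "id,name,value"
```

Because the stream is destroyed after the first chunk that contains a newline, the amount of data read does not depend on the size of the file.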
And here is the comparison for reading in CSVs of various sizes, ranging from 164KB to 160MB in size (x-axis is lines of CSV/file size, y-axis is the time taken for the relevant function of read.js to read the header):
As you can see, the stream method remains largely flat and fast, whereas the readFile method starts to increase noticeably as we approach ~16MB in file size.
TL;DR: my suggestion is that once you expect to deal with text files larger than around 10MB, it’s best to ditch readFile and start using streams instead.
There is one other thing to mention that the astute observer may have noticed: using a stream seems to be slightly faster on the 100000-line file than on the smaller files. I ran the code multiple times and still got the same result. Why is this? My guess is that it has something to do with the way createReadStream splits the file into chunks. If anyone has a better idea, then let me know!