A Memory-Friendly Way of Reading Files in Node.js
The need to read a file may arise in a variety of cases. It may be a one-time job of parsing error logs, a functionality of an application, a scheduled data migration task, part of a deployment pipeline, etc. Regardless of the reason, reading files in Node.js is a very simple and straightforward task. However, a problem occurs when the file size exceeds the amount of RAM in your machine. Beyond hardware limitation, RAM can be limited by your VPS provider, Kubernetes pod settings, etc.
fs.readFile won't do the job. And closing your eyes in hopes that it will somehow pass this time won't help.
It’s time to do some memory-aware programming.
In this article, I will look into three ways of reading files in Node.js. My goal is to find the most efficient approach in terms of memory usage. I will cover:
- fs.readFileSync
- Iterating over fs.createReadStream
- fs.read with a shared buffer
I implemented each approach as a small Node.js app and ran it inside a Docker container. Each application was given the task of processing a 1GB file in chunks of 10MB.
During the program execution, I measured the memory usage of the Docker container multiple times. Note that in the plots shown in this article, the measured memory is the sum of the memory used by the Node.js program itself and all processes in the Docker container (including the OS).
Have you noticed that the size of a single chunk is rather large? 10MB of text data holds about 10,000,000 characters (an ASCII character in UTF-8 takes one byte). It far exceeds the number of characters in a single line of an ordinary log file or CSV, and the size of a single line would be a more reasonable chunk in a real-life application. I use chunks of a size comparable to the memory footprint of an idle Docker container. This way, any differences between implementations sensitive to chunk size will be more visible in the charts.
Let’s quickly see what we are dealing with. In the chart below, we can see a moving maximum of the memory usage of each program:
We clearly see that the worst is readFileSync, which took over 1GB of memory. Next, with a much lower footprint, is createReadStream. The best is read, which uses less than 20MB (twice the chunk size).
The next plot shows the same data, but only for the last two functions:
So now that we have a nice overview, let’s jump into the implementation of each approach.
The Easiest — and Deadliest — Approach Is to readFileSync a File
readFileSync, or its asynchronous sibling readFile, is the easy choice. It takes only one line of code to read a file and then a single for loop to iterate over the content:
The whole content of the file is kept in the data variable. So no surprise here: it takes at least 1GB of RAM. This approach is clearly not the best when we are dealing with large files. However, its simplicity, and the fact that we get access to all the data with a single line of code, make it worth considering for smaller files.
Let Off Some Steam With createReadStream
In terms of simplicity of usage, fs.createReadStream is as simple as readFile. It still takes a single line of code to read a file. In fact, these two code snippets even look similar. The difference is that this method returns a stream, not the content of a file. A stream needs additional processing to access the actual data.
Streams may sound scary or could even be considered “advanced” programming. However, as you can see in the code example, we can just use a for-await loop to read one. This makes it as easy as iterating over an array:
The highWaterMark option tells Node to read only the provided number of bytes at once. That makes it more memory-efficient, since only a limited amount of data is kept in memory in a single iteration.
This approach already gives us a much better result: 90MB. The memory usage of the container is ten times lower than in the previous example. Still, it is nine times larger than the chunk size.
Also, the deviation between measurements is rather large. Even though we instruct Node.js to read 10MB in a single iteration, there is no guarantee that only a single chunk is kept in runtime memory at any point in time. Old chunks are eventually removed by the garbage collector, but we have no control over when that happens.
There is a way to achieve better results, both in the level and the consistency of memory usage. Let’s see how to achieve it using fs.read with a shared buffer.
More Control With read and a Shared Buffer
This approach is a little more complicated than the previous two. However, I was able to achieve the lowest memory usage and variation this way.
The application is composed of three simple parts that I will cover step by step:
- A promisified fs.read
- An async generator
- The main application loop to process data
But first, what is a “shared” buffer?
A shared buffer is a variable passed by reference to all functions. Instead of creating a new buffer in each function, I create a single buffer at the beginning of the program and pass it down. In the code examples, I refer to it by a variable named sharedBuffer, so it should be clearly visible.
This is the actual technique that allows me to lower both memory usage and variation. In the chart below, there is a comparison between the two programs. Without a shared buffer, the program makes multiple copies of the same data, creating redundancy. As we can see from the chart, this is very costly. Also, memory usage varies between 20MB and 80MB due to garbage collection.
A shared buffer allows us to lower memory usage, making it more consistent as well.
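The key property this relies on is that Node.js Buffers are passed by reference, so a function can fill the caller's buffer in place without allocating a copy. A tiny demonstration:

```javascript
// A single buffer allocated once at the start of the program...
const sharedBuffer = Buffer.alloc(4);

// ...is passed by reference: writing into it mutates the caller's buffer.
function fillChunk(buffer) {
  buffer.write("data"); // writes UTF-8 bytes at offset 0
}

fillChunk(sharedBuffer);
// sharedBuffer now contains "data": no copy was made
```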
readBytes — a promisified fs.read
First, we create a wrapper for fs.read. Converting the built-in fs.read to a promise will simplify its usage. We invoke fs.read with the following arguments, per the documentation:
- fd — An integer representing a file descriptor. It will be created later in the program.
- sharedBuffer — The buffer we are going to write data to.
- 0 — The offset in the buffer to start writing at. We always write data at the beginning of sharedBuffer.
- sharedBuffer.length — The number of bytes to read. In our case, it will always be the length of our buffer.
- null — The position in the file to begin reading from. When the position is set to null, reading starts at the first byte and the position then updates itself automatically.
- The last argument is a callback function.
generateChunks — an asynchronous generator

Async generators can be consumed with a simple for-await loop. We can think of them as dynamically updated arrays: in each step of an iteration, a generator may append the next element. That is what we are actually doing here. In each step, we read the next portion of bytes and yield it as the next item of this “virtual array.”

With the iterator returned by the generateChunks generator, we can iterate over the file in a for-await loop:
We have to be extra cautious in the last step of the iteration. The final portion of the data may be shorter than our buffer size, while the size of the buffer itself stays constant. This means that the buffer may contain fresh data at the beginning and leftover bytes from the previous iteration at the end.
We handle this situation by accessing the data with buffer.slice(0, end). It returns the part of the buffer from the beginning to end, which is equal to the number of bytes read in the current step.
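A small demonstration of the stale-tail effect (the values here are illustrative):

```javascript
// Pretend the previous iteration filled the whole shared buffer...
const sharedBuffer = Buffer.from("ABCD");

// ...and the final read wrote only 2 fresh bytes at the beginning.
Buffer.from("xy").copy(sharedBuffer, 0);
const end = 2; // number of bytes read in the final step

// Reading the whole buffer exposes stale data from the previous iteration:
const raw = sharedBuffer.toString(); // "xyCD"

// Slicing keeps only the fresh bytes:
const fresh = sharedBuffer.slice(0, end).toString(); // "xy"
```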
Powerful and risky…

While fs.read with a shared buffer is the most memory-friendly approach, the shared buffer part is a little bit tricky. This technique, when used carelessly, may lead to:
- Data leak — When a buffer is populated with bytes from previous or future iterations and is wrongly processed as current. This may happen especially while reading the end of a file.
- Data malformation — When a shared buffer is unintentionally modified in a different part of the program.
Be extra cautious when using this approach!
Both fs.createReadStream and fs.read can lower memory usage while reading files in Node.js. If the data structure allows it to be streamed, either of the two methods will be suitable. The amount of memory directly used by our application depends on the chunk size, which should be chosen with attention to factors like the structure of the data and the available memory.
The differences between createReadStream and fs.read are accentuated as the chunk size increases. Memory usage is about ten times lower for fs.read. In this benchmark, I used a 10MB chunk, so a tenfold difference in memory usage amounts to roughly 100MB. In a real-life application, chunks will probably be much smaller, so the difference in the amount of used memory may become negligible. In that case, a method's simplicity and safety could be the deciding factors.