A Memory-Friendly Way of Reading Files in Node.js

How to read gigabytes of data with a limited amount of memory

Kasper Moskwiak
Jan 4
Photo by Zac Harris on Unsplash.

The need to read a file may arise in a variety of cases: a one-time job of parsing error logs, an application feature, a scheduled data migration task, part of a deployment pipeline, and so on. Whatever the reason, reading files in Node.js is a simple and straightforward task. A problem occurs, however, when the file size exceeds the amount of RAM in your machine. Beyond hardware limitations, available memory can also be capped by your VPS provider, Kubernetes pod settings, and the like.

A simple fs.readFile won't do the job. And closing your eyes in hopes that it will somehow pass this time won't help.

It’s time to do some memory-aware programming.

In this article, I will look into three ways of reading files in Node.js. My goal is to find the most efficient approach in terms of memory usage. I will cover:

  1. Built-in fs.readFileSync
  2. Iterating over fs.createReadStream
  3. fs.read with a shared buffer

The Experiment

I implemented each approach as a small Node.js app and ran it inside a Docker container. Each application was given the task of processing a 1GB file in chunks of 10MB.

During the program execution, I measured the memory usage of the Docker container multiple times. Note that in the plots shown in this article, the measured memory is the sum of the memory used by the Node.js program itself and all processes in the Docker container (including the OS).

Have you noticed that the size of a single chunk is rather large? 10MB of ASCII text holds 10,000,000 characters (each such character takes one byte in UTF-8). That far exceeds the number of characters in a single line of an ordinary log file or CSV, and in a real-life application, a single line would be a more reasonable chunk size. Here, I use chunks of a size comparable to the memory footprint of an idle Docker container. This way, any differences between implementations sensitive to chunk size will be more visible in the charts.

Let’s quickly see what we are dealing with. In the chart below, we can see a moving maximum of the memory usage of each program:

Moving maximum of memory usage of createReadStream, read, and readFileSync.

We clearly see that the worst is readFileSync, which took over 1GB of memory. Next, with a much lower footprint, is createReadStream. The best is read, which uses less than 20MB (twice the chunk size).

The next plot shows the same data, but only for the last two functions:

Moving maximum of memory usage of createReadStream and read.

So now that we have a nice overview, let’s jump into the implementation of each approach.

The Easiest — and Deadliest — Approach Is to readFileSync a File

readFileSync, or its asynchronous sibling readFile, is the easy choice. It takes only one line of code to read a file and then a single for loop to iterate over the content:
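A minimal sketch of this approach (the file name and the chunk handling are illustrative, matching the 10MB chunks from the experiment):

```javascript
const fs = require("fs");

const CHUNK_SIZE = 10_000_000; // 10MB, as in the experiment

// Read the whole file into memory at once
const data = fs.readFileSync("./file.txt");

// Iterate over the content in 10MB chunks
for (let i = 0; i < data.length; i += CHUNK_SIZE) {
  const chunk = data.slice(i, i + CHUNK_SIZE);
  // process the chunk here
}
```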

The whole content of the file is kept in the data variable. So no surprise here: it takes at least 1GB of RAM. This approach is clearly not the best when we are dealing with large files. However, its simplicity and the fact that we get access to all of the data with a single line of code make it worth considering for smaller files.

Let Off Some Steam With createReadStream

In terms of ease of use, fs.createReadStream is as simple as readFile. It still takes a single line of code to read a file, and the two code snippets even look similar. The difference is that this method returns a stream, not the content of the file. A stream needs additional processing before we can access the actual data.

Streams may sound scary or could even be considered “advanced” programming. However, as you can see in the code example, we can just use a for-await loop to read it. This makes it as easy as iterating over an array:
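A minimal sketch, again with an illustrative file name:

```javascript
const fs = require("fs");

// Ask Node to read at most 10MB per chunk
const stream = fs.createReadStream("./file.txt", {
  highWaterMark: 10_000_000,
});

(async () => {
  // Each iteration yields the next chunk as a Buffer
  for await (const chunk of stream) {
    // process the chunk here
  }
})();
```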

The highWaterMark option tells Node to read at most the provided number of bytes at a time. That makes the program more memory-efficient, since only a limited amount of data is kept in memory in a single iteration.

This approach already gives us a much better result: 90MB. The memory usage of the container is ten times lower than in the previous example. Still, it is nine times larger than the chunk size.

Also, the deviation between measurements is rather large. Although we instruct Node.js to read 10MB per iteration, this does not guarantee that only a single chunk is kept in runtime memory at any point in time. Old chunks are eventually removed by the garbage collector, but we have no control over when that happens.

There is a way to achieve better results in both the level and the consistency of memory usage. Let’s see how to achieve it using fs.read with a shared buffer.

More Control With fs.read and a Shared Buffer

This approach is a little more complicated than the previous two. However, I was able to achieve the lowest memory usage and variation this way.

The application is composed of three simple parts that I will cover step by step:

  1. A promisified fs.read
  2. An async generator
  3. The main application loop to process data

But first, what is a “shared” buffer?

Separate buffers vs. shared buffer

A shared buffer is a variable passed by reference to all functions. Instead of creating a new buffer in each function, I create a single buffer at the beginning of the program and pass it down. In the code examples, I refer to it by a variable named sharedBuffer, so it should be clearly visible.

This is the actual technique that allows me to lower both memory usage and its variation. In the chart below, there is a comparison between the two programs. Without a shared buffer, the program creates a new buffer for every chunk, holding redundant copies of the same data in memory. As we can see from the chart, this is very costly. Also, memory usage varies between 20MB and 80MB, which is due to garbage collection.

A shared buffer allows us to lower memory usage, making it more consistent as well.

Memory usage with and without a shared buffer

readBytes — a promisified fs.read

First, we create a wrapper for fs.read. Converting the built-in fs.read to a promise will simplify its usage. We invoke fs.read with the following arguments, per the documentation (a sketch of the wrapper follows the list):

  • fd — An integer representing a file descriptor. It will be created later in the program.
  • sharedBuffer — The buffer we are going to write data to.
  • 0 — The offset in the buffer to start writing at. We always write data at the beginning of sharedBuffer.
  • sharedBuffer.length —The number of bytes to read. In our case, it will always be the length of our buffer.
  • null — The position in the file to begin reading from. When the position is set to null, data is read from the current file position, which advances automatically after every call, so consecutive reads walk through the file from the first byte onward.
  • The last argument is a callback function.
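
A minimal sketch of such a wrapper, with error handling kept to a minimum:

```javascript
const fs = require("fs");

// Promisified wrapper around fs.read.
// Resolves with the number of bytes actually read into sharedBuffer.
function readBytes(fd, sharedBuffer) {
  return new Promise((resolve, reject) => {
    fs.read(fd, sharedBuffer, 0, sharedBuffer.length, null, (err, bytesRead) => {
      if (err) {
        return reject(err);
      }
      resolve(bytesRead);
    });
  });
}
```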

generateChunks — an asynchronous generator

Generators in JavaScript return iterators — a special object we can use in a for loop. We can think of them as dynamically updated arrays. In each step of an iteration, a generator may append the next element. That is what we are actually doing. In each step, we read the next portion of bytes and yield it as the next item of this “virtual array.”
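Here is a sketch of how such a generator might look, built on the readBytes wrapper above; the loop shape and variable names are my own:

```javascript
const fs = require("fs");

// Async generator yielding a file chunk by chunk,
// reusing a single shared buffer for every read
async function* generateChunks(filePath, size) {
  const sharedBuffer = Buffer.alloc(size);
  const fd = fs.openSync(filePath, "r"); // file descriptor
  let bytesRead = 0;

  do {
    // fill the shared buffer with the next portion of the file
    bytesRead = await readBytes(fd, sharedBuffer);
    if (bytesRead > 0) {
      // expose only the bytes read in this step (see below)
      yield sharedBuffer.slice(0, bytesRead);
    }
  } while (bytesRead > 0);

  fs.closeSync(fd);
}
```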

With the iterator returned by the generateChunks generator, we can iterate over the file in a for loop:
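For example (the file name is illustrative, assuming the generateChunks sketch above):

```javascript
const CHUNK_SIZE = 10_000_000; // 10MB, as in the experiment

(async () => {
  // main application loop: process the file chunk by chunk
  for await (const chunk of generateChunks("./file.txt", CHUNK_SIZE)) {
    // process the chunk here
  }
})();
```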

We have to be extra cautious in the last step of the iteration. The final portion of the data may be shorter than our buffer, but the size of the buffer itself is constant. This means that at the last step, the buffer may contain fresh data at the beginning and leftover data from the previous read at the end.

We handle this situation by accessing the data with buffer.slice(0, end). It returns the part of the buffer from the beginning up to end, the number of bytes read in the current step.

Powerful and risky…

While fs.read with a shared buffer is the most memory-friendly approach, the shared buffer itself is a little bit tricky. This technique, when used carelessly, may lead to:

  • Data leak — When a buffer is populated with bytes from previous or future iterations and is wrongly processed as current. This may happen especially while reading the end of a file.
  • Data malformation — When a shared buffer is unintentionally modified in a different part of the program.

Be extra cautious when using this approach!

Summary

Both fs.createReadStream and fs.read can lower memory usage while reading files in Node.js. If the structure of the data allows it to be streamed, either of the two methods will be suitable. The amount of memory used directly by our application depends on the chunk size, which should be chosen with the structure of the data and the available memory in mind.

The differences between fs.createReadStream and fs.read become more pronounced as the chunk size increases. Memory usage is about ten times lower for fs.read. In this benchmark, I used a 10MB chunk, so a tenfold difference in memory usage amounts to 100MB. In a real-life application, chunks will probably be much smaller, so the difference in memory usage may become negligible. In that case, a method’s simplicity and safety could be the deciding factors.
