Debugging Memory Leaks in a Containerised Node Application

Todd Runham
Sep 28 · 7 min read
Photo by Luis Tosta on Unsplash

At Gousto we’d been struggling to upgrade our server-side rendered React 15 app to React 16 due to an elusive memory leak, one that had either been triggered or amplified by the version change.

Several attempts were unsuccessful, so we decided to take a step back and approach it from a different angle.

While our leak concerned a containerised React application, the methods and solutions detailed here will apply to any JS framework with a Node-based server.

“In essence, memory leaks can be defined as memory that is not required by an application anymore that for some reason is not returned to the operating system or the pool of free memory.” — Sebastian Peyrott

What we’ll be using

To debug these leaks, it makes sense to get access to the container’s memory allocation locally. We can then rapidly audit the data being created by our application, and identify what memory isn’t being “returned to the operating system.”

Testing possible fixes for these anomalies locally is crucial, as it may involve a lot of trial and error.

You’ll also need some metrics to validate these fixes. A good source for this is CPU and memory usage.

To satisfy these conditions we’re going to be using a combination of Prometheus and Grafana for the metrics, Chrome’s Node inspector to delve into the heap memory and Locust to simulate typical user behaviour across our application.

Initial setup

At Gousto, our web stack uses Docker, so we’ll be focusing on setting up the tooling for that. We’ll need to expose the relevant Docker ports to our localhost so we can retrieve the appropriate data. Don’t forget to shut these ports off when pushing to a non-local environment!

Setting up the tooling for non-containerised environments should be straightforward, as you can hook directly into the processes and ports.

Prometheus & Grafana

Prometheus is an open-source monitoring and alerting tool originally built at SoundCloud. For what we’re trying to solve here, you won’t need to dive into the Prometheus UI, as Grafana will take responsibility for that part of our debugging stack; we’ll just be using Prometheus to store and query the container metrics.
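To give a flavour of what that querying looks like, the dashboard panels are driven by PromQL expressions along these lines (this assumes the stack’s cAdvisor service is scraping your containers, and the container name is a placeholder):

# memory currently used by a given container
container_memory_usage_bytes{name="my-node-app"}

# per-second CPU usage for the same container, averaged over 5 minutes
rate(container_cpu_usage_seconds_total{name="my-node-app"}[5m])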

Grafana is also an open-source tool, geared towards dynamic visualisation of data.

Getting these two up and running is easy, thanks to work done by Brian Christner and various contributors to create a Docker Compose stack containing everything you need. You can follow the installation instructions here.

The great thing about this is that once the setup is complete, it’s simple to stop and start these services through the Docker CLI for future use.
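For example, from the directory containing the stack’s docker-compose.yml:

docker-compose stop    # pause the monitoring services
docker-compose start   # bring them back up later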

The Docker Compose stack also comes with a pre-built Grafana dashboard, which you can install through Grafana’s handy import feature.

If everything has gone to plan, you should be able to see your containers’ CPU and memory metrics in Grafana (localhost:3000) when selecting the installed dashboard.


Chrome Node Inspector

Chrome’s developer tools come with a built-in way of debugging node instances. You can view this by heading to chrome://inspect and selecting “Open dedicated DevTools for Node”.

If you're running Node locally outside of containers, it should just work out of the box. All you need to do is start your Node app with the --inspect flag, open the inspector, and everything should hook itself together.
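For example, assuming an entry file called server.js (swap in your own):

node --inspect server.js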

There is more configuration needed if we’re working with Docker containers, as we have to expose the Node instance to the localhost. To do this, we need to add "9229:9229" to the ports list in the relevant service block of our docker-compose.yml.

Full Docker Compose YAML based service block, including the necessary port mapping
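A minimal sketch of such a block (the service name, image setup, and entry file are placeholders rather than Gousto’s actual config, and it includes the --inspect flag discussed below):

version: "3"
services:
  web:
    build: .
    # expose the inspector on all interfaces so it is reachable from the host
    command: node --inspect=0.0.0.0:9229 server.js
    ports:
      - "8080:8080"   # the app itself (adjust to your app's port)
      - "9229:9229"   # the Node inspector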

If you’re not using Docker Compose you can add -p 9229:9229 to your docker run command.

We also need to add --inspect=0.0.0.0:9229 to our Node app start command.

Locust

At Gousto we initially used the NPM package loadtest as it’s a great tool to get started quickly. Recently, we’ve switched to Locust.

Locust allows you to write load testing scenarios in code, giving you the ability to simulate real-world usage, such as spreading traffic across various areas of your application. Without this, it would be difficult to replicate and identify the memory leak.

This element of our stack should be agnostic of whether you’re running containers or not; as long as you are serving your app on a URI, it will work. Follow the instructions here to install.

An example of a Locust load test — don’t forget to add the correct port to the host property if necessary
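A minimal sketch using Locust’s HttpUser API, where the host, routes, and weights are illustrative placeholders rather than Gousto’s real scenarios:

from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # where the app under test is served; include the port if needed
    host = "http://localhost:8080"
    # pause between simulated user actions
    wait_time = between(1, 3)

    # hit the homepage three times as often as the menu page
    @task(3)
    def homepage(self):
        self.client.get("/")

    @task(1)
    def menu(self):
        self.client.get("/menu")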

Using the filename locustfile.py means Locust will automatically pick the test up.

Verifying the memory leak

Now that we have everything set up, it’s time to get an idea of what we’re dealing with. Let’s start by getting some visualisations of the leak, so we’re aware of the scale.

To do this, we’re going to start our Docker container and run load tests against our app in ~10-minute sequential bursts. The number of users and hatch rate you set in Locust is dependent on your system.

It’s best to use trial and error here and start with lower values such as ten users at a hatch rate of 1, and increase until you feel the load would be detrimental to the test rather than helpful.
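Starting a run looks something like this (the app URL is a placeholder); Locust’s web UI, served on localhost:8089 by default, is where you enter the number of users and the hatch rate:

locust -f locustfile.py --host=http://localhost:8080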

You don’t want to max out your memory debugging a memory leak!

If you do indeed have a memory leak, you will see something similar to the following when viewing the CPU and memory usage charts in the Grafana dashboard, and selecting the problematic container:

You can see the 7-8-minute bursts and the memory accumulating for ~30 minutes, which is indicative of a leak.

Fixing the memory leak

Now that we’ve verified we have a memory leak, we need to start our debugging phase. For this, we’re going to be using the Chrome DevTools Node inspector we set up previously.

Memory leaks fall into two categories: ones that occur once, and repetitive leaks that build up over time.

The former is rarely detrimental to CPU usage and often goes unnoticed. The latter, which we will be focusing on, should be treated as high priority as they can cause critical user-facing issues.

To see what memory is leaking, we’re going to start recording the heap using the allocation instrumentation timeline. We’ll then repeat a load test we did previously for ~5 minutes, after which the snapshot will render.

Much longer than this and the snapshot file size can become too large to analyse efficiently, although this will depend on how much memory your app is allocating.

The allocation instrumentation timeline option can be found in the Memory tab of the Node inspector
Once the snapshot has rendered, you will see the memory allocation

As you can see, objects and arrays are taking up a considerable amount of memory.

Leaks are often associated with duplicated, low-distance instances.


Low and repetitive distances mean similar memory is being allocated repeatedly (usually in a loop) and is most likely adding little value to the application.

Opening these objects and checking the properties and methods should give you an idea of what type of reference is being leaked.

In our case, we noticed a lot of objects had properties associated with fetch functionality and metadata. With this information, we were able to hunt through our codebase and find we were pre-loading all of our components with fetch functionality on the server, in a particularly inefficient way.

This led to memory not being collected, and removing this feature stabilised the memory usage and increased app load speed.

We were also using a deprecated function in the react-helmet library, which is known to cause memory issues. Updating to the suggested practice also drastically improved memory usage and removed the excessive metadata-related objects we spotted earlier.

Repeating references may not offer a clue as to what the source of the leak is. In this case, it’s best to research common JavaScript memory leak issues and check if any of the patterns are replicated within your codebase.
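One pattern worth checking for on the server is state held at module scope that grows on every request and is never trimmed, such as a cache or log that only ever has items appended. A contrived sketch (the file and function names are made up for illustration):

// requestLog.js: module-scoped state survives across requests
const seenRequests = [];

module.exports = function logRequest(req) {
  // every entry keeps a reference to request data, so the garbage
  // collector can never reclaim it and memory grows with each request
  seenRequests.push({ url: req.url, headers: req.headers, at: Date.now() });
};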

It’s also possible that the version of a JS framework or library you’re using has a problem. Some quick research against each of your package dependencies will highlight whether this is true or not.

Modern browsers tend to be very efficient at cleaning up unreachable memory on the client, but if issues do occur on the server, having the right tools locally to diagnose the problem will shave days or even weeks off of your investigation.

You can also use these same tools to optimise the performance of your app, or even create automation to run CPU diagnostics in your testing pipeline, to ensure you’re not introducing new leaks or regressing on the original issue.
