Handling false memory leaks
For the past few months, we have gotten infrequent alerts from Librato about hitting our memory usage threshold (600MB/1024MB). Up until today, I thought it was a slow memory leak (likely because I had previously read articles like Eran Hammer’s Walmart Labs post), but it turns out we had never set a maximum GC size for Node.js, so memory was growing in an unbounded fashion.
The rest of this article will go into how we came to our conclusion and resolution.
Collecting the facts
Upon receiving our alert, we looked at Librato to confirm the issue:
Then, we ssh’d into our server and looked at `htop` to find the verdict. We forgot to take a screenshot of this, but the numbers were:
- 88MB used by server processes (x2)
- 313MB used by queue process
313MB is far more than we expected for our queue process. After this, we checked the uptime via `supervisor` to make sure no recent restarts had caused the memory drops in Librato:
After restarting our queue process (i.e. `supervisorctl restart app-queue`), the memory usage dropped to 82MB.
Since the queue hadn’t been restarted in 8 days and the memory drops in Librato occurred 1 and 4 days ago, we knew the issue had to be something in-process.
Reproducing the issue
Memory leak tangent
This section is mostly a guide of what not to do. Please skip to the “GC size issue” section if you only want practical information.
Initially I assumed that this must be a memory leak (it wasn’t). For the record, this would have been my first time fixing a memory leak, hence my eagerness to jump at that problem.
Under the memory leak premise, I looked at which queue jobs ran frequently enough to account for a leak. Our queue handles a few jobs:
- Sending emails (e.g. welcome, account deletion)
- Processing reminders (i.e. pre-interview reminder, follow-up reminder)
Our site is relatively low traffic, so emails are unlikely to be the culprit, which leaves reminders. I looked at the email reminder send times themselves via `psql`, but nothing seemed regular enough to correspond to the Librato graphs. As a result, the likely cause was the queue system itself or how we check/update reminders.
We use a `setTimeout` loop to repeatedly queue a job that checks for reminders every minute. We could have used `cron` but figured `setTimeout` was the simplest solution to start with.
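The loop can be sketched roughly as follows; the function names (e.g. `enqueueReminderCheck`) are hypothetical stand-ins, not our actual code:

```javascript
// Rough sketch of a self-rescheduling reminder-check loop (names are
// hypothetical; `enqueueReminderCheck` stands in for our queue call).
function startReminderLoop(enqueueReminderCheck, intervalMs) {
  let timer = null;
  function tick() {
    enqueueReminderCheck();               // queue a job to check for due reminders
    timer = setTimeout(tick, intervalMs); // re-arm for the next pass
  }
  tick();
  // Return a stopper so the loop can be shut down cleanly.
  return function stop() {
    clearTimeout(timer);
  };
}
```

Because everything stays in one long-lived process (unlike `cron`, which would start a fresh process per run), any per-iteration memory behavior accumulates in that single process.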
Since the loop runs every minute, allocations can add up over a few weeks. To attempt to reproduce the issue locally, we updated the loop to run every second.
At this point, memory usage wasn’t changing too wildly (e.g. hovering between 90–100MB), but I figured a slow memory leak could still be at play.
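One simple way to watch that hovering locally (not something shown in the original investigation; a sketch using Node’s built-in `process.memoryUsage()`):

```javascript
// Report heap statistics in MB so hovering (e.g. 90–100MB) is easy to spot.
// process.memoryUsage() is a standard Node.js API; no flags required.
function heapSnapshotMb() {
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  return { rssMb: toMb(rss), heapTotalMb: toMb(heapTotal), heapUsedMb: toMb(heapUsed) };
}

// Example: log a snapshot on each loop iteration.
console.log(heapSnapshotMb());
```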
I’m going to hand wave over this content as there’s nothing practical to be gained here:
- Attempted to reduce problem to Redis (queue system) or PostgreSQL (queries)
- Removed queries, saw memory drop and become super stable. Assumed it was due to PostgreSQL
- Moved to a `setInterval` loop directly calling queries, saw the same hovering memory behavior
- Used `heapdump` to extract heap dumps and load them into Chrome Dev Tools to locate the cause, but this wound up being a major time drain
Eventually, I came back another day (today) and found the actual cause.
GC size issue
While taking time away from the computer, I reasoned that I should try forcing GC, as deferred collection might explain the hovering memory (i.e. moving between 90–100MB instead of staying fixed at 90MB).
As a result, I added `--expose-gc` to my queue’s `node` invocation and started running `global.gc()` on every iteration of the `setTimeout` loop (attribution: https://simonmcmanus.wordpress.com/2013/01/03/forcing-garbage-collection-with-node-js-and-v8/)
I was still seeing the hovering behavior, so I disabled emails and forced all queries to retrieve results (i.e. removed the `where` filtering). This led to memory hovering between 130MB and 140MB, but still nothing as apparent as a leak.
Then, I removed the `global.gc()` call and finally saw the behavior from production:
At this point, it dawned on us that it wasn’t a memory leak but GC not running frequently enough. It turns out Node.js doesn’t consult the OS’s remaining free memory when deciding to GC; instead it has a generic setup for clearing items (i.e. collecting when the process is idle and the GC heap is over the roughly 1.5GB default limit):
https://github.com/nodejs/node/issues/2738
https://github.com/nodejs/node/wiki/Frequently-Asked-Questions
Resolution
With this information in hand, we set our GC size to something that won’t overwhelm our production box in the absence of an actual memory leak (e.g. `--max-old-space-size=200`). Then, we verified our patch fixed the issue locally:
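Since the queue runs under supervisor, the flag belongs on the program’s command line. A hypothetical supervisord entry (program name and script path are illustrative; only the flag comes from our change):

```ini
; Hypothetical supervisord entry; only --max-old-space-size=200 is the
; actual change, the rest is illustrative.
[program:app-queue]
command=node --max-old-space-size=200 queue-worker.js
autorestart=true
```

With old space capped at 200MB, V8 collects well before our alert threshold instead of drifting toward its ~1.5GB default.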
After that, we opened a PR, verified tests were green, and landed the patch to production. It’s only been an hour since landing but we’ll post an update if the issue reoccurs despite our expectations.