Handling false memory leaks
For the past few months, we have gotten infrequent alerts from Librato about hitting our memory usage threshold (600MB/1024MB). Up until today, I thought it was a slow memory leak (likely because I had previously read articles like Eran Hammer’s Walmart Labs post), but it turns out we had never set a maximum GC size for Node.js, so memory was growing in an unbounded fashion.
The rest of this article will go into how we came to our conclusion and resolution.
Collecting the facts
Upon receiving our alert, we looked at Librato to confirm the issue:
Then, we ssh’d into our server and looked at `htop` to find the verdict. We forgot to take a screenshot of this, but the numbers were:
- 88MB used by server processes (x2)
- 313MB used by queue process
313MB is far more than we expected for our queue process. After this, we checked the uptime via `supervisor` to make sure no recent restarts had caused the memory drops in Librato:
After restarting our queue process (i.e. `supervisorctl restart app-queue`), the memory usage dropped to 82MB.
Since the queue hadn’t been restarted in 8 days and the memory drops in Librato occurred 1 and 4 days ago, we knew the issue had to be something in-process.
Reproducing the issue
Memory leak tangent
This section is mostly a guide of what not to do. Please skip to the “GC size issue” section if you only want practical information.
Initially I assumed that this must be a memory leak (it wasn’t). For the record, this would have been my first time fixing a memory leak, hence my eagerness to jump at that problem.
Under the memory leak premise, I looked at which queue jobs ran frequently enough to account for a leak. Our queue handles a few jobs:
- Sending emails (e.g. welcome, account deletion)
- Processing reminders (i.e. pre-interview reminder, follow-up reminder)
Our site is relatively low traffic, so emails are unlikely to be the culprit, which leaves reminders. I looked at the email reminder send times themselves via `psql`, but nothing seemed regular enough to correspond to the Librato graphs. As a result, the likely cause was the queue system itself or how we check/update reminders.
We use a `setTimeout` loop to repeatedly queue a job that checks for reminders every minute. We could have used `cron` but figured `setTimeout` was the simplest solution to start with.
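The loop can be sketched roughly as follows; the function names (e.g. `enqueueReminderCheck`) are hypothetical stand-ins, not our actual code:

```javascript
// Rough sketch of a self-rescheduling reminder-check loop (names are
// hypothetical; `enqueueReminderCheck` stands in for our queue call).
function startReminderLoop(enqueueReminderCheck, intervalMs) {
  let timer = null;
  function tick() {
    enqueueReminderCheck();               // queue a job to check for due reminders
    timer = setTimeout(tick, intervalMs); // re-arm for the next pass
  }
  tick();
  // Return a stopper so the loop can be shut down cleanly.
  return function stop() {
    clearTimeout(timer);
  };
}
```

Because everything stays in one long-lived process (unlike `cron`, which would start a fresh process per run), any per-iteration memory behavior accumulates in that single process.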
Since the loop runs every minute, allocations can add up over a few weeks. To attempt to reproduce the issue locally, we updated the loop to run every second.
At this point, memory usage wasn’t changing too wildly (e.g. hovering between 90–100MB), but I figured a slow memory leak could still be at play.
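One simple way to watch that hovering locally (not something shown in the original investigation; a sketch using Node’s built-in `process.memoryUsage()`):

```javascript
// Report heap statistics in MB so hovering (e.g. 90–100MB) is easy to spot.
// process.memoryUsage() is a standard Node.js API; no flags required.
function heapSnapshotMb() {
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  return { rssMb: toMb(rss), heapTotalMb: toMb(heapTotal), heapUsedMb: toMb(heapUsed) };
}

// Example: log a snapshot on each loop iteration.
console.log(heapSnapshotMb());
```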
I’m going to hand wave over this content as there’s nothing practical to be gained here:
- Attempted to reduce problem to Redis (queue system) or PostgreSQL (queries)
- Removed queries, saw memory drop and become super stable. Assumed it was due to PostgreSQL
- Moved to a `setInterval` loop directly calling queries, saw the same hovering memory behavior
- Used `heapdump` to extract heap dumps and load them into Chrome Dev Tools to locate the cause, but this wound up being a major time drain
Eventually, I came back another day (today) and found the actual cause.
GC size issue
While taking time away from the computer, I reasoned that I should try forcing GC, as deferred collection might explain the hovering memory (i.e. moving between 90–100MB instead of staying fixed at 90MB).
As a result, I added `--expose-gc` to my queue’s `node` invocation and started running `global.gc()` on every iteration of the `setTimeout` loop (attribution: https://simonmcmanus.wordpress.com/2013/01/03/forcing-garbage-collection-with-node-js-and-v8/)
I was still seeing the hovering behavior, so I disabled emails and forced all queries to retrieve results (i.e. removed the `where` filtering). This led to memory hovering between 130MB and 140MB, but still nothing as apparent as a leak.
Then, I removed the `global.gc()` call and finally saw the behavior from production:
At this point, it dawned on us that it wasn’t a memory leak but GC not running frequently enough. It turns out Node.js doesn’t consult the OS’s remaining free memory when deciding to GC; instead it has a generic setup for clearing items (i.e. collecting when the process is idle and the GC heap is over the roughly 1.5GB default limit):
https://github.com/nodejs/node/issues/2738
https://github.com/nodejs/node/wiki/Frequently-Asked-Questions
Resolution
With this information in hand, we set our GC size to something that won’t overwhelm our production box in the absence of an actual memory leak (e.g. `--max-old-space-size=200`). Then, we verified our patch fixed the issue locally:
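Since the queue runs under supervisor, the flag belongs on the program’s command line. A hypothetical supervisord entry (program name and script path are illustrative; only the flag comes from our change):

```ini
; Hypothetical supervisord entry; only --max-old-space-size=200 is the
; actual change, the rest is illustrative.
[program:app-queue]
command=node --max-old-space-size=200 queue-worker.js
autorestart=true
```

With old space capped at 200MB, V8 collects well before our alert threshold instead of drifting toward its ~1.5GB default.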
After that, we opened a PR, verified tests were green, and landed the patch to production. It’s only been an hour since landing but we’ll post an update if the issue reoccurs despite our expectations.