Debugging Node.js

How we found memory leaks and slow/infinite loops


At Storify, we use Node.js to power our backend.

For most problems in production, like uncaught exceptions, we debug using our logs. However, we’ve seen several bugs, such as memory leaks and infinite loops, which are impossible to track down without debugging tools.
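A minimal sketch of that log-based pattern for uncaught exceptions (the handler and formatting here are my own illustration, not Storify's actual setup):

```javascript
// Format an error for the logs, including the stack trace when available.
function formatError(err) {
  return 'uncaught exception: ' + (err.stack || err.message || String(err));
}

// Log and exit on uncaught exceptions -- after one of these, the process
// state is suspect, so let the supervisor restart it.
process.on('uncaughtException', function (err) {
  console.error(formatError(err));
  process.exit(1);
});
```

This works for errors that announce themselves in the logs; as the rest of this post shows, leaks and wedges need other tools.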

In the past, we used Monit to restart our services when they became unhealthy. When we moved to the Livefyre infrastructure, we decided to stop doing automatic restarts, and start fixing bugs. Over the past few months, we have learned a lot about debugging in Node.js.

Memory leak

A few months ago, we knew that we had a memory leak, which was forcing us to restart our servers frequently. Finding it seemed impossible, since the bug could be in our own code, or any of the ~60 npm modules we were using.

We started looking on GitHub for known issues and stumbled upon a thread reporting a memory leak in Mongoose. By chance, we had discovered the source of our leak. Unfortunately, the bug had been reported, but not fixed.

Our coworker @devongovett decided it was time to fix the bug, but he didn’t know where to start looking.

Devon created a server that reported Node.js memory usage. His first discovery was that objects were left over after each query ran. The next experiment was to comment out various parts of Mongoose, looking for the leak. After a few tries, he found the culprit.

“I think I got lucky really ;-)”

Our experience fixing this bug convinced us that we needed better debugging tools to target our debugging.

CPU wedging

Occasionally our service would lock up for several minutes at a time.

[Chart: CPU usage of one server]

@tjfontaine, the project leader for Node.js, works on debugging tools, and he recommended we try out MDB. After reading this great article, I started playing around with the tool.

First try

I used gcore to get a first core dump, uploaded it to Manta and started analyzing. My first trace had a valid stack, but I couldn’t understand what the process was doing.

[ 00007fffe4967160 0x7f7a24ced619() ]
00007fffe496a1d0 uv__io_poll+0x148()
00007fffe496a230 uv_run+0xd8()
00007fffe496a2b0 _ZN4node5StartEiPPc+0x150()
0000000000000000 0x7f7a24c1e76d()

I had simply caught the process while it was idle.

Node.js was doing its job by waiting for an event to happen.

The trick to using gcore is to capture the dump while your program is in a broken state. There are a few options to get a useful trace:

  • Induce a core from inside a specific code path by adding
process.abort()
  • Induce a core when there is an uncaught exception by using
node --abort-on-uncaught-exception app.js
  • Kill the process when the CPU is wedged by using
kill -ABRT $pid
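The process.abort() option can also be automated with an event-loop watchdog (a sketch of my own, not something from our setup): if a timer fires much later than scheduled, the loop was blocked by synchronous work. Note the caveat that the callback can only run once the loop unblocks, so this dumps core just after a wedge ends; to catch the process mid-wedge, kill -ABRT from outside is still the way to go.

```javascript
var INTERVAL_MS = 1000;   // how often the watchdog ticks
var THRESHOLD_MS = 5000;  // tolerate up to 5s of lag before dumping core

// Lag in ms: how much later than scheduled this tick arrived.
function computeLag(lastTick, now, intervalMs) {
  return now - lastTick - intervalMs;
}

var last = Date.now();
var timer = setInterval(function () {
  var now = Date.now();
  if (computeLag(last, now, INTERVAL_MS) > THRESHOLD_MS) {
    // Abort while the bad state is still fresh, so mdb can inspect it.
    process.abort();
  }
  last = now;
}, INTERVAL_MS);
timer.unref(); // don't keep the process alive just for the watchdog
```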

Our first useful trace

Using the kill -ABRT method, I was finally able to get a good stack trace.

I’ll focus on the important part, but you can find the full trace here.

> 00007ffffd09c0c0::jsstack
[...]
7ffffd09c7c0 0x3bdddc420285 inlineContent (e8477f4cc01)
7ffffd09c810 0x3bddc784e83d juice (e8477f4c9b9)
[...]
7ffffd09c948 0x3bddc4fb80f0 renderWrapper (270dfadc0769)
[...]
7ffffd09cdb8 0x3bddc7f2858b forEach (28a6e1015579)
[...]
7ffffd0a0ac0 node::Start+0x172

From this trace, we started looking at Juice, which we use to inline CSS when building RSS feeds. We use this module in only one place in our web app code. Here is a sample:

async.forEach(stories, function(story, cb) {
  res.render('story/_minimal', {
    //…
  }, function(e, html) {
    if (e) return cb(e);
    var description = juice(html, someCss);
    // …
  });
}, callback);

At this point I was able to reproduce the hang locally. Further investigation pointed the finger at Jsdom, which was slow for large HTML documents. Rather than fix Jsdom, we started looking for a faster module. Styliner, which is built on Cheerio, processed in only 3 seconds a job that had taken Juice more than 5 minutes.

Conclusions

I wish we had had this amazing tool for the Mongoose leak. Now we can target bugs directly, without guessing.

Some bugs are impossible to discover in a non-production environment. It's important to have tools in production that let you understand what your program is doing.

Today we use Monit to automatically collect core dumps before restarting a process:

stop program = "/bin/bash -c '/bin/kill -ABRT `/bin/cat <%= processpidfile %>`'"

And then we link Monit logs to Hipchat. Now we know when a restart happens, and we get a coredump for analysis.

Hopefully, this will help you build a more robust codebase, and make things less painful for your users.


Special thanks to @andrewguy9 for vetting my post.
