An adventure with a memory leak in Node

If you’ve followed any of my previous posts about what we are building at TES, you will remember that we are using a proxy to perform the composition of fragments of HTML into pages (given 70% of our traffic is from organic SEO) — all via an open source project that I talk about a lot more here:

Well … we’re now completely, 100%, entirely live with the platform built this way, and have been for a few months, and so we have some good news to report.

The performance and operation of the overall system is great — most individual services are sub 10ms response times, low memory and very reliable (hey — they only do one thing!), the overall pages average 60ms response times even under full load, and all this on significantly less infrastructure then we had previously.

To top it all off using the fully automated deployment pipeline we have done roughly 60 deployments to live since we went fully live with the eCommerce elements two weeks ago. TES previously didn’t do this many releases in an entire year, let alone two weeks. Some of these small bugs, some fully fledged features — but all just one person picking up a ticket, doing the work, pushing to live.

Feel free to find yourself something here if you’re wondering what it is:

But, like all stories of going live with a complex, multi-faceted system not everything goes right. One of those is that we observed that the composition layer was exhibiting some strange behaviour regarding memory usage after we went live with all 400k of content pages:

Not great at all — we had 7 node processes per server (8 cpus), with 16GB memory per server, and what you see above is the node processes killing themselves with OOM. This is after an earlier round where we had the memory limits within Docker set lower than the Node max heap size, and Docker was killing the processes before we got to this point.

So, we clearly had a memory leak that none of the earlier tests we had been running on any of the individual components (that appeared to not show this behaviour) was highlighting to us.

I embarked on a rapid process of soak testing each of the pieces that fit into Compoxure — the first candidate was the library that gets and then caches the fragments:

A series of commits, including some that removed functionality we weren’t currently using, ensued, along with specific soak tests that were (I thought) designed to find out what is going on with memory growth.

I even used a set of production urls to test against (a lot of them!) to make sure that my test wasn’t too synthetic and so not replicating the live environments. I wrote a little module that dumped memory every second so that I could graph it in Excel …. nada:

So — imagine repeating this process against all the pieces (I won’t bore you but you can see all the commits in Github). Nothing obvious popped out.

The final approach, which myself and Steve Cresswell from Guidesmiths agreed never to share publicly, was to go into one of the live containers, manually install heapdump, restart a worker and then manually trigger heap dumps over the next few hours.

What this delivered however was very enlightening:

1GB of heap consumed by what appears to be the HTML fragments that are being parsed by the parser, and processed to generate the final output.

Looking at retainers, the culprit became clear:

It was the template engine we were using to parse the declarations, Hogan, holding on to sections of the strings that were being processed by htmlparser2, and in so doing forcing V8 to hold on to the entire document.

The specific code that did it is here:

Which basically just caches the template so that on subsequent use it isn’t re-compiled, a very sensible thing to do for ‘normal’ use, but we weren’t using it normally. Once we went live with all of the documents, we were effectively creating a set of micro cached templates for every single document that parsed through it, and then holding the whole document.

But — why does V8 hold on to the entire document if you grab just a small chunk of it? Good question:

So … our solution was to pull Hogan into the project directly, and remove the caching. It meant nothing was ever holding on to the substr of the document and so V8 could GC it as it should. I’ll package this up into a PR to make it configurable in the next few days.

The result, note the second deployment when we figured it out:

Everyone happy. Key take aways:

  1. If you do anything with large documents in node, remember that retaining any substr of it will retain the whole document — be careful.
  2. We didn’t actually need this feature of caching the compiled template, and in some respects don’t need the full power of a templating engine to parse the declarations — I should probably have looked deeper under the hood when selecting it and may still replace it with a simpler parser.
  3. Load testing to look for memory leaks must be able to replicate real production traffic — not always easy to do.
  4. Being able to perform heapdumps in live is invaluable — we’re going to automate it in future.