Debugging a Node.js Memory Leak on Production via Shadowed Traffic

Aurelijus Banelis
Published in home24 technology · Jan 25, 2021 · 7 min read

Memory leaks are hard to debug, especially when using programming languages with garbage collection. At home24 we had quite a journey searching for the cause of exit code 139 from crashed Docker containers.

The outcome of the debugging was not only a fixed bug, but also an example of how tools outside the JavaScript ecosystem can help in debugging JavaScript code.

So let’s start the journey…

In the beginning, it was just 5xx status codes

Number of crashed docker containers via AWS CloudWatch metrics

In the web service world, there is a convention to respond with 500–599 HTTP status codes when there is a non-recoverable problem in the application itself. Therefore the most popular Node.js frameworks (e.g. Express) and AWS load balancers also implement this behavior.

What cannot be found and fixed via unit tests while still in development ends up as 5xx status codes in the live environment. Code logic errors are relatively easy to fix because they are deterministic: with the same input and context, we get the same error. But there are other issues like network failures, hardware faults and… insufficient memory management. We still get the same 5xx error codes, but the fix ranges from wait-and-retry to the long story ahead.
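As a minimal illustration of that convention in Express (route and error message invented for this sketch), any error thrown from a handler ends up in an error-handling middleware that answers with a 5xx status:

const express = require('express');
const app = express();

app.get('/products', (req, res) => {
  // Any synchronous throw here is caught by Express and forwarded
  // to the error-handling middleware below.
  throw new Error('backend unavailable');
});

// Error-handling middleware: unexpected errors become HTTP 500,
// which load balancers then count as server-side failures.
app.use((err, req, res, next) => {
  console.error(err);
  res.status(500).send('Internal Server Error');
});

app.listen(3000);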

Memory leaks in a Docker container world

Example of the memory leak issue being delegated to other teams

If you are coming from bare-metal Node.js deployments, you would probably say “just restart pm2 and it is fixed”. But that does not work in the world of containerized applications because:

  • There is ecs-agent, Kubernetes, Docker or another orchestrator that restarts the application instead of pm2
  • Restarting the application mitigates the symptom, but we are still paying for the extra CPU and memory of an unoptimized application

At home24 we try to innovate fast and take advantage of the best programming language for each task, so we use Docker extensively. At first, we tried to increase the memory of the containers (think bigger servers), but that only postponed the issue.

Advertisement about discounts before Black Friday weekend

Black Friday (the biggest sale of the year) was coming, so “just postpone it” was not a viable solution. The Frontend team was already filled with tasks of creating new features and fixing “reproducible” errors, so the memory leak issue was delegated to the Scaling team: a team that was more familiar with AWS infrastructure than with the internals of Node.js/JavaScript.

Brace yourself, memory leak hunter is coming

Slide from presenting bug hunting internally at home24 Demo night

Coming from other programming languages, the Scaling team assumed that unused memory would be released fairly soon. But even with a “Hello world” Express application, memory usage did not drop after the requests stopped. So we realized that JavaScript memory management is more complex than we anticipated. As a Scaling team we went through different debugging tools (e.g. Chrome inspector, clinic, netdata, New Relic) and different debugging methods (e.g. building locally, running production containers, comparing logs of similar services).
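That first “Hello world” observation is easy to reproduce; here is a rough sketch (endpoints and intervals invented), where the heap stays high for a while after traffic stops simply because V8 releases memory lazily, which is not a leak by itself:

const express = require('express');
const app = express();

app.get('/', (req, res) => {
  // Allocate something noticeable per request so the effect is visible.
  const page = '<html>' + 'x'.repeat(100000) + '</html>';
  res.send(page);
});

// Log heap usage every 5 seconds: after the load stops, heapUsed and rss
// do not drop back immediately, even though no requests are coming in.
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  console.log(
    `rss=${(rss / 1e6).toFixed(1)}MB ` +
    `heapUsed=${(heapUsed / 1e6).toFixed(1)}MB ` +
    `heapTotal=${(heapTotal / 1e6).toFixed(1)}MB`
  );
}, 5000);

app.listen(3000);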

As usual in this type of bug hunt, the best ideas came in the middle of the night. While trying to reproduce the issue locally, Marius saw that a lot of HTML strings were persisted in memory. He googled for similar issues in the JavaScript community and, guess what, there was a known memory leak in the popular axios library:

Workaround for axios and HTML content

It was used as a 3rd-level dependency, so rebuilding (minifying) all the upstream dependent services took some time. We kind of reproduced the issue locally and saw fewer retained strings via Chrome inspector, so we were already excited that we had managed to find the bug. Finally, we released the application with the fix and…

Release date of new service version, but no visible decrease in errors

there was no significant impact.

We were running out of options, so we had to accept defeat. To be honest, it was the Scaling team that was defeated, not the Frontend team.

But the story does not end here…

Meanwhile, a migration to Fargate

The task to migrate the Node.js service to new infrastructure

As part of the Scaling team's tasks, there was a migration from EC2-based ECS to Fargate-based ECS. From practice, we had seen that we were not good at choosing the right servers (EC2 instances) and the right auto-scaling parameters. Therefore, for a stateless Node.js application, it made sense to migrate.

To get the right container sizes for production load we used a technique called shadowed (mirrored) traffic:

Traffic shadowing

Traffic shadowing is a fairly common technique in cloud-native tools like Skipper or Istio. It is similar to load testing, but instead of describing the requests by hand, copies of real production requests are directed to a second instance of the service, as in the sketch below.
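We used off-the-shelf tooling for this, but conceptually the mirroring layer does something like the following hand-rolled Node.js sketch (host names and ports are invented; real users only ever see the primary's response):

const http = require('http');

// Illustrative targets only; in reality this is configured in the proxy/mesh.
const PRIMARY = { host: 'service.internal', port: 3000 };        // serves real users
const SHADOW = { host: 'service-shadow.internal', port: 3000 };  // copy under test

http.createServer((clientReq, clientRes) => {
  const options = (target) => ({
    ...target,
    path: clientReq.url,
    method: clientReq.method,
    headers: clientReq.headers,
  });

  // Real traffic: stream the primary's response back to the client.
  const primary = http.request(options(PRIMARY), (res) => {
    clientRes.writeHead(res.statusCode, res.headers);
    res.pipe(clientRes);
  });
  primary.on('error', () => clientRes.destroy());

  // Shadowed traffic: same request, but the response is drained and discarded,
  // so crashes or slowness of the shadow copy never reach real users.
  const shadow = http.request(options(SHADOW), (res) => res.resume());
  shadow.on('error', () => { /* the shadow must never break real traffic */ });

  clientReq.pipe(primary);
  clientReq.pipe(shadow);
}).listen(8080);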

Since our Node.js applications were stateless and the dependent services could handle 2x the traffic, the migration plan seemed perfect… until we saw the response time chart:

Response times more unstable on Fargate

But wait: why a response time chart, and how is it related to memory usage? It is not only the garbage collection calls that make the application slower. The load balancer also does not remove unhealthy containers instantly, so clients (services or browsers) keep waiting for valid connection close packets until the request times out.

It turned out that Fargate was starting all containers at the same time, so their memory was also filling up at the same time. To our disappointment, the memory leak was even more visible on Fargate than on EC2 instances.

Slide from presenting bug hunting internally at home24 Demo night

The Scaling team had already failed once at fixing the memory leak. If other similar Node.js apps also had a similar memory issue, the whole goal of migrating to Fargate would be at risk.

It seemed that the Scaling team had just bad luck, unless…

Introducing: Debugging on production

Connecting to a remote Node.js container via Chrome Inspector on a shadowed instance

This time the situation was different. We had an environment that did not affect real users but had the best test data possible (a mirror of all live traffic), so we could use all the heavy JavaScript debugging tools.

It turned out that the Chrome inspector (DevTools) added a lot of overhead: with Heap snapshot and Allocation instrumentation on timeline, the containers were crashing within seconds (visible as connection reset errors in Chrome). Connecting to “real production” would therefore have been catastrophic (or at least not professional).
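On the shadowed containers we could afford that risk. One way to make a container debuggable is to start Node.js with --inspect=0.0.0.0:9229; a programmatic variant (the environment variable name is invented) looks roughly like this:

// Guarded so the debugging port is only opened on shadowed containers,
// never on instances serving real users.
const inspector = require('inspector');

if (process.env.ENABLE_REMOTE_INSPECTOR === 'true') {
  // Listen on all interfaces so Chrome DevTools (chrome://inspect) can
  // attach to <container-ip>:9229 for heap snapshots and allocation sampling.
  inspector.open(9229, '0.0.0.0');
  console.log('V8 inspector available at', inspector.url());
}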

Chrome Inspector options for memory profiling

Of all the profiling types, only Allocation sampling survived the ~10 seconds until the container was marked as unhealthy (note: we were using container sizes that gave the same response times as the original ones). And it was enough to see the real issue: a docCache variable.

This setup also allowed us to rule out false positives fairly quickly (e.g. is it related to the Node.js version, the Apollo library version, hooks, caching configuration, etc.).

Slide from presenting bug hunting internally at home24 Demo night

Finally, we saw a happy ending in this debugging story…

The real cause: string interpolation instead of Apollo variables

After the assumption was tested against real user traffic, we could finally confirm the real cause of the issue:

Illustration of the issue and the fix

Of course, this is a simplified example. But when you are writing a lot of:

) @include(if: $withOptionalBackendParameters)

It gets tempting to replace those with some simple generated strings.
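A hedged, simplified sketch of the pattern (field and variable names are invented, not our real query): the docCache we saw in the profiler most likely belongs to graphql-tag, which caches every distinct query string it parses, so building the query text dynamically creates a new cached document for every distinct combination of values and the cache keeps growing:

const gql = require('graphql-tag');

// LEAKY (simplified): the query text changes depending on runtime values,
// so graphql-tag parses and caches a new document for each distinct string.
function buildProductQuery(withOptionalBackendParameters, extraFields = []) {
  return gql`
    query Products {
      products {
        id
        ${extraFields.join('\n        ')}
        price @include(if: ${withOptionalBackendParameters})
      }
    }
  `;
}

// FIXED: one static query string, runtime values passed as GraphQL variables.
// graphql-tag parses it once, so its cache stays at a single entry.
const PRODUCTS_QUERY = gql`
  query Products($withOptionalBackendParameters: Boolean!) {
    products {
      id
      price @include(if: $withOptionalBackendParameters)
    }
  }
`;

// Usage with Apollo Client (sketch):
// client.query({
//   query: PRODUCTS_QUERY,
//   variables: { withOptionalBackendParameters: true },
// });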

To summarize: at the start of debugging we had only generic 5xx errors and crashing containers; at the end, a confirmed root cause and a fix.

Lessons learned

We could say that the Scaling team learned some new JavaScript tricks, but personally, I see it from a wider perspective:

For complex problems, a cross-team effort brings more unique tools and perspectives

If not for the unique insights from multiple team members, I doubt I could have written a happy ending for this debugging story.

Therefore I want to publicly give big thanks to the whole Scaling and Frontend teams (not excluding Tomas, Marius, Džiugas, Danny, Olga, Robert, Karolis, Antonella), as well as many other colleagues from home24.
