Demist the Memory Ghost

Bob Zhou
7 min read · Jan 27, 2020


It has been a few months since the last long hours spent on the memory of the NodeJS application server. I didn’t expect the story to continue, and hoped it wouldn’t. However, ignorance sticks with you forever unless you make an effort to drive it away. The old memory problem came back.

The following memory curve, reported by ECS, had been repeating for the past few months. There had never been any performance degradation or automatic restart. The memory jumps to about 80% and becomes stable. After stabilization, during peak time, it jumps higher, drops back a little, then jumps high again. I thought it was just the GC of NodeJS working lazily. Since there was no impact on the live system, no more time was spent on it.

Growing memory, then steady

Until one release, when it started to look like this. The memory went up much faster, reached almost 100%, and then dropped back to just 30%. (The very low points were a few releases.) This got on my nerves! There must be something WRONG with my release!

Inspect the release

The first thing to do, of course, was to see what’s wrong with my code. The only change was adding more lines of logs and a refactor to use more res.locals from ExpressJS. I had landed back on the console.log issue from my memory debugging months ago!

Going back to refresh my knowledge of the V8 GC didn’t help much in explaining the soaring memory usage at all. Could the retention of more variables in res.locals, which are kept alive for longer (like 2–3 seconds), delay garbage collection? Or did the extra logs, written as a stream to the console and then captured by Docker’s awslogs driver, leak?

Reading more articles, Stack Overflow answers, and explanations of how the V8 GC works didn’t bring any luck. I stress tested the dev server, and the memory curve couldn’t have been more normal, as the following shows:

After a spike during the stress test, it goes back to normal.

It left me no choice but to test the log leak on live.

Well, the memory curve didn’t change at all after switching the log driver of the Docker image to none, so obviously it couldn’t be a log driver issue. Could it be a console.log problem? If you search for “console memory leak”, you will find a number of GitHub issues. So I simply changed console.log to an in-memory list of log strings, which is uploaded to a CloudWatch log stream every 2 seconds and then cleared.
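A minimal sketch of that buffered logger, assuming the aws-sdk v2 CloudWatchLogs client; the region, log group, and stream names are hypothetical, not the original code:

    // buffered-logger.js — a sketch, not the original implementation.
    const AWS = require('aws-sdk');
    const cwLogs = new AWS.CloudWatchLogs({ region: 'us-east-1' }); // assumed region

    const buffer = [];
    let sequenceToken; // CloudWatch requires the token from the previous put

    // Used in place of console.log: push into an in-memory list of strings
    function log(message) {
      buffer.push({ message: String(message), timestamp: Date.now() });
    }

    // Every 2 seconds, flush the list to a CloudWatch log stream and clear it
    setInterval(() => {
      if (buffer.length === 0) return;
      const logEvents = buffer.splice(0, buffer.length);
      cwLogs.putLogEvents({
        logGroupName: '/app/node-server', // hypothetical
        logStreamName: 'instance-1',      // hypothetical
        logEvents,
        sequenceToken,
      }).promise()
        .then((res) => { sequenceToken = res.nextSequenceToken; })
        .catch(console.error);
    }, 2000);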

Nothing changed.

The final test was to add custom reporting of memory usage from the Node app itself: I started to send the rss reported by process.memoryUsage() to CloudWatch. The magic happened. The RSS memory stayed stable all the time. There is no leak at all!

The yellow line is the rss reported from Node, while the blue line is the docker stats number reported by the ECS agent
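The custom metric itself is simple enough. A minimal sketch, assuming the aws-sdk v2 CloudWatch client; the namespace and metric name are hypothetical:

    // report-rss.js — a sketch; namespace and metric name are assumptions.
    const AWS = require('aws-sdk');
    const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' }); // assumed region

    // Periodically push the resident set size of this Node process as a
    // custom CloudWatch metric, so it can be compared with the docker stats
    // number reported by the ECS agent.
    setInterval(() => {
      const { rss } = process.memoryUsage();
      cloudwatch.putMetricData({
        Namespace: 'NodeApp/Memory', // hypothetical
        MetricData: [{ MetricName: 'rss', Unit: 'Bytes', Value: rss }],
      }).promise().catch(console.error);
    }, 30000);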

Debugging Docker

One day I realized that on the same day as the release (a few hours later), I had upgraded the AMI for the live server, which upgraded Docker from 18.06 to 18.09.9, and probably the kernel too. This, together with the previous experiments, pointed me towards the question of why Docker reports different memory than NodeJS. As the concept of a “leak” was already primed, I started to tirelessly search for Docker-related leaks. There are many, but all turned out to be stale, closed GitHub issues about really old Docker versions. The one used by ECS is 18.09.9-ce, which is pretty recent, and the reported bugs had all been fixed in prior versions. Dead end.

How interesting it is to learn how docker stats works! To report memory correctly, Docker uses the memory reported by the cgroup. Only recently was it fixed so that the correct number is the cgroup memory usage minus the cache. To understand this, there are a few concepts of how Docker works (see the sketch after this list):

  • Docker uses control groups (cgroups), a feature offered by the Linux kernel, to isolate and limit resources. A cgroup is set up when you pass CPU or memory limits to docker run.
  • In the case of the cgroup memory controller, memory is restricted by soft/hard limits. The usage is also reported in a number of places: memory.stat, memory.usage_in_bytes, and a number of memory.kmem files (the kernel memory).
  • Linux caches file reads and writes for better performance. This is what the buffer/cache numbers account for when you run the free command. The purpose is to keep file content in memory so the next read doesn’t have to hit the disk. When applications need more memory, Linux simply releases the cache for them.
  • Docker reports memory.usage_in_bytes minus the cache reported in memory.stat.
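To make the calculation concrete, here is a sketch that approximates the docker stats number by reading the cgroup files directly. It assumes cgroup v1 mounted at the usual /sys/fs/cgroup/memory path; the exact subtraction in docker stats has varied between versions, so treat it as an approximation:

    // docker-memory.js — approximate the docker stats "used" memory.
    const fs = require('fs');
    const CGROUP = '/sys/fs/cgroup/memory';

    // Read a single-number cgroup file, e.g. memory.usage_in_bytes
    const readBytes = (file) =>
      parseInt(fs.readFileSync(`${CGROUP}/${file}`, 'utf8'), 10);

    // memory.stat is a list of "key value" lines; pick one key out of it
    const readStat = (key) => {
      const line = fs.readFileSync(`${CGROUP}/memory.stat`, 'utf8')
        .split('\n')
        .find((l) => l.startsWith(`${key} `));
      return line ? parseInt(line.split(' ')[1], 10) : 0;
    };

    const usage = readBytes('memory.usage_in_bytes');
    const cache = readStat('cache');

    console.log('usage_in_bytes:', usage);
    console.log('cache         :', cache);
    console.log('reported used :', usage - cache); // roughly what docker stats shows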

SSHing into the servers and observing the memory growth needs patience. After a number of logging sessions across a few days, the facts revealed themselves:

  • By the top command, the Node applications have no leak. The memory RSS is stable all the time.
  • By the free -m command, the available memory is stable all the time.

This led me to pay more attention to Docker’s behaviour. What was observed:

  • memory.usage_in_bytes keeps growing
  • cache in memory.stat is negligible, around 4096 bytes
  • rss in memory.stat plus cache doesn’t add up to memory.usage_in_bytes

What! Wait! How could that happen! … Yet it led to another night of searching and wondering.

Until I started to pay more attention to memory.kmem, the kernel memory usage.

Finally, Kernel

I never thought I would do anything with the Linux kernel in my life. I took it for granted that it is blessed by Linus Torvalds and other geniuses, so I just need to write application code. Docker/containers, too, are another piece of art that is taken care of for me.

Still, they are taken care of very well, but there is no free lunch. Anyone who wants to play with more fundamental technology has to understand the tech.

What I observed is that the kernel memory, memory.kmem.usage_in_bytes, is particularly high and growing at exactly the same speed as the cgroup’s total memory usage. In other words, the growth of the cgroup’s total memory comes from the growth of this cgroup’s kernel memory!
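A sketch of the check behind that observation, again assuming cgroup v1 paths; the gap between usage_in_bytes and rss + cache tracks the kernel memory counter:

    // kmem-check.js — show that the unaccounted memory is kernel memory.
    const fs = require('fs');
    const CGROUP = '/sys/fs/cgroup/memory';

    const readBytes = (file) =>
      parseInt(fs.readFileSync(`${CGROUP}/${file}`, 'utf8'), 10);
    const readStat = (key) => {
      const line = fs.readFileSync(`${CGROUP}/memory.stat`, 'utf8')
        .split('\n')
        .find((l) => l.startsWith(`${key} `));
      return line ? parseInt(line.split(' ')[1], 10) : 0;
    };

    const usage = readBytes('memory.usage_in_bytes');
    const kmem = readBytes('memory.kmem.usage_in_bytes');
    const rss = readStat('rss');
    const cache = readStat('cache');

    console.log('rss + cache        :', rss + cache);
    console.log('usage_in_bytes     :', usage);
    console.log('unaccounted gap    :', usage - rss - cache);
    console.log('kmem.usage_in_bytes:', kmem); // grows in step with the gap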

But why did the kernel memory consumption grow? Is there a leak in the kernel? (That question led me to a number of issues and posts without good UIs, like the Linux bug reports.)

After the painful search, I came across a way to empty the memory caches from the host: echo 3 | sudo tee /proc/sys/vm/drop_caches (in fact, echo 2 is enough). It is non-destructive, so I just ran it on the live server.

Both the kernel and app memory went back to where they should be…

The Whole Story Retold

  1. The NodeJS application creates and deletes files for each request it handles, using ffmpeg for audio conversion (see the sketch after this list). Every file creation adds an entry to the dentry cache; a dentry (directory entry) is a Linux struct holding directory information.
  2. dentries are cached in kernel memory, in something called the slab, which is yet another way the OS makes more effective use of memory. However, unlike the file cache, which is cleaned when the file is deleted, the dentry cache is NOT cleaned, because file deletion might not always succeed.
  3. With hundreds of thousands of calls per day, the dentry cache quickly fills kernel memory, which is reflected in the cgroup memory. That amount is not reflected in the cache counter (which could be something to improve), so it is never subtracted from the reported memory usage.
  4. When the cgroup detects memory pressure (in my case, a 512 MB soft limit), it triggers proper reclaiming of kernel memory, and the number plummets back to where it started.
  5. So the conclusion is that the frequent creation of new files caused all this, and it is harmless after all.
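For context, the per-request conversion looks roughly like the sketch below; the temp file names and the mp3 output are assumptions for illustration, not the original code. Each request writes and later deletes a file, and every new file adds a dentry that stays cached in kernel memory.

    // convert.js — per-request conversion through temp files, the pattern
    // that keeps adding dentry cache entries.
    const { execFile } = require('child_process');
    const fs = require('fs').promises;
    const os = require('os');
    const path = require('path');
    const crypto = require('crypto');

    async function convert(inputBuffer) {
      const id = crypto.randomBytes(8).toString('hex');
      const inPath = path.join(os.tmpdir(), `${id}.m4a`); // hypothetical names
      const outPath = path.join(os.tmpdir(), `${id}.mp3`);
      try {
        await fs.writeFile(inPath, inputBuffer);          // new file, new dentry
        await new Promise((resolve, reject) => {
          execFile('ffmpeg', ['-y', '-i', inPath, outPath], (err) =>
            err ? reject(err) : resolve());
        });
        return await fs.readFile(outPath);
      } finally {
        // Deleting the files frees the page cache, but the dentries stay cached
        await fs.unlink(inPath).catch(() => {});
        await fs.unlink(outPath).catch(() => {});
      }
    }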

However, it still looks ugly. What options do we have?

  • Clean the memory periodically. This is even uglier.
  • Do not create new files with ffmpeg. It is possible to handle the audio entirely with pipes and get the output as a buffer (see the sketch below). I almost finished the code refactor, then failed the tests on mov/m4a-like formats. The pipe doesn’t work for them because all the data needs to be accessible for the conversion. Newer ffmpeg has movflags for faststart, which puts the necessary information at the beginning of the file, but I am not yet ready to compile a newer ffmpeg from source, as Debian’s package is still old. Using qt-faststart is another option, but it doesn’t support pipes yet…
  • Live with the existing memory metrics, and report the real memory usage with custom CloudWatch metrics, because you still need to watch memory usage in case there is a REAL leak in the future. This is what I will do.
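For formats that can be streamed, the pipe-based refactor mentioned above looks roughly like this sketch (mp3 in and out assumed); it is exactly this approach that fails for mov/m4a inputs, whose metadata sits at the end of the file:

    // convert-pipe.js — conversion through pipes, no temp files created.
    const { spawn } = require('child_process');

    function convertPipe(inputBuffer) {
      return new Promise((resolve, reject) => {
        const ff = spawn('ffmpeg', [
          '-i', 'pipe:0', // read the input from stdin
          '-f', 'mp3',    // output format must be explicit when piping
          'pipe:1',       // write the output to stdout
        ]);
        const chunks = [];
        ff.stdout.on('data', (chunk) => chunks.push(chunk));
        ff.on('error', reject);
        ff.on('close', (code) =>
          code === 0
            ? resolve(Buffer.concat(chunks))
            : reject(new Error(`ffmpeg exited with code ${code}`)));
        ff.stdin.end(inputBuffer);
      });
    }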
