How to debug ENOSPC on AWS Lambda?

Yury Michurin
Published in Wix Engineering
Feb 25, 2020 · 5 min read
Failed to get b1f25e2.tgz: ENOSPC: no space left on device, write

My journey started with the error above on our AWS Lambda function, which we use to run ~900 end-to-end test suites in parallel in 5 minutes (that’s a story for another time though).

Where did the disk space go? How can we figure that out?

Broken HDD, Markus Spiske

First, let’s take a step back and refresh our memory on what AWS Lambda is and how it works.

So what is AWS Lambda?

Lambda is a “serverless” technology offered by Amazon as part of their cloud services. I won’t go too deeply into it, but essentially it’s a container based on the 64-bit Amazon Linux AMI, running a runtime, in our case node.js. The container runs a predefined function with an agreed-upon interface, and you, the developer, are billed by the amount of time and memory it takes to run the function.

Side by side running lanes, Tim Gouw

When we invoke a serverless function, there are typically two start types: “Cold” (more here) and “Warm”. For the purposes of this post, a cold start is when a container starts from scratch, while a “warm” start is when a container has already started and executed the function before, and just needs to be “unfrozen” and passed the new request.

So, in our case, the “warm” AWS Lambda function container was running out of space after a while. But why? We have cleanup code that wipes the /tmp directory after we finish executing our logic.

So how can it be we’re running out of space?

First, since this is a problem that’s hard to reproduce locally, I wanted to add some more information to our log output from the Lambda. The first thing I added was a simple df printout.
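Something along these lines does the trick (a minimal sketch using Node’s built-in child_process; the -T flag just adds the filesystem type column you can see in the output below):

const { execSync } = require('child_process');

// Print disk usage on every invocation; console.log ends up in CloudWatch.
console.log(execSync('df -hT').toString());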

And that’s the actual output I got:

Filesystem Type Size Used Avail Use% Mounted on
/dev/root ext4 7.9G 5.4G 2.5G 69% /
/dev/vdb ext4 1.5G 19M 1.4G 2% /dev
/dev/vdd ext4 526M 515M 0 100% /tmp
/dev/vdc squashfs 86M 86M 0 100% /opt

Indeed, df claims /tmp is out of space! But what’s taking all this space? We need to look deeper!

A deeper dive

The next obvious step (for me) was adding output from du, to figure out which files had been left behind and were taking up all this space.
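Again, roughly like this (a sketch; du -sh prints the total size of everything under /tmp):

const { execSync } = require('child_process');

// Summarize how much space the files under /tmp actually take.
console.log(execSync('du -sh /tmp').toString());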

To my surprise, at the very same time that df was reporting 100% of the 512MB /tmp taken, du reported:
194M /tmp

How can that be, I wondered? Maybe AWS is using some file system that retains journaling or metadata even after files are deleted? Maybe it’s an issue with AWS Lambda? Maybe there were files hidden from du that were taking up the space?

I was stuck! I turned to Google, our saviour, for rescue, and indeed, I was not disappointed!
I found this serverfault answer in which KHobbits suggested that deleted files in Linux still take up space as long as some process is holding an open FD (file descriptor) to them.

Let’s confirm that by looking at the unlink man page; unlink() is what gets called by rm (and other delete operations):

unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed.

(some more info can be found in this stackoverflow answer)
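It’s easy to see this behaviour for ourselves with a few lines of node.js (a toy demonstration, not code from our function; the file name is made up):

const fs = require('fs');
const { execSync } = require('child_process');

// Create a ~100 MB file, then unlink it while keeping its descriptor open.
const fd = fs.openSync('/tmp/space-eater', 'w');
fs.writeSync(fd, Buffer.alloc(100 * 1024 * 1024));
fs.unlinkSync('/tmp/space-eater');

// du no longer sees the file, but df still counts its blocks,
// because our process is holding an open descriptor to it.
console.log(execSync('du -sh /tmp').toString());
console.log(execSync('df -h /tmp').toString());

fs.closeSync(fd); // only now is the space actually released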

That’s great, sounds like I found the issue! But wait, how do I know which process is actually holding the file descriptor to those files?

KHobbits, in his serverfault answer, also suggested a way to figure that out, using the (great) lsof tool:

lsof | grep "/tmp" | grep deleted

Sounds easy, let’s run it inside the Lambda. Luckily, we can even try that locally via a Docker image.
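For example, with an image that mimics the Lambda environment (lambci/lambda was the common choice at the time; pick the tag that matches your runtime):

docker run --rm -it --entrypoint /bin/sh lambci/lambda:nodejs12.x
# and, inside the container:
lsof | grep "/tmp" | grep deleted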

Oh no! AWS Lambda images don’t have lsof installed! What can we do? There are two options:

  1. Provide a statically built lsof binary via the function bundle or via a layer.
  2. Find an alternative way to determine that, maybe do what lsof does.

I’ve opted for the latter, and relied on the proc filesystem (which is probably what lsof uses anyway). From its manual:

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc. Typically, it is mounted automatically by the system…

Procfs to the rescue

In a nutshell, this means that we can access various pieces of kernel information just by reading a file, and it’s pretty easy to read files in node.js.

For example, we can see various information about the system’s CPUs with:

cat /proc/cpuinfo

Or, for information about some process, for example its memory usage, we can look at:

cat /proc/1/status
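And since these are just files, reading them from node.js is trivial (a one-line sketch):

const fs = require('fs');

// /proc entries read like ordinary text files.
console.log(fs.readFileSync('/proc/1/status', 'utf8'));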

What interests us, though, is which process is holding the FDs of the deleted files.

I started by adding a simple printout of all open FDs whenever I got ENOSPC.

Getting a process’ FDs is just a matter of going over the directory /proc/<pid>/fd; the full code is here.
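Stripped down, that printout looks roughly like this (a sketch, not the exact code behind the link; sizes are printed in raw bytes here):

const fs = require('fs');

// Every numeric directory under /proc is a running process.
for (const pid of fs.readdirSync('/proc').filter((d) => /^\d+$/.test(d))) {
  let fds;
  try {
    fds = fs.readdirSync(`/proc/${pid}/fd`);
  } catch (e) {
    continue; // the process exited, or we are not allowed to look at it
  }

  let total = 0;
  for (const fd of fds) {
    const fdPath = `/proc/${pid}/fd/${fd}`;
    try {
      const target = fs.readlinkSync(fdPath); // e.g. "/tmp/foo (deleted)"
      const { size } = fs.statSync(fdPath);   // stat works even if the file was deleted
      total += size;
      if (target.includes('(deleted)')) {
        console.log(`fd: ${target} size: ${size} bytes`);
      }
    } catch (e) {
      // the fd was closed between readdir and stat; ignore it
    }
  }

  let exe = '?';
  try {
    exe = fs.readlinkSync(`/proc/${pid}/exe`);
  } catch (e) {
    // kernel thread or insufficient permissions
  }
  console.log(`PID: ${pid} EXE: ${exe} TOTAL FDs size: ${total} bytes`);
}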

After printing that out, the culprit was obvious:

fd: /tmp/.org.chromium.Chromium.mdHCbI (deleted) size: 16 MB
fd: /tmp/.org.chromium.Chromium.OYSZ4y (deleted) size: 144 Bytes
fd: /tmp/.org.chromium.Chromium.8ZPGcz (deleted) size: 1 MB

fd: /tmp/.org.chromium.Chromium.ew8wLC (deleted) size: 24 MB
fd: /tmp/.org.chromium.Chromium.OYSZ4y (deleted) size: 144 Bytes
PID: 244 EXE: /tmp/chromium TOTAL FDs size: 604.82 MB

Chromium’s process was holding references to the files even though we had deleted them. It now became obvious that there was a race condition between extracting new files on a new run of the function and the chromium process from the previous run exiting. That’s why the problem wasn’t always present: we had to hit the race condition AND use enough disk space for it to matter.

Fixing that is pretty straightforward: kill all the processes (except the function’s node.js process, of course), then delete all the files, based on the code here.

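A simplified sketch of that cleanup (my approximation in plain node.js, not the exact linked code):

const fs = require('fs');
const path = require('path');

// Kill every process except our own node.js process (and init), so nothing
// keeps deleted files alive through open file descriptors.
function killOtherProcesses() {
  const pids = fs.readdirSync('/proc')
    .filter((d) => /^\d+$/.test(d))
    .map(Number);
  for (const pid of pids) {
    if (pid === process.pid || pid === 1) continue;
    try {
      process.kill(pid, 'SIGKILL');
    } catch (e) {
      // already gone, or not ours to kill
    }
  }
}

// Wipe everything under /tmp, including dot-files such as /tmp/.org.chromium.*
function wipeTmp() {
  for (const entry of fs.readdirSync('/tmp')) {
    const target = path.join('/tmp', entry);
    try {
      if (fs.lstatSync(target).isDirectory()) {
        fs.rmdirSync(target, { recursive: true });
      } else {
        fs.unlinkSync(target);
      }
    } catch (e) {
      // best effort: an entry may disappear while we iterate
    }
  }
}

// Run the cleanup before extracting new files, so the space is actually free.
killOtherProcesses();
wipeTmp();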

We already had similar code, but it was running after the file extraction, so we simply moved it to run before the extraction of the new files, to ensure the disk space is actually cleared.

My takeaways

  • The proc filesystem is a great zero-dependency tool that can help a lot with debugging issues; keep it in mind.
  • Sometimes it’s easier to go to the source (procfs) instead of using the abstraction (lsof).
  • AWS Lambda, or “serverless” in general, is, like all abstractions, leaky, and requires some knowledge of its inner workings.
