The case of the missing free space

Aron Fyodor Asor
Apr 3, 2017 · 4 min read


Our workplace has a beefy, all-around server we named vader. Not only do our builds run on it, our content and data pipeline runs there too.

One day our builds on all branches started failing. When I SSHed into vader, I was greeted with No space left on device on login. Looks like our disk really is full! I need to remind myself to create a cron job that clears that out, I thought at the time.

I ran df -h to confirm that we were out of space:

aron@vader:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G  8.0K   16G   1% /dev
tmpfs           3.2G  1.6M  3.2G   1% /run
/dev/sda1       886G  411G  430G  49% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
tmpfs           1.0G  4.0K  1.0G   1% /tmp
none            5.0M     0  5.0M   0% /run/lock
none             16G  148K   16G   1% /run/shm
none            100M   32K  100M   1% /run/user
/dev/sdb1       5.5T  559G  4.6T  11% /data

We actually have a lot of free hard drive space! But why would it give that error? Could it be that a process is trying to create a really big file?

I ran touch test.txt to see if I could create any file at all; touch creates a 0-byte file in the current directory. The result:

touch: cannot touch 'test.txt': No space left on device

We couldn’t even create a 0-byte file. I started to suspect that the No space left on device error was misleading, and that we had actually run out of some other resource.

I remembered from OS class that filesystems record file metadata in internal entries called inodes. After googling around for how to find out how many inodes are free, I arrived at df -ih, which reports inode usage per filesystem in a human-readable format.

Running that command gives:

aron@vader:~$ df -ih
Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             4.0M   571  4.0M    1% /dev
tmpfs            4.0M   686  4.0M    1% /run
/dev/sda1         57M   57M    0M  100% /
none             4.0M    14  4.0M    1% /sys/fs/cgroup
tmpfs            4.0M     7  4.0M    1% /tmp
none             4.0M     3  4.0M    1% /run/lock
none             4.0M    10  4.0M    1% /run/shm
none             4.0M    18  4.0M    1% /run/user
/dev/sdb1        175M  2.5M  173M    2% /data

We had indeed run out of inodes on sda1, the root partition! I googled how to increase the inode count, but it turns out it can't be changed on an existing ext4 filesystem (which is what vader uses).
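For the curious, tune2fs can show how many inodes an ext4 filesystem was created with (a read-only query, sketched here against vader's root disk):

sudo tune2fs -l /dev/sda1 | grep -i 'inode count'   # prints the fixed inode count chosen at mkfs time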

I had no choice but to find the files and reduce the number of inodes we were using.

Let’s review the facts to narrow our search:

  1. Since the disk space isn't full yet, these files can't be very big; all of them together fit inside the 411 GB currently in use.
  2. However, the root partition has 57 million inodes in total, and we had consumed every single one of them.

What we’re looking for, then, are folders with a lot of small files.

I tried using find / -xdev to list all the files, planning to count them from there. The -xdev flag keeps find from crossing mountpoints, so it only scans folders on the same filesystem (/dev/sda1 in this case).
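One way to turn that listing into something actionable is to count files per directory; roughly along these lines (a sketch, not necessarily the exact pipeline I used):

# count regular files per containing directory on the root filesystem, busiest first
sudo find / -xdev -type f -printf '%h\n' | sort | uniq -c | sort -rn | head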

But that didn’t give me a useful lead: the top result was dpkg's package info directory (/var/lib/dpkg/info), where all the package metadata is stored.

In parallel, I was looking into how to increase the root filesystem’s inode count. There were a couple of options on the filesystem side:

  • Reformat to Btrfs: Btrfs is Linux's answer to the much-lauded ZFS filesystem from Solaris. Not only does it have cool features like snapshots, checksums and online volume growth, it also allocates inodes dynamically, so the inode table grows with the number of files. Btrfs can also convert an existing ext4 filesystem in place without losing any data.
  • Reformat to ext4 with more inodes: ext4's inode count is fixed when the filesystem is created and can't be changed afterwards. To go this route, I'd have to copy all the important data to another partition (hard to pin down on a shared server), recreate the root filesystem with a denser inode table, and copy the data back (see the sketch after this list).
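For illustration only (the device name is a placeholder, and both commands rewrite or wipe a filesystem, so treat this as a sketch rather than a runbook), the two options roughly correspond to:

# option 1: convert an existing, unmounted ext4 filesystem to btrfs in place
sudo btrfs-convert /dev/sdXn

# option 2: recreate ext4 with one inode per 8 KiB instead of the default 16 KiB,
# roughly doubling the inode count (this wipes the filesystem)
sudo mkfs.ext4 -i 8192 /dev/sdXn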

The first option was more palatable, since I wouldn't need to figure out what was important and what wasn't. It still carries the risks inherent to converting a filesystem, though, even when it's done in place.

As I resigned myself to the risky move of converting the root partition to Btrfs, I brought the problem up in our Slack channel. Jamie (one of the awesome devs) took a look at it, and after some time told me to look at /var/lib/docker.

And true enough, running find /var/lib/docker -xdev | wc -l took a long time and returned a count of around 56 million files! That's nearly all of our 57 million inodes right there.

I asked him how he figured it out, and he said he just ran sudo find / and noticed that it spent a really long time inside the /var/lib/docker folder.

After identifying Docker’s leftover containers as the problem, I simply ran docker container prune -f to remove all stopped containers.

The results:

Total reclaimed space: 192.6 GB

aron@vader:~$ df -ih
Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             4.0M   573  4.0M    1% /dev
tmpfs            4.0M   705  4.0M    1% /run
/dev/sda1         57M  1.3M   55M    3% /
none             4.0M    14  4.0M    1% /sys/fs/cgroup
none             4.0M     3  4.0M    1% /run/lock
none             4.0M    10  4.0M    1% /run/shm
none             4.0M    18  4.0M    1% /run/user
/dev/sdb1        175M  2.5M  173M    2% /data
aron@vader:~$

Yey! We’re back to less than 3% of our inodes consumed.
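Going forward, docker system df (available since Docker 1.13) gives a quick view of what Docker is holding on to before it gets this bad:

docker system df      # space used by images, containers and local volumes
docker system df -v   # verbose per-image / per-container breakdown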

Lessons learned:

  1. Docker can generate A LOT of files if left unchecked. We either need a cron job that garbage-collects the containers left behind after builds (a minimal sketch follows this list), or we need to optimize our Dockerfiles to reuse as many cached intermediate layers as possible.
  2. This was the first time I’ve ever dealt with running out of inodes. The telltale symptom is a No space left on device error while df -h still shows plenty of free space.
  3. You don’t always need fancy tools to identify problems. Sometimes simple tools and 👀 are all you need.
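As a sketch of the cron option from the first lesson (the schedule and file path are assumptions, adjust to taste):

# /etc/cron.d/docker-gc: prune stopped containers and dangling images nightly at 03:00
0 3 * * * root docker container prune -f && docker image prune -f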
