From symptom to root cause — Hunting down a kernel bug

Andi Sieferlinger
Oct 23, 2018 · 4 min read
copyright: pixabay

Who doesn’t know this situation: Sometime something weird is going on in your setup, but everything is so vague that you don’t know where to start. This is a story about a bug we experienced in a production setup and how we found out what was the root cause.

The problem

Imagine a ECS cluster with multiple tasks running with an NFS (EFS in this case) backed persistent root volume. Launching new tasks works just fine but as soon you are doing a cluster upgrade complete hosts seem to be unable to start tasks or can only start a few and then break. But also some hosts just work fine. So scaling up or down your cluster is a really scary thing that could not be done anymore without manual intervention. That’s an unacceptable situation, but what to do?

Want to know how we handle the NFS mount? My colleague Philipp wrote a blog post on that.

Only some hosts are affected

During the upgrades we noticed that some hosts didn’t have any issues with launching tasks. But what was the common pattern? Number of tasks, availability zone, something else? As many host instances in eu-west-1c were affected the AZ theory was the first to verify. Things that could go wrong with EFS in an AZ are:

  • EFS mountpoint not there
  • Wrong security group attached to mountpoint
  • per subnet routing in VPC gone wrong
  • routing / subnet config on the instances
  • iptables on instance misconfigured
  • Docker interfaces catches NFS traffic

I checked all of them but everything was configured correctly. However, when I logged on to a host and tried to mount one of the EFS volumes manually it just hung forever.

Could it be something else network related that I did not think of? UDP vs TCP in NFS?

It’s tcpdump time!

tcpdump -v port not 443 -i any -s 65535 -w dump.pcap excludes stuff from SSM Agents that are running on the ECS instance and should really catch all traffic on all interfaces. Starting tcpdump, run the mount command manually, stop tcpdump, pcap file on s3, fetch file to analyse locally with wireshark. Digging through the whole dump I was unable to find any package to the EFS Mountpoint. Did I do something wrong? But also after the second try there was not a single package related to NFS.

The only possible conclusion here: the system does not even try to mount, but why?

Stracing things down

ps shows many other hanging NFS mounts triggered by ECS / Docker. But why are they hanging?

It’s strace time! Let's see what is going on strace -f mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $ /opt/debugmount/$EFSID &

We see that everything hangs on the mount syscall. And that for even longer than 10 minutes, which should have hit already every timeout. (In this case actually the latest timeout should have triggered after two minutes (600/10*2))

As I noticed before, there were many other hanging NFS mounts, so let’s kill them off and try again. Now the mount runs perfectly fine! Are we hitting an AWS service limit here? Is there some race condition blocking out each other? According to AWS docs, there is not such an service limit, so let’s stick with the deadlock theory and try to build a test case.

The test case above works perfectly fine. But maybe Docker is mounting the volume in a different way, so it might be that the test case is not accurate? Let’s have a look on what is happening within Docker:

All of this looks quite sane. There is some fiddling with addr option, but we also tested mounting by IP and had the same issue.

Stepping back to our initial problem: The issues only occur on a cluster update where all tasks are re-scheduled to other instances. What if that happens not sequentially but in parallel? Our simple test is just running in a loop, so one mount command runs after another.

Now we are backgrounding the mount command. So we do not wait anymore for the command to complete, but immediately start the next one. Here we go: now after a few iterations things seem to block and mounting stops to work. We now have a test case that we can use to verify if we fixed the issue.

and now

We now know that the issue has to be in the kernel and have a way to reproduce it, but I have no experience in tracing actually what is going on in the kernel here. So Google has to help out and after searching for nfs mount race condition deadlock rhel it returns a bug report.

The issue described in the report looks very similar to what we experience and one of the comments mentions that this behaviour was introduced with NFS 4.1.

So let’s change our snippet to use NFS 4.0 and see if everything is working now as expected:

Awesome. Our “quite complex to debug” issue was easy to solve. We changed the mount options in our ECS cluster and everything is running smoothly now.

The current Kernel of the AWS ECS optimized AMI doesn’t seem to contain that bug fix at the time of writing.

Scout24 Engineering

All about engineering @Scout24

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store