From symptom to root cause — Hunting down a kernel bug
Who doesn’t know this situation: Sometime something weird is going on in your setup, but everything is so vague that you don’t know where to start. This is a story about a bug we experienced in a production setup and how we found out what was the root cause.
Imagine a ECS cluster with multiple tasks running with an NFS (EFS in this case) backed persistent root volume. Launching new tasks works just fine but as soon you are doing a cluster upgrade complete hosts seem to be unable to start tasks or can only start a few and then break. But also some hosts just work fine. So scaling up or down your cluster is a really scary thing that could not be done anymore without manual intervention. That’s an unacceptable situation, but what to do?
Want to know how we handle the NFS mount? My colleague Philipp wrote a blog post on that.
Only some hosts are affected
During the upgrades we noticed that some hosts didn’t have any issues with launching tasks. But what was the common pattern? Number of tasks, availability zone, something else? As many host instances in
eu-west-1c were affected the AZ theory was the first to verify. Things that could go wrong with EFS in an AZ are:
- EFS mountpoint not there
- Wrong security group attached to mountpoint
- per subnet routing in VPC gone wrong
- routing / subnet config on the instances
- iptables on instance misconfigured
- Docker interfaces catches NFS traffic
I checked all of them but everything was configured correctly. However, when I logged on to a host and tried to mount one of the EFS volumes manually it just hung forever.
Could it be something else network related that I did not think of? UDP vs TCP in NFS?
tcpdump -v port not 443 -i any -s 65535 -w dump.pcap excludes stuff from SSM Agents that are running on the ECS instance and should really catch all traffic on all interfaces. Starting tcpdump, run the mount command manually, stop tcpdump, pcap file on s3, fetch file to analyse locally with wireshark. Digging through the whole dump I was unable to find any package to the EFS Mountpoint. Did I do something wrong? But also after the second try there was not a single package related to NFS.
The only possible conclusion here: the system does not even try to mount, but why?
Stracing things down
ps shows many other hanging NFS mounts triggered by ECS / Docker. But why are they hanging?
strace time! Let's see what is going on
strace -f mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $EFSID.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$EFSID &
We see that everything hangs on the
mount syscall. And that for even longer than 10 minutes, which should have hit already every timeout. (In this case actually the latest timeout should have triggered after two minutes (600/10*2))
As I noticed before, there were many other hanging NFS mounts, so let’s kill them off and try again. Now the mount runs perfectly fine! Are we hitting an AWS service limit here? Is there some race condition blocking out each other? According to AWS docs, there is not such an service limit, so let’s stick with the deadlock theory and try to build a test case.
The test case above works perfectly fine. But maybe Docker is mounting the volume in a different way, so it might be that the test case is not accurate? Let’s have a look on what is happening within Docker:
Docker CE. Contribute to docker/docker-ce development by creating an account on GitHub.github.com
Moby Project - a collaborative project for the container ecosystem to assemble container-based systems - moby/mobygithub.com
All of this looks quite sane. There is some fiddling with
addr option, but we also tested mounting by IP and had the same issue.
Stepping back to our initial problem: The issues only occur on a cluster update where all tasks are re-scheduled to other instances. What if that happens not sequentially but in parallel? Our simple test is just running in a loop, so one mount command runs after another.
Now we are backgrounding the
mount command. So we do not wait anymore for the command to complete, but immediately start the next one. Here we go: now after a few iterations things seem to block and mounting stops to work. We now have a test case that we can use to verify if we fixed the issue.
We now know that the issue has to be in the kernel and have a way to reproduce it, but I have no experience in tracing actually what is going on in the kernel here. So Google has to help out and after searching for
nfs mount race condition deadlock rhel it returns a bug report.
The issue described in the report looks very similar to what we experience and one of the comments mentions that this behaviour was introduced with NFS 4.1.
So let’s change our snippet to use NFS 4.0 and see if everything is working now as expected:
Awesome. Our “quite complex to debug” issue was easy to solve. We changed the mount options in our ECS cluster and everything is running smoothly now.
The current Kernel of the AWS ECS optimized AMI doesn’t seem to contain that bug fix at the time of writing.