Too many open files: Tracking down a bug in production

This article describes the process of tracking down a production bug in our primary web service here at Room Key. It assumes some familiarity with Clojure, the JVM, Amazon Web Services, and Linux, but links will be provided for more information on certain topics.

The Problem

Our web service is a Clojure uberwar served by Tomcat and deployed across several AWS instances using Amazon’s Elastic Load Balancing (ELB) and Auto Scaling services. A few months back we noticed that after a few days of normal operation, an instance would stop responding to requests with the error java.net.SocketException: Too many open files. This means our program tried to open a network socket but had already exceeded the open files limit imposed by the OS. There are several ways to configure this limit, such as limits.conf or ulimit.

We need to determine why the process is hitting this limit: is a spike in traffic legitimately driving up the number of open descriptors, or is a bug causing our program to leave files open after it is done with them? On Unix-like systems, everything is a file, so we have to consider every place we perform I/O: reading a file from the filesystem, accepting an incoming web connection, performing an outgoing web request, or running a process in the background and reading its output. Reviewing every I/O operation in our code would take a while, and even then it might not be obvious where (or when) the bug occurs, so it would be ideal to capture the system in its malfunctioning state.

With our AWS ELB/Auto Scaling setup, an unhealthy instance is removed from service and terminated almost immediately after failing a health check, so there is no chance to inspect it after the fact. It would be ideal if we could monitor the count of open files our process is using and receive a notification when it crosses some threshold. Let’s examine how we might do that.

Metrics

We’ll start by looking at a healthy production instance of our Tomcat process, which is the only process running as the user tomcat, so getting its process id is straightforward.
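Any command that isolates that process will do; pgrep is assumed here:

    # find the pid of the lone java process owned by the tomcat user
    # (pgrep is an assumption; ps piped through grep works just as well)
    PID=$(pgrep -u tomcat java)
    echo "$PID"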

We can use the process id to get the open files limit for this process from the proc filesystem:
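With $PID holding the process id from above:

    # per-process limits are exposed in procfs
    cat /proc/$PID/limits | grep 'open files'
    # on our instances this reports a limit of 4096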

We could increase this limit, which is admittedly on the low side, but historically we have never bumped up against it during peak loads. Increasing it to a larger value would probably fix the problem in the near term (perhaps permanently for a large enough value), but it also hides the fact that our process may be leaking system resources. For now we’ll leave the limit as is, and figure out a way to monitor our open files usage. We can get a count of the open files currently in use as long as we have the proper permissions:
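Counting the entries in the process’s fd directory does it (root access is assumed):

    # each entry in /proc/<pid>/fd is one open file descriptor;
    # reading another user's fd directory requires root (or being that user)
    sudo ls /proc/$PID/fd | wc -l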

Let’s put that count in a bash script that uses the AWS CloudWatch CLI to record it as a single data point on our metrics dashboard.
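Here is a sketch of the script; the namespace, the metric name, and the use of the modern aws CLI (rather than the exact CloudWatch tooling we had installed at the time) are illustrative:

    #!/bin/bash
    # /usr/sbin/mon-tomcat-of -- report Tomcat's open file count to CloudWatch
    # (namespace and metric name below are illustrative)

    PID=$(pgrep -u tomcat java)
    OPEN_FILES=$(ls /proc/$PID/fd | wc -l)

    # identify this instance via the EC2 instance metadata service
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    aws cloudwatch put-metric-data \
      --namespace "RoomKey/Tomcat" \
      --metric-name "OpenFiles" \
      --dimensions "InstanceId=$INSTANCE_ID" \
      --value "$OPEN_FILES" \
      --unit Count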

We obtain the AWS instance id from the instance metadata, since we will need to identify the instance when there is a problem. AWS CloudWatch metrics can be recorded at a resolution of one data point per minute, which makes cron a nice tool to handle the reporting, since tasks can be scheduled at that same frequency. To do this, create a file (name it anything) in the /etc/cron.d directory with the following format (assuming we have saved the above bash script as /usr/sbin/mon-tomcat-of and set the executable bit).
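Running the script as root once a minute keeps the permissions on /proc simple:

    # /etc/cron.d/mon-tomcat-of
    # m h dom mon dow user  command
    * * * * * root /usr/sbin/mon-tomcat-of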

Now we’ll have a metric continuously tracking the number of open files our service is using. This is useful for analyzing trends in our file usage during peak and off-peak hours. Here’s a look at a week’s worth of metrics for a normally operating instance:

We never go above about 1500 open files for an entire week. This week happens to occur during the Thanksgiving holiday in the US, so it gives us some nice variation from holiday traffic (lots of people offline) to the following Monday.

The above instance actually operated normally for about two weeks without experiencing any “too many files” errors, while several other instances in our scaling group did become unhealthy during this period. Here is a chart of an unhealthy instance:

Note that the number of open files increases in a stair-step pattern every six hours until we hit the 4096 limit. That six-hour period is rather suspicious, but we’ll come back to it later.

It is rather impractical to try to manually track this metric across every instance of our service, so we’ll need to configure an alarm to notify us when the problem occurs.

Alarms

AWS provides a tool to create alarms that notify you (via SMS or email) when a metric reaches a certain threshold. Again, this is not something we want to configure manually, since we may be rolling out a new set of instances several times a week. Instead we want to create the alarm when an instance comes online, and delete the alarm when the instance is terminated. I created an init script to handle creating and deleting the per-instance alarms.
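The important section looks roughly like this; the alarm name, the namespace, and the SNS topic ARN are placeholders:

    # start/stop fragment of the init script (names and the ARN are placeholders)
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    ALARM_NAME="tomcat-open-files-$INSTANCE_ID"

    case "$1" in
      start)
        aws cloudwatch put-metric-alarm \
          --alarm-name "$ALARM_NAME" \
          --namespace "RoomKey/Tomcat" \
          --metric-name "OpenFiles" \
          --dimensions "Name=InstanceId,Value=$INSTANCE_ID" \
          --statistic Maximum \
          --period 60 \
          --evaluation-periods 1 \
          --threshold 2100 \
          --comparison-operator GreaterThanOrEqualToThreshold \
          --alarm-actions "arn:aws:sns:us-east-1:123456789012:open-files-alerts"
        ;;
      stop)
        aws cloudwatch delete-alarms --alarm-names "$ALARM_NAME"
        ;;
    esac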

Again, refer to the AWS CloudWatch CLI documentation for more info. I set the open files threshold to 2100 to give myself a few hours to investigate before hitting the 4096 limit. I also used an SNS topic configured to send me an SMS message when the alarm is triggered. Hopefully I won’t be asleep.

About a week or two after deploying the alerts to production, I got my first notification text. Unfortunately it happened on a Saturday evening right as I had started to cook dinner. Still, I couldn’t resist my curiosity, so I took a minute to attach the YourKit profiler to record as much JVM activity as I could before the instance became unhealthy. When I had a bit of free time I was able to look at the data the profiler had recorded. The first thing I looked at was the Events tab, specifically the “Socket Connections” section. I noticed a large number of open socket connections used by our asynchronous HTTP client, although it was difficult to get an exact count from the UI. A few days later, the problem occurred again, so I SSHed to the instance and took a closer look at the types of files that were open by using the -l option when listing the files in procfs.
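A long listing shows each descriptor as a symlink to whatever it refers to:

    # -l reveals what each descriptor points at: sockets, pipes, epoll instances, etc.
    sudo ls -l /proc/$PID/fd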

While a number of sockets were open, I saw many more files with names like anon_inode:[eventpoll] and pipe:[23595]. A bit of searching led to this Stack Overflow discussion: Why is the JDK NIO using so many anon_inode file descriptors? At this point I was pretty confident the problem was related to our (Netty/NIO-based) async HTTP client.

The Code

When you find a new bug in production, the obvious place to look is the most recently deployed code. The team looked at the commit log for the latest release, but nothing really jumped out at us. I recalled from our daily standups that within the past few months we had updated some of the code and dependencies for making asynchronous HTTP requests. Rolling back to the previous version wasn’t really an option because the new version fixed an existing bug. I was also curious about what was causing the six-hour periodicity we saw in the chart above. I grepped the code for six hours in milliseconds (21600000) with no results, but then while scanning our namespace that handles HTTP requests, I found what I was looking for.
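In rough outline (the names and the scheduling mechanism below are illustrative, not the actual source), it looked something like this; the interval is not written as a raw millisecond value, which would explain why the grep came up empty:

    ;; Illustrative sketch, not the actual source.
    (import '(java.util.concurrent Executors TimeUnit))

    (defn retire-all-clients!
      "Placeholder: mark every live HTTP client for retirement."
      []
      ;; ... queue each existing client to be closed ...
      )

    (defonce client-retirement-scheduler
      (doto (Executors/newSingleThreadScheduledExecutor)
        ;; every six hours, retire all of the existing HTTP clients
        (.scheduleAtFixedRate retire-all-clients! 6 6 TimeUnit/HOURS)))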

It turns out we have a task scheduled every six hours that retires all of our existing HTTP clients! Even better, we actually had some logging statements (off by default) that described when a client was marked for retirement and when the client’s connections were actually closed. We use a Clojure agent as a queue to store the set of HTTP clients that were marked, and then call close on each one. With logging turned up for this namespace, I could see that clients were being queued up, but never closed. I connected to the running instance via an embedded nREPL server to get a bit more information on what was going on, and could see that the HTTP client at the head of the queue had been marked for retirement hours ago but had still not been closed. That moment also corresponded with the beginning of the stair-step shape in our AWS CloudWatch chart above. Retired clients that were never closed were the actual source of our leak.

After checking the commit history of the namespace, I discovered that the code for retiring HTTP clients had been refactored when the dependency was updated, reusing a single agent instead of creating a new agent to serve as the queue for each cycle, so once one close call hung, no later client was ever closed. While the refactor was not the ultimate cause of the leak, it accelerated the exhaustion of our open files limit in production.
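A stripped-down sketch of that pattern (illustrative, not our actual namespace) shows why a single stuck close stalls everything queued behind it:

    ;; Illustrative sketch, not the actual source.  Actions sent to a single
    ;; agent run one at a time, so if one close call never returns, every
    ;; client queued after it is never closed.
    (def retirement-queue (agent []))

    (defn retire-client!
      "Queue an HTTP client to be closed later on an agent thread."
      [client]
      (send retirement-queue
            (fn [closed-so-far]
              (.close client)              ; blocks forever in the bad case
              (conj closed-so-far client))))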

The Cause

I needed to connect with the YourKit profiler again to see why the HTTP client at the head of the queue was not being closed. It shouldn’t be hard to find the thread that is hung, because it will be in Clojure’s built-in Agent thread pool. In my case there are 10 threads in the pool, and I find that the hanging thread is the third (open the image in a new tab for higher resolution):

Here’s a closer look at the stack trace, where the bottom of the stack is our HTTP client retirement function calling the Clojure http.async.client library:

It appears the call to close is waiting on a lock, specifically a java.util.concurrent.CountDownLatch. Fortunately, the underlying Netty library is open source, so I can inspect the Java class that holds the lock. Here is the line where we're stuck. If I search the class for other references to shutdownLatch, I can find the place where the lock is intended to be released. It is inside an infinite loop within the run() method of this class, so there must be another thread stuck in this loop, preventing the shutdown section from being run. At this point, I'm going to have to do a bit of visual scanning through the YourKit thread table, focusing on orange (Waiting) or red (Blocked) threads. I scroll down to a group of threads called "New I/O worker" and find a single waiting thread:

Here’s the relevant portion of the stack trace:

Here is the call to http.async.client/safe-get in the Clojure async HTTP library, which was invoked by the onCompleted callback of the request. Notice the docstring of the body function: "Gets body. If body have not yet been delivered and request hasn't failed waits for body.". It appears that something interesting happened: we received a successful HTTP response with an empty body. I looked at the existing test suite for http.async.client and found a test for a similar scenario, although it did not use an onCompleted callback (essentially a synchronous HTTP request). I created a new test that specifies an onCompleted callback which dereferences the body, then attempted to close the client on a separate thread, delivering a promise if the call to close returns. As expected, the test fails. Even more promising, when I tested at the REPL, I was able to find two threads with the same stack traces as on my production instance. This sure looks like the cause of the problem.
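Boiled down, the deadlock looks something like this (a minimal illustration, not the library’s code or the actual test):

    ;; Minimal illustration of the deadlock, not the library's code.
    (def body (promise))    ; for an empty response, the body is never delivered

    (def io-worker          ; stands in for the Netty "New I/O worker" thread
      (future @body))       ; the onCompleted callback derefs the body and blocks

    (def closer             ; stands in for our agent thread calling close
      (future @io-worker))  ; close waits for that worker to finish, which it never does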

The Fix

After all that work, the fix is fairly simple. There was already a check to deliver a nil body when a successful response was empty; it just happened to run after the onCompleted callback was invoked. The fix simply moves that delivery before the callback. I also added a wrapper function for an asynchronous close, which the underlying Java library already provided. This way, closing the client happens on a new thread without holding up the agent thread if something else goes wrong in the future.
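In outline (the names here are illustrative, not the actual patch), the reordering looks like this:

    ;; Illustrative outline of the reordering, not the actual patch.
    (defn on-response-completed [response callback]
      ;; deliver the (possibly empty) body first...
      (when-not (realized? (:body response))
        (deliver (:body response) nil))
      ;; ...so a deref inside the user's callback can no longer block forever
      (callback response))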

Retrospective

A few factors conspired to make this problem difficult to debug. The first is that the condition that causes the leak (one of our third-party APIs returning an empty body with an otherwise successful HTTP response) is pretty rare. We issue about 2 to 3 million requests per day to these APIs and would see the problem occur every three to four days, so roughly one in 10 million API requests returned a problematic response. Relatedly, there was over a month (and multiple production deploys) between the initial commit that precipitated the problem and the first production instance going down, making it difficult to isolate what had changed. It would have been impossible to debug this problem without setting up the custom metrics and alarms in CloudWatch. The profiler and the REPL also proved to be very efficient tools for inspecting the system once it began to exhibit the problem. And of course, you can’t beat having good, informative log messages.

Footnotes

  1. Cyber Monday in fact, although we do not see the extreme spikes in traffic that the retail industry experiences.
  2. It’s actually pretty rare for us to have a single instance online for two weeks. We usually deploy new releases of the service once or twice a week, but we slowed down our cycle due to the holiday.
  3. As an alternative, you could combine Auto Scaling Group notifications with AWS Lambda functions to create and delete alarms.
  4. A note on security: we only have HTTP and SSH ports open on our production instances, with the SSH port only accessible via a bastion host, so reaching the profiler agent on localhost requires a somewhat complicated SSH tunnel into the instance.
  5. We actually have the nREPL server turned off by default, and enable it using an authenticated endpoint. The same note on security applies.
  6. You could use the free Java Mission Control to do this investigation instead, but I do not have our production JVMs configured to accept remote connections.