Troubleshooting in the Dark

Avihay Tsayeg
skai engineering blog
8 min read · Feb 8, 2018

Troubleshooting JVM applications is a skill you can master with relative ease. Given the abundance of online resources and decades of accumulated JVM troubleshooting experience, a developer working with JVM languages really has to own it.

Troubleshooting an unfamiliar JVM application on Docker in AWS behind a load balancer is a little harder.

This became apparent to me during a recent production incident I was asked to handle in one of our products.

Episode 1

I was approached by an engineer from a team responsible for a highly scalable data-oriented microservice. The engineer said they were unable to resolve a “major production” issue.

The team was mostly new. The more experienced team members who had built the application had already moved on to other projects or companies.

Me: “So what’s the story? Tell me all about it.”

Engineer: “This application has been running for a couple of years and has suddenly started to hiccup (i.e., stop processing data) for long periods of time. It happens right after every deployment.”

Me: “Do you have any leads on what got this issue started? Were there any major changes? Can you reproduce it?”

Engineer: “I have no idea. There were no major changes and we can’t reproduce it in our staging environment.”

Well, there were no good answers here. We were completely in the dark.

And so the conversation continued…

Me: “What are the symptoms?”

Engineer: “The load balancer seemed to identify one of the hosts as unhealthy, and then it terminated the host and replaced it.”

Me: “What’s the app’s architecture like?”

Engineer: “A couple of servers on EC2 behind an ELB. When a machine starts up, it retrieves the required app version and starts a Docker container.”

Me: “Can we see the application logs?”

Engineer: “Yeah, we ship them to Logz.io. There’s nothing there.”

Me: “Can we see the machine logs?”

Engineer: “Nope. When the load balancer identifies that the host is not healthy it removes it entirely, losing all the local files.”

The engineer was starting to show signs of real concern…

Me: “OK, OK. So first we need to keep the machine alive a bit longer, so that the logs remain available for further investigation.”

So we added a termination lifecycle hook to the Auto Scaling group, giving us a grace period before the machine shuts down. This kind of hook is typically used for draining the machine before it is removed.
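
For reference, here’s a minimal sketch of what creating such a termination hook looks like with the AWS SDK for Java (v1); the group name, hook name and timeout are hypothetical:

import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PutLifecycleHookRequest;

public class AddTerminationGracePeriod {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Hold terminating instances in the Terminating:Wait state for up to 30 minutes,
        // so logs and dumps can be pulled off the box before it disappears.
        autoScaling.putLifecycleHook(new PutLifecycleHookRequest()
                .withAutoScalingGroupName("data-service-asg")   // hypothetical ASG name
                .withLifecycleHookName("keep-alive-for-logs")   // hypothetical hook name
                .withLifecycleTransition("autoscaling:EC2_INSTANCE_TERMINATING")
                .withHeartbeatTimeout(1800)                     // grace period, in seconds
                .withDefaultResult("CONTINUE"));                // let termination proceed when time is up
    }
}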

Episode 1 Conclusions:

  • Ship all your logs to a log aggregator like Logz.io.
  • Get to know AWS Auto Scaling group lifecycle hooks.

Episode 2

We looked at an alert triggered in AWS, which notified us about one of the machines going down. This time, we could see the logs!

Looking at the app’s GC logs, we couldn’t find anything interesting.

Next, we took a look at the Docker daemon log. A very clear message indicated that it couldn’t allocate memory, causing the app container to restart continuously (see Configure and troubleshoot the Docker daemon).

The timestamp of these logs matched the time of the app hiccup that had appeared in our monitoring a few hours earlier.

Looking at the machine logs under /var/log/messages, we discovered kernel OOM-killer messages, indicating that the machine was running out of memory and the kernel was killing its largest consumer, which, not surprisingly, was the Docker container.

So I had a few more questions for my fellow engineer:

Me: “Say, did we limit the Docker container memory allocation? What are the heap size limits?”

Engineer: [scratching his head] ”Hmmm… there’s no limit. I don’t really know.”

Me: [Thinking to myself] So…

We hadn’t limited the container memory and we had over-allocated the JVM heap. On top of that, we hadn’t set the max heap size equal to the min heap size (-Xmx=-Xms, which forces the JVM to claim the full heap up front).

That is also why the OOM incident did not show up immediately: it takes some time for the heap to grow to its full, over-allocated size.

Episode 2 Conclusions:

  • Limit your Docker container memory:
    docker run --name=app --memory=1G hello-world
  • Read Docker’s user guide (Run your app in production section).
  • Calculate how much memory you allocate to the JVM heap: take into account the overhead of the OS, the Docker daemon and other processes, and remember that the JVM’s non-heap memory usage can add 25% or more on top of the heap.
  • Set -Xmx equal to -Xms. This is a best practice performance-wise and surfaces over-allocation earlier on (see the sketch below).
(Screenshot: Oracle’s JVM Tuning Considerations)
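
A cheap way to surface that math at runtime is to log the heap the JVM actually reserved against the container limit you expect. This is just a sketch; the 1 GiB limit mirrors the docker run example above and the 75% threshold is an arbitrary rule of thumb:

public class HeapBudgetCheck {
    // Hypothetical container limit we expect to run under (1 GiB, matching the docker run example above).
    private static final long CONTAINER_LIMIT_BYTES = 1024L * 1024 * 1024;

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // the effective -Xmx
        System.out.printf("Max heap: %d MB, container limit: %d MB%n",
                maxHeap / (1024 * 1024), CONTAINER_LIMIT_BYTES / (1024 * 1024));

        // Leave headroom for the OS, the Docker daemon and the JVM's own non-heap memory.
        if (maxHeap > CONTAINER_LIMIT_BYTES * 0.75) {
            System.err.println("WARNING: heap exceeds ~75% of the container limit; the OOM-killer may strike");
        }
    }
}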

Episode 3

Engineer: “Wait! We’re not done yet. It’s happening again… same symptoms!”

Me: [:-(] Oh no, what is it now??

Luckily the machine was still up so we were able to examine it (see Episode 1 Conclusions).

Looking at the GC log again, we now saw it loud and clear:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

We now needed to understand what caused this GC overhead. We added 3 parameters to the JVM to dump the heap into a (zipped!) file upon an out-of-memory error and upload it to S3 (with proper file retention).

The 3 JVM parameters are:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/log-dir/app-`HOST`-`date +%s`-pid$$.hprof
-XX:OnOutOfMemoryError=upload_to_s3_script.sh

NOTE: The order of these arguments might be important. Read more in this Stack Overflow discussion, or write a “smart” script to handle it.

Now that we were able to get the heap-dump file, there was no need to use the lifecycle hook. (We still use it for teardown.)

To allow remote monitoring of the JVM and remote actions such as heap dumps, thread dumps and performance snapshots (.nps files), we added the following JVM parameters:

-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=1234
-Dcom.sun.management.jmxremote.rmi.port=1234
-Djava.rmi.server.hostname=HOST (the hosting machine IP/name)

See GitHub docker example and some more relevant info.
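
To illustrate what those flags enable, here is a minimal sketch of a remote client reading heap usage over JMX; the host name is a placeholder and the port matches the flags above:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteHeapUsage {
    public static void main(String[] args) throws Exception {
        // "app-host" is a placeholder for the machine IP/name set in java.rmi.server.hostname.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://app-host:1234/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("Heap used (bytes): " + memory.getHeapMemoryUsage().getUsed());
        }
    }
}

This is essentially the same connection JConsole or VisualVM opens when you point it at host:1234.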

Score!!! An email notification pointed us to the heap-dump file on S3.

By analyzing the heap dump, we found the memory leak as well as the reason why it started just now:

A MySQL query result that was loaded into memory had grown because of a change in the way our data behaved, and occasionally it reached a point where it was too much for our heap.

Our solution was to stream the ResultSet from the database instead of loading it all into memory, by setting the statement’s fetch size to Integer.MIN_VALUE:

stmt.setFetchSize(Integer.MIN_VALUE)
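
For context, here’s a minimal sketch of what that looks like with MySQL Connector/J (connection details, table and column names are made up); the driver only streams rows one at a time when the statement is forward-only, read-only and the fetch size is Integer.MIN_VALUE:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingQuery {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://db-host:3306/app", "user", "password");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            // Ask MySQL Connector/J to stream rows instead of buffering the whole ResultSet in memory.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM events")) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("payload")); // one row at a time
                }
            }
        }
    }

    private static void process(long id, String payload) {
        // placeholder for the real per-row processing
    }
}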

To catch similar issues before they become critical, you should always send JVM metrics to a graphing tool (we use Hosted Graphite), so you can spot changes in memory behavior before any outage occurs. You can use collectd with a Graphite plugin to send stats periodically.
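
collectd does the periodic collection for us, but just to illustrate the idea, here is a bare-bones sketch that pushes current heap usage straight to Graphite’s plaintext protocol (the host and metric path are made up):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.lang.management.ManagementFactory;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HeapToGraphite {
    public static void main(String[] args) throws Exception {
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        long timestamp = System.currentTimeMillis() / 1000;

        // Graphite's plaintext protocol: "<metric.path> <value> <unix timestamp>\n" on port 2003.
        try (Socket socket = new Socket("graphite-host", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write("app.jvm.heap.used " + heapUsed + " " + timestamp + "\n");
        }
    }
}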

Episode 3 Conclusions:

  • Make sure to add the OOM parameters to any JVM application.
  • The JMX port is your friend. Open it (but make sure it’s secure).
  • If a query result might be large, use streaming.
  • Send JVM metrics to Hosted Graphite using collectd.
  • When shipping heap dumps or logs to an S3 bucket, make sure it has a proper retention policy. Don’t keep large files forever.
(Screenshot: JConsole MBeans view)

Episode 4

With a sigh of relief I went home thinking:

Me: OK, we can relax now. Issue solved, logs shipped to Logz.io, JVM metrics sent to Graphite, alerts added. We’re out of the woods now, right?

Wrong!!!
Phone rings at 2 AM…

Engineer: “Hi, it just happened again. Same symptoms!”

Me: [ :-( :-( ] “OK, I’m on it.”

We had a clear view now: the JVM was OK, the machine was OK… so what was it this time?

So once again, we logged into the machine and looked at the GC logs, the machine logs, and the application logs, using the top command to verify that CPU and memory usage were normal.

It all seemed fine, so why did the load balancer determine the opposite?

Next step: We started checking the app endpoints, hoping it would trigger the same failure the load balancer detected.

curl localhost:8080

This one was fine, but what else could be wrong?

curl localhost:8081/healthcheck

We used healthchecks to expose our application’s status and surface issues such as database connectivity, JVM memory usage and more.

Uh-oh! The healthcheck results are bad!

The healthcheck failed due to low disk space, but that shouldn’t have concerned the load balancer; it should only have triggered an alert.

Apparently the ELB was calling the healthcheck page. We changed it to call host:8080 (and freed up some disk space on the way).
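
The fix boils down to serving two different answers: a shallow “is the process up?” endpoint for the ELB, and a deeper healthcheck used only for alerting. Here’s a rough sketch of that separation using the JDK’s built-in HTTP server; the ports match the curl commands above and the disk threshold is just illustrative:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class HealthEndpoints {
    public static void main(String[] args) throws Exception {
        // Port 8080: liveness only. The ELB should learn nothing more than "the process responds".
        HttpServer liveness = HttpServer.create(new InetSocketAddress(8080), 0);
        liveness.createContext("/", exchange -> respond(exchange, 200, "OK"));
        liveness.start();

        // Port 8081: deep healthcheck. Used for alerting, never for ELB instance health.
        HttpServer health = HttpServer.create(new InetSocketAddress(8081), 0);
        health.createContext("/healthcheck", exchange -> {
            long freeDiskMb = new File("/").getUsableSpace() / (1024 * 1024);
            boolean healthy = freeDiskMb > 500; // illustrative threshold
            respond(exchange, healthy ? 200 : 500, "{\"freeDiskMb\":" + freeDiskMb + "}");
        });
        health.start();
    }

    private static void respond(HttpExchange exchange, int status, String body) throws IOException {
        byte[] bytes = body.getBytes();
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(bytes);
        }
    }
}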

Finally, everything was OK now. The incident was marked as complete. Joy! Back to sleep.

Episode 4 Conclusions:

  • The application should expose its state using metrics or healthchecks. We should add alerts to this data and respond accordingly.
  • ELB healthchecks must check only whether the app is responsive, not its internal application state.
  • Make sure you write your healthcheck correctly (How? We’ll tackle that in a future post.).
(Screenshot: the incident’s healthcheck page)

To Sum Up:

Since then, we’ve had several other incidents. Each new incident was monitored and handled very quickly (without needing to log in to the host). By the time we were done with that first incident, not only had we finally fixed it, we were also much better equipped to handle future issues faster and better. New services built at Kenshoo include all the new tools (JVM metrics available in Graphite and on demand, logs, heap dumps, etc.) from the get-go.

Don’t stay in the dark. Debugging in the dark can be a terrible (yet enriching) experience. Avoid it by using proactive monitoring, metrics and logs to get a head start on your next production incident.

Even if you’re running on Docker or on AWS behind an ELB, don’t let that get in the way of having clear visibility into the state of your application.

Good luck!
