Follow the stacktraces — jvm performance profiling

Muki Seiler
6 min read · Mar 14, 2018


We recently cut our page response time by a factor of three by fixing a very memory-expensive implementation on our hot code path. This article focuses on the detective work that got us there, and shows that you don’t need to be a highly trained performance specialist to find these kinds of problems. The JVM ships with some great tools that help you.

Setup

Before we dive into the details, you need a basic understanding of our technical environment and how we work at gutefrage.net.

We code on and deploy from master. This means 10–40 releases per day. Mesos and Aurora make this possible.

Our backend consists of ~25 microservices written in Scala on top of the amazing Finagle framework.

Developers and designers work together based on the atomic design pattern by creating Handlebars templates for each component. We render these templates with the jknack handlebars library and serve them with an akka-http powered server. A finagle-http client fetches the required data from our REST API to render our pages.

Scala is our core language and the JVM our core runtime.

Response Times and other metrics

We monitor the response times for all our important pages. At gutefrage this is the “QDP” — the question detail page. These pages generate the majority of our traffic, thus the response times are crucial for our users.

These are the response times for the first week of February:

75% of all requests are handled within 450ms, which is okay. The p99, however, is very volatile, spiking up to 1.5 seconds. This means some clients wait over 1.5 seconds until the first byte arrives.

We also monitor some basic JVM stats, like the total thread count and the waiting thread count:

This is a regular week. A drop usually means a deployment is going on.

Response times increase

We have seen the usual metrics. Then something happened on the 7th of March around 20:00. Response times started to increase, up to 7s in the 75th percentile.

So we started digging.

#1 Find the responsible commit

We think that one of the biggest benefits of continuous integration is the immediate feedback. You push something and can watch the metrics. Thus our usual approach is to git bisect and find the responsible commit.

As the response time was the metric we wanted to fix, we tested things locally with Apache ab. ab fires a configured number of requests with a configurable concurrency.

ab -n 150 -c 4 "http://localhost:8080/frage/was-ist-der-klare-unterschied-zwischen-case-class-und-case-object-scala"

We started to search for a commit where the response times differed heavily from the current ones. We couldn’t find one. Even the commits from three days before the increased response times had similar response characteristics as the current code on master.

Sophisticated guess

Some code change had to be the reason; maybe with some delay. So we made a sophisticated guess without proof or evidence. Surely not an advisable thing for performance debugging. We had a bigger refactoring in our request context that was built for each incoming request:

  • the feature switch context was now built eagerly on each request
  • the feature switch context accesses a map that holds the information whether a feature is active or not for the given request context
  • the data structure was guarded with @volatile

Our theory was something like:

We generate the context on each request and not on demand, so there may be a high contention on the variable.

Our only clue was the thread waiting count that increased.

We were wrong. Neither lazy loading the feature switch context nor removing the volatile keyword changed any of our metrics.
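For illustration, a minimal sketch of the eager and lazy variants we compared (all names here are invented, not our actual code):

```scala
// Hypothetical feature switch context (names invented for illustration).
final class FeatureSwitchContext(features: Map[String, Boolean]) {
  def isActive(feature: String): Boolean = features.getOrElse(feature, false)
}

final class RequestContext(loadFeatures: () => Map[String, Boolean]) {
  // The refactoring built this eagerly on every request:
  //   val featureSwitches = new FeatureSwitchContext(loadFeatures())

  // The attempted fix built it lazily, only when a template actually
  // asks for a feature switch:
  lazy val featureSwitches: FeatureSwitchContext =
    new FeatureSwitchContext(loadFeatures())
}
```

Note that lazy val synchronizes internally on first access, so it would not have removed lock contention either.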

#2 Profile — looking for hot methods

Meanwhile our Operations team shared a helpful piece of information. The machine load on our frontend servers increased dramatically. The chart also shows the load after we fixed the issue.

We fired up Java Mission Control, which ships with the JDK. It’s the successor of JVisualVM and lets you dig deep into a running JVM instance. You only need two extra JVM parameters to enable flight recordings. In our build.sbt this looked like this:

javaOptions in reStart ++= Seq(
  "-XX:+UnlockCommercialFeatures", "-XX:+FlightRecorder"
)

Update
You can use the Flight Recorder in your dev environment without an Oracle license. However, you need one if you want to turn this feature on in production. This may change in the future.

We generated some traffic with ab and started a flight recording. Our expectation was to find some call path that was highly CPU intensive. However, the results were as they should be:

No big outliers. Most of the CPU time is either spent in commonly used packages (scala and java lang) or in our rendering engine.

#3 Profile — looking for clues

Only highly optimized code can leverage the power of multi core systems.
— me trying to sound clever

In my few years as a software developer, I had only ever seen 100% CPU usage on all cores of an application when there was a memory leak and the garbage collector was trying to free memory.

With this in mind we took a look at the Memory tab in Java Mission Control. And we found the evidence we were looking for.

It took only 300 requests to produce an allocation rate of 660 MB/s. This resulted in 32 garbage collection cycles, the longest of which took 256ms. That is way too much garbage for a few requests.
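You can cross-check this kind of GC pressure even without JMC: the platform MXBeans expose the same counters in-process. A minimal sketch (not part of the original investigation):

```scala
import java.lang.management.ManagementFactory

// Sums GC cycle count and accumulated pause time (ms) across all collectors
// (e.g. young and old generation) registered in this JVM.
def gcStats(): (Long, Long) = {
  var cycles = 0L
  var millis = 0L
  val it = ManagementFactory.getGarbageCollectorMXBeans.iterator()
  while (it.hasNext) {
    val bean = it.next()
    cycles += bean.getCollectionCount
    millis += bean.getCollectionTime
  }
  (cycles, millis)
}
```

Sampling this before and after a burst of ab traffic gives a rough cycles-per-request figure.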

In the Allocations -> Allocation in New TLAB tab there is a nice overview called Allocation Pressure. There you can see the total amount of memory that was allocated for a specific class, along with the full stacktrace where it was allocated and how much memory each path is responsible for.

98% of the top three allocations were produced by a call to java.lang.Throwable.fillInStackTrace. Inspecting the stacktrace revealed that a NumberFormatException was created in our CaseClassValueResolver. This particular class takes care of resolving a variable name inside a case class. It also handles index accesses for sequences. The code looks like this:

override def resolve(context: scala.Any, name: String): AnyRef = {
  (context, Try(name.toInt).toOption) match {
    case (l: SeqLike[_, _], Some(numericName)) => // list access
    case (product: Product, _) => // resolve name from case class
    case _ => ValueResolver.UNRESOLVED
  }
}

We decide whether the name is an integer by wrapping the conversion in a Try and converting it to an Option (throwing away the NumberFormatException). On each template render we threw an exception for every field in every template we use. Due to the nested and separated nature of atomic design, we have a lot of templates, and our view models aren’t what I would call slim.
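As an aside: nearly all of that allocation cost sits in fillInStackTrace, which walks the entire call stack when the exception is constructed. The JVM even lets an exception opt out of this (a side-note sketch, not the fix we shipped):

```scala
// The four-argument Throwable constructor (JDK 7+) can disable the
// stack walk via writableStackTrace = false, making construction cheap.
class StacklessException(msg: String)
    extends RuntimeException(msg, null, false, false)
```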

The fix was trivial though.

// check if the name seems to be a number
// (the nonEmpty guard keeps an empty name from reaching toInt)
val numericNameOpt =
  if (name.nonEmpty && name.forall(Character.isDigit)) Some(name.toInt)
  else None
(context, numericNameOpt) match {
  // ...
}

This resulted in an amazing drop in our response times. The p99 is now somewhere between 200 and 300ms.
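For the names that actually occur in templates (plain field names and small non-negative indices) the old and new checks agree; a quick comparison, with helper names invented here for illustration:

```scala
import scala.util.Try

// Old approach: allocates a NumberFormatException for every non-numeric name.
def viaTry(name: String): Option[Int] = Try(name.toInt).toOption

// New approach: a plain digit scan, no exception on the common path.
// The length guard keeps toInt from overflowing on absurdly long digit runs.
def viaDigitCheck(name: String): Option[Int] =
  if (name.nonEmpty && name.length <= 9 && name.forall(Character.isDigit))
    Some(name.toInt)
  else None
```

The two only differ on inputs like "-1" (which parses via Try but is rejected by the digit scan), and a negative number is not a valid sequence index anyway.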

Lessons learned

  1. Never use exceptions for control flow in your hot code path
  2. Monitoring and profiling are key to finding performance bugs
  3. The JVM and Akka are both amazingly stable products that kept working for over a year under this kind of GC pressure


Muki Seiler

Scala, Java and Open Source Enthusiast. sbt-native-packager