Monitoring application latency with jHiccup

About jHiccup

jHiccup is a simple application that helps measure operating system (OS) "hiccups": freezes or delays seen from the perspective of a particular application. It was written by Gil Tene, CTO of Azul Systems, to measure OS-level latencies.

Why is latency so important?

We live in a time of networked applications. Most of the applications launched on our machines regularly fetch data from the internet. If we open a browser and visit google.com, we can see that the browser makes around 50–60 network requests (Fig. 1). For more complex sites, we observe hundreds of requests. A lag in any single request may delay the rendering of the whole site.

Fig. 1. It took 56 requests to open google.com

The same idea applies to client-server applications and the now-popular microservices. If one microservice in a chain of requests replies later than usual, it may delay the whole workflow. For example, a slow response from the database with an item's price can slow down the entire purchase workflow in an online shop. Therefore, when we talk about application performance and latency, it is important to check not only the best or average result but also the worst one.

When do we need OS monitoring?

Quite often, developers receive complaints about slow responses from clients or other systems. Unfortunately, such problems are tricky to reproduce locally or even to spot in production. In this situation, many developers immediately start looking for the problem in the code, using, for example, logs, metrics, and a profiler. However, a correct performance analysis should go from bottom to top: starting at the hardware and OS level and finishing with the application.

Most OSes are not real-time operating systems, which means they cannot guarantee how long a particular operation will take. Therefore, the performance of an application running on such an OS may vary dramatically: the application may simply not get CPU time at a particular moment. In that case, it doesn't matter what code your app executes. Below are some reasons why an app may "sleep" without a chance to make progress:

  1. The OS may be busy with internal housekeeping (e.g., flushing caches or reclaiming memory).
  2. Other resource-intensive apps may be consuming the CPU or other resources.
  3. The OS may run on top of a hypervisor, which can take CPU time away from the guest.

Why jHiccup?

There are many tools and metrics that show the load on different system components at various levels of detail. The issue with such metrics is that, for every metric, a developer has to answer two questions:

  • Is this metric the root cause of the latency?
  • Is the current level of this metric abnormal?

jHiccup lets us look at the system from the application's perspective. jHiccup is a small app with a simple function: in an endless loop, a thread falls asleep and asks the OS to wake it up after a particular interval, e.g. 1 second. If the OS was busy and could not wake the thread after exactly 1 second, the app notices this by comparing the actual wake-up time with the planned one (the time when the thread fell asleep plus 1 second). We can then plot these latencies over the application's run (Fig. 2).

Fig. 2. The x-axis shows program execution time in seconds. The y-axis shows the latency in milliseconds (how long the thread waited beyond its planned wake-up time before the OS resumed it).
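Conceptually, the measuring loop can be sketched as follows (a minimal illustration of the idea, not jHiccup's actual code; the class name is made up):

public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 1000; // requested sleep interval
        while (true) {
            long plannedWakeUp = System.currentTimeMillis() + intervalMs;
            Thread.sleep(intervalMs);
            // Any time beyond the planned wake-up is a "hiccup".
            long hiccupMs = System.currentTimeMillis() - plannedWakeUp;
            if (hiccupMs > 0) {
                System.out.printf("Hiccup: %d ms%n", hiccupMs);
            }
        }
    }
}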

If we know the time when a customer complained about the slowness of our system, we can check the latency plot (actual wake-up times minus planned wake-up times) at that moment. Using this data, we can conclude whether or not the OS was the reason for the latency.

The last plot showed the relationship between latency and the application's execution time. In addition, it is useful to take all latencies and sort them in ascending order. This gives an idea of the latency distribution and the probability of each latency level (Fig. 3).

Fig. 3. The latency in milliseconds (y-axis) and its probability (x-axis).

jHiccup doesn't suffer from Coordinated Omission, the measurement problem described by Gil Tene in his video.

Internally, jHiccup uses a histogram as its data structure. A plain histogram splits the whole interval of latencies (e.g. from 1 ms to 1 sec) into equal segments and then counts how many latencies fall into each segment. This stores the data more compactly than a simple list of observed values (1.55 ms, 2.6 ms, etc.).
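For illustration, such a fixed-width histogram could be sketched like this (a made-up example, not jHiccup's implementation):

public class SimpleHistogram {
    // 100 buckets of 10 ms each cover latencies from 0 to 1000 ms;
    // anything larger falls into the last bucket.
    private final long[] counts = new long[100];

    public void record(double latencyMs) {
        int bucket = (int) Math.min(latencyMs / 10, counts.length - 1);
        counts[bucket]++;
    }

    public long countInBucket(int bucket) {
        return counts[bucket];
    }
}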

In fact, jHiccup uses a special histogram implementation, HDR-Histogram, which has some advantages:

  • HDR-Histogram has high resolution. We can see not only the 95th and 99th percentiles of the worst results, but also much more detailed data (e.g. the 99.999th percentile);
  • HDR-Histogram stores data in a compact structure, which makes it feasible to measure performance over a long period. This works because the size of the histogram segments grows exponentially, so abnormal (large) values are stored more compactly.

The HDR-Histogram library has become quite popular, and implementations exist in different programming languages. Various metric-analysis tools support HDR-Histogram as one of their internal formats thanks to its compactness and accuracy.
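Here is a minimal sketch of recording latencies with the Java HdrHistogram library (the class name and sample values are made up for the example):

import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

public class HdrHistogramExample {
    public static void main(String[] args) {
        // Track values from 1 ns up to 1 minute with 3 significant decimal digits.
        Histogram histogram = new Histogram(TimeUnit.MINUTES.toNanos(1), 3);

        // Record a few observed latencies (in nanoseconds).
        histogram.recordValue(TimeUnit.MILLISECONDS.toNanos(2));
        histogram.recordValue(TimeUnit.MILLISECONDS.toNanos(15));
        histogram.recordValue(TimeUnit.MILLISECONDS.toNanos(250));

        // High-resolution percentiles, e.g. the 99.999th.
        System.out.println("99.999th percentile (ns): "
                + histogram.getValueAtPercentile(99.999));

        // Print the full percentile distribution, scaled to milliseconds.
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}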

Why do we need such accuracy?

In Figure 3, we see data covering 99.9999% of observations. One may ask whether we really need such accuracy, and whether we should look at the data beyond the 95th or 99th percentile at all. Let's consider two examples. Assume the probability of an abnormal latency for a single request, P(A), is 5% or 1% respectively. The question is: what is the probability P(B) that a user experiences at least one abnormal latency?

  • First, recall that a browser makes around 60 network requests to render google.com. Now consider an online store where buying an item takes 200 network requests. If P(A) = 5%, then P(B) = 1 − 0.95^200 = 99.997%. If P(A) = 1%, then P(B) = 1 − 0.99^200 = 86.6%. So the user will observe an abnormal latency with probability 99.997% or 86.6% respectively.
  • Second, let's imagine that we have 10 microservices and each microservice receives two requests during a particular workflow, so the client's action triggers 20 requests. If P(A) = 5%, then P(B) = 1 − 0.95^20 = 64.15%. If P(A) = 1%, then P(B) = 1 − 0.99^20 = 18.2%. So the user will observe an abnormal latency with probability 64.15% or 18.2% respectively (the snippet after this list reproduces the arithmetic).
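The same arithmetic as a small Java snippet (the class and method names are made up for illustration):

public class LatencyProbability {
    // P(B) = 1 - (1 - p)^n: the probability that at least one of n requests
    // experiences an abnormal latency, given per-request probability p.
    static double atLeastOneSlow(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        System.out.printf("p=5%%, n=200: %.3f%%%n", 100 * atLeastOneSlow(0.05, 200));
        System.out.printf("p=1%%, n=200: %.1f%%%n", 100 * atLeastOneSlow(0.01, 200));
        System.out.printf("p=5%%, n=20:  %.2f%%%n", 100 * atLeastOneSlow(0.05, 20));
        System.out.printf("p=1%%, n=20:  %.1f%%%n", 100 * atLeastOneSlow(0.01, 20));
    }
}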

Therefore, in most cases it is not enough to analyze the data only up to the 95th or 99th percentile.

How to use jHiccup

jHiccup can be downloaded from http://www.azul.com/downloads/jhiccup/ or https://github.com/giltene/jHiccup.

./jHiccup -d 4000 /usr/bin/java org.jhiccup.Idle -t 300000
# 4-second (4000 ms) delay before recording starts; total recording time 300 seconds.
# By default, results are logged every 5 seconds (this can be changed with -i).
# Creates the file hiccup.170617.1120.3461.hlog
./jHiccupLogProcessor -i hiccup.170617.1120.3461.hlog -o hiccup.170617.1120.3461
# Creates the files hiccup.170617.1120.3461 and hiccup.170617.1120.3461.hgrm

You can view the file hiccup.170617.1120.3461 using the Excel file jHiccupPlotter.xls from the jHiccup folder (Fig. 4, Fig. 5).

Fig. 4.
Fig. 5.

To plot *.hgrm files, we can use the online tool https://hdrhistogram.github.io/HdrHistogram/plotFiles.html. This tool is also useful for comparing several hgrm files (e.g. recorded under different system loads or on different servers) (Fig. 6).

Fig. 6.

When we compare our application's performance chart (e.g. HTTP response times) with the jHiccup chart, we can understand whether it was our application that was slow at a certain moment or the whole system.

Usually we start jHiccup as a separate process. Another way is to launch it as a javaagent together with our app:

java -javaagent:jHiccup.jar="-d 0 -i 1000 -l hiccuplog -c" MyProgram.jar -a -b -c

In this case, jHiccup wakes up and records latency information while the application is running.

There is one major difference between these two approaches. In the first case, jHiccup runs on a separate JVM; as a javaagent, it runs on the same JVM as our app. Therefore, in the second case we observe not only OS latencies but also the latencies of the JVM that runs our app (e.g. GC pauses).

It is possible to add SLA lines to the chart using the jHiccupPlotter.xls file (Fig. 7).

Fig. 7.

From my point of view, plotting the SLA may be useful in the following situations:

  • If you run jHiccup on the same JVM as your app and you know the maximum request-processing time (e.g. 100 ms) excluding OS and JVM latencies, you can simply add this value to the jHiccup chart. The resulting SLA can be provided to the clients/systems that use your system.
  • If you don't have access to the production system because it is hosted by a separate team or company, an SLA chart built with jHiccup can be used to verify the SLA promised by the hosting provider.

jHiccup is a handy tool for monitoring system "hiccups", including JVM and OS freezes. Unlike general system metrics, it lets us look at the system load from the application's perspective.