Lessons learned about monitoring the JVM in the era of containers
Quick and easy things you can do to monitor JVM when it is being used in containers
Java Virtual Machine (JVM) has been around for a long time (1994) so it’s hardly a surprise there are hundreds of tools to monitor its health and performance. We could use profiler tools (like VisualVM) to track memory leaks, CPU usage, and memory consumption but they require a direct connection to the JVM: this makes it hard to execute in a containerized production environment with autoscaling (like ours).
Other tools (like Universal GC Log Analyzer) expect to use GC log analysis to auto-detect GC problems and recommend solutions to it, but it requires to enable GC verbose logs and this might have a cost and performance impact in a production environment.
So we faced the challenge of finding a simple way to monitor JVM instances and detect problems in the regular use of our applications or during special occasions such as peak load events. We also wanted to group metrics from multiple container instances of the same type to get an overview and also to be able to compare different types at a glance.
GC compromise (latency vs throughput)
Latency represents the time that the user of the application waits for a response. In the specific case of GC tuning, this is the time that the JVM is paused because of garbage collection and cannot run the application. The duration of the garbage collection pause directly affects the way your application is perceived by end-users, so decreasing it results in a greater perception of responsiveness.
Throughput is the amount of time spent in productive work. Productive work is the time spent on processing customer transactions, non-productive work is the time spent on managing the application (es garbage collection). If we have evidence of low throughput, it means that the application is not optimally using resources.
The JVM garbage collectors can be configured to preferentially meet one between latency or throughput. If the preferred goal is met, the collectors will try to maximize the other.
Our containers run on Oracle JRE 8 where the default collector is Parallel GC. This GC is designed to be used by applications that need to do a lot of work (high throughput) and where long pause times are acceptable. This can cause latency, affecting the responsiveness of the application, not ideal in our case where our containers are API services so we switched to G1 Collector.
Garbage Collector G1 defaults have been balanced differently than either of the other collectors. G1’s goals in the default configuration are neither maximum throughput nor lowest latency, but to provide relatively small, uniform pauses at high throughput.
I think you should use G1 if you have no special needs, then tune to optimize performances. At the time of writing, we exclude new experimental GCs like Shenandoah GC in the production environment or other JVM like OpenJ9 (with Generational Concurrent policy) because we found no significant differences in our applications.
How to tune Latency and Throughput
To reduce latency, maximum pause time duration can be specified as a hint with the command-line option -XX:MaxGCPauseMillis=nnn (G1 default value is 200). In this case, the heap size and other parameters related to garbage collection are adjusted in an attempt to keep garbage collection pauses shorter than the specified value. These adjustments may cause the garbage collector to reduce the overall throughput of the application, and the desired pause time goal cannot always be met.
To improve throughput, the ratio of the time spent outside the garbage collection vs the time spent in the garbage collection can be specified with the command-line option -XX:GCTimeRatio=nnn (G1 default value is 19). The ratio of garbage collection time to application time is 1/ (1+nnn). The default value sets a goal of 5% of the total time for GC and throughput goal of 95%.
If the throughput goal isn’t being met, then one possible action for the garbage collector is to increase the size of the heap so that the time spent in the application between collection pauses can be longer. In a containerized application where only one java app runs, we need to adjust heap size and consequently the container memory in proportion 1/4 (https://docs.oracle.com/javase/9/gctuning/parallel-collector1.htm#JSGCT-GUID-74BE3BC9-C7ED-4AF8-A202-793255C864C4).
High throughput may accompany long, yet infrequent pause times.
Aggregate metrics to make them understandable
If you are monitoring an environment with many instances, placing the metrics as they are provided into a chart can lead to a very difficult to read the result.
One series per instance makes the graph unreadable, too much information, too much noise. For the same job, I suggest to pick the maximum value across all instances, to plot the worst-case scenario.
As we can see, trends and peaks are easier to track. Of course, this comes at a cost: this graph’s objective is to quickly detect high usage but it will not be useful to “debug” an issue.
Key performance indicators
So, with the default configuration, we can consider an acceptable throughput any value above 90/95%.
In our sample data, there are low variations in most applications.
The following graph shows a metric hinting that something is not working optimally:
This means there is an efficiency gain to be achieved here, we need to investigate! Or this type of graph it’s worth exploring the GC Frequency metric.
There are three types of algorithms for collectors:
- serial, which use a single thread to perform all garbage collection work, best suited to single processor machines for application with small data sets (up to 100MB)
- parallel, which use many threads to get the work done faster, to use when application performance is the priority
- concurrent, perform mostly expensive work concurrently to the application, best suited when response time is important
More info in https://docs.oracle.com/javase/9/gctuning/available-collectors.htm and here for info about new experimental GCs https://blogs.oracle.com/javamagazine/understanding-the-jdks-new-superfast-garbage-collectors
Do not underestimate GC type impact, as seen in the following real example where we changed from Parallel GC to G1 GC:
To get an idea of the impact of the GC pause and possible problems, it is useful to compare the GC pause metric with the metric of the duration of the requests.
The following graph shows a real case we encountered. We were plotting GC Duration for several different applications and found unexpected spikes.
From image, we see some spikes (yellow series) related to metric named “G1 Old Generation core-xadmin”. This means that the application is not healthy because this metric represents a full GC and ideally full GC should never happen in G1 because this type of collection performs in-place compaction of the entire heap. This might be very slow. The reason that a Full GC occurs is because the application allocates too many objects that can’t be reclaimed quickly enough. Often concurrent marking has not been able to complete in time to start a space-reclamation phase. The probability to run into a Full GC can be compounded by the allocation of many humongous objects. Due to the way these objects are allocated in G1, they may take up much more memory than expected.
Young and Mixed collections are counted in “G1 Young Generation”. G1 pauses the application during all three phases.
If GC pause time starts to increase, then it’s an indication that the app is suffering from memory problems.
For G1 you can specify the size of the young generation on the command line as with the other garbage collectors, but doing so may hamper the ability of G1 to attain the target pause time.
Frequency is how many times the garbage collector runs in a given amount of time. Collection frequency impacts throughput and can be related to object allocation/promotion rate and the size of the generation.
The metrics described are a first step in highlighting performance issues. To analyze in detail and understand how to act to carry out the tuning of the JVM it is necessary to enable the logs of the GC which provides a detailed overview during and out of the breaks on the activity of waste collection.