Observing Java 19 JVM optimization with JMH + hsdis + PerfASM: Holy trinity of low-level benchmarking — Part II

Martin Stypinski
Aug 8, 2023


The previous article in this mini-series focused on assembly code generation, using the hsdis library to intercept the output emitted by the Java JIT compiler. In this piece, we dive deeper into the execution platform: the physical CPU the code runs on.

Motivation

This article shows how to use JMH together with Perf. To provide more than an installation guideline, we dive into two versions of matrix multiplication and their impact on runtime. The algorithms of choice are a simple baseline implementation and a cache-optimized version. Cache congestion is easily detectable with Perf and is a good starting point for explaining the tool.

Installation

Installing Perf is straightforward on any Linux machine. In this case, an Ubuntu 22.04 LTS system was used, and the Perf tool suite can be installed through the package manager.

sudo apt install linux-tools-common linux-tools-$(uname -r)

Perf is a tool to monitor, among other things, hardware and software events such as cache loads, branches, or page faults triggered by the OS. By default, the kernel's 'perf_event_paranoid' level is set to 4 on Ubuntu, which is fine for any system as it keeps Perf from being too invasive on the host; however, it also means that only a minority of the important events can be recorded. For development purposes, we want as much information as possible and are willing to sacrifice these security considerations. Consequently, we set 'perf_event_paranoid' to -1:

echo "kernel.perf_event_paranoid = -1" | sudo tee /etc/sysctl.d/perf.conf
sudo sysctl --system   # apply immediately without a reboot

The easiest way is to set the parameter at OS startup: during a normal boot, all files in the '/etc/sysctl.d/' folder are evaluated, and Perf's 'perf_event_paranoid' is set for good. (Deleting the file restores the default configuration. Alternatively, 'sudo sysctl -w kernel.perf_event_paranoid=-1' changes the setting for the current session only.)

Methods

The selected algorithm is matrix multiplication, which is well suited to benchmark computational throughput: the amount of computation is high compared to the amount of memory reads and writes. A simple implementation runs in O(n^3) time while touching only O(n^2) matrix elements, so it involves much computation with a reasonably low amount of memory access.

High computational intensity, low memory intensity

With growing matrix sizes, memory access becomes an area worth optimizing. Large rows and columns contain enough floating-point numbers to exhaust the CPU's L1 cache. Consequently, this is the perfect example for investigating cache congestion on the larger matrix multiplications.

Baseline

The baseline case is the simplest implementation of the algorithm and is typical homework for a Programming 101 class: for each output element, it multiplies a row vector of the first matrix with a column vector of the second. In this example, the only difference is that the matrix is represented as a single-dimensional array. Consequently, the next column entry (A[0,0] -> A[0,1]) is at position i -> i+1, whereas the next row entry (A[0,0] -> A[1,0]) is found at i -> i+matrix_width. This results in the following mapping: A[x][y] => V[x*matrix_width+y] (for zero-indexed arrays).

2D matrix represented in a 1D array

Flattening a 2D array into a 1D array works well for square matrices. We skip a deeper discussion of the reasons for doing so to keep the main story of the article focused.
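
To make the index arithmetic concrete, here is a minimal sketch of the mapping; the helper name flatIndex and the 4x4 example are illustrative only and are not part of the benchmark code.

// Row-major mapping of a 2D coordinate onto a 1D array (zero-indexed).
static int flatIndex(int row, int col, int width) {
    return row * width + col;
}

// Example for a 4x4 matrix (width = 4):
// flatIndex(0, 1, 4) == 1  // next column entry: i -> i + 1
// flatIndex(1, 0, 4) == 4  // next row entry:    i -> i + matrix_width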

public float[] baseline(float[] a, float[] b, int n) {
    float[] c = new float[n * n];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

Blocked

This simple algorithm works well, especially for small matrix sizes. Its limitation is that, as the matrix grows, each row and column becomes longer, which can exhaust the CPU's fast L1 cache and degrade performance.

The hypothesis is that, by implementing the algorithm in a blocked way, the CPU can work on more cache-local data and therefore avoid the latency penalties of RAM access. Data stored in a cache is available almost immediately, while fetching data from main memory takes much longer. The blocked algorithm therefore tries to exploit as much cache-local data as possible and to reduce the number of memory round trips to improve overall performance.

Cache miss is expensive; we better avoid it!

The term blocked refers to slicing the matrix into sub-areas in which the data can be kept close to the ALU (Arithmetic Logic Unit). By computing sub-area by sub-area, performance should improve through increased cache locality and fewer random memory accesses.

An optimal block size depends on cache size and CPU architecture and can be determined by benchmarking the algorithm. In this case, a Xeon Gold 6212U was used, which has fairly large caches compared to some desktop counterparts. The block size yielding the best results was 16. Consequently, the smallest matrix width that can be run is 16, as smaller sizes cannot be handled by the provided implementation. Another limitation is that only multiples of the block size can be computed. Still, accepting these limitations keeps the code readable and is sufficient to prove the point.

public float[] blocked(float[] a, float[] b, int n, final int blocksize) {
    float[] c = new float[n * n];

    // Iterate block by block over the inner dimension (kk) and the columns (jj)
    // so that the touched parts of a, b and c stay cache-local.
    for (int kk = 0; kk < n; kk += blocksize) {
        for (int jj = 0; jj < n; jj += blocksize) {
            for (int i = 0; i < n; i++) {
                for (int j = jj; j < jj + blocksize; ++j) {
                    float sum = c[i * n + j]; // accumulate partial results across kk blocks
                    for (int k = kk; k < kk + blocksize; ++k) {
                        sum += a[i * n + k] * b[k * n + j];
                    }
                    c[i * n + j] = sum;
                }
            }
        }
    }
    return c;
}

Setup

The setup is a regular JMH + Java Gradle project with runtime parameters for Perf. A sample configuration could look like this (it depends on the setup and the goal of the profiling, so your preferences might differ):

jmh {
    iterations = 5 // Number of measurement iterations to do.
    benchmarkMode = ['sample'] // Benchmark mode. Available modes are: [Throughput/thrpt, AverageTime/avgt, SampleTime/sample, SingleShotTime/ss, All/all]
    batchSize = 1 // Batch size: number of benchmark method calls per operation. (some benchmark modes can ignore this setting)
    fork = 1 // How many times to fork a single benchmark. Use 0 to disable forking altogether.
    failOnError = true // Should JMH fail immediately if any benchmark experiences an unrecoverable error?
    forceGC = false // Should JMH force GC between iterations?
    jvm = "/usr/bin/java"
    humanOutputFile = project.file("${project.buildDir}/reports/jmh/human.txt") // Human-readable output file.
    resultsFile = project.file("${project.buildDir}/reports/jmh/results.txt") // Results file.
    operationsPerInvocation = 10 // Operations per invocation.

    // The perf profiler needs to be set up here:
    profilers = ['perfnorm:events=L1-dcache-loads,LLC-loads,LLC-load-misses,L2_RQSTS.ALL_DEMAND_DATA_RD,CYCLE_ACTIVITY.STALLS_TOTAL,CYCLE_ACTIVITY.STALLS_L1D_MISS,CYCLE_ACTIVITY.STALLS_L2_MISS,CYCLE_ACTIVITY.STALLS_L3_MISS,CYCLE_ACTIVITY.CYCLES_L1D_MISS,CYCLE_ACTIVITY.CYCLES_L2_MISS,CYCLE_ACTIVITY.CYCLES_L3_MISS'] // Use profilers to collect additional data. Supported profilers: [cl, comp, gc, stack, perf, perfnorm, perfasm, xperf, xperfasm, hs_cl, hs_comp, hs_gc, hs_rt, hs_thr, async]

    timeOnIteration = '30s' // Time to spend at each measurement iteration.
    resultFormat = 'CSV' // Result format type (one of CSV, JSON, NONE, SCSV, TEXT)
    synchronizeIterations = false // Synchronize iterations?
    timeUnit = 'ns' // Output time unit. Available time units are: [m, s, ms, us, ns].
    verbosity = 'NORMAL' // Verbosity mode. Available modes are: [SILENT, NORMAL, EXTRA]
    warmup = '1s' // Time to spend at each warmup iteration.
    warmupBatchSize = 10 // Warmup batch size: number of benchmark method calls per operation.
    warmupForks = 0 // How many warmup forks to make for a single benchmark. 0 to disable warmup forks.
    warmupIterations = 1 // Number of warmup iterations to do.
    warmupMode = 'INDI' // Warmup mode for warming up selected benchmarks. Warmup modes are: [INDI, BULK, BULK_INDI].
    warmupBenchmarks = ['.*Warmup'] // Warmup benchmarks to include in the run in addition to already selected. JMH will not measure these benchmarks, but only use them for the warmup.
    zip64 = true // Use ZIP64 format for bigger archives.
    jmhVersion = '1.29' // Specifies the JMH version.
    includeTests = false // Allows including test sources in the generated JMH jar, i.e. use it when benchmarks depend on the test classes.
    duplicateClassesStrategy = DuplicatesStrategy.EXCLUDE // Strategy to apply when encountering duplicate classes during creation of the fat jar (i.e. while executing the jmhJar task)
}
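
For completeness, the benchmark class wrapping the two implementations could look roughly like the following sketch. The class name, the parameter values, and the random initialization are assumptions for illustration and are not taken from the original project; the baseline and blocked methods from the listings above are assumed to be pasted into this class.

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class MatrixMultiplicationBenchmark {

    // Matrix widths to benchmark: powers of two from 16 (2^4) up to 2048 (2^11).
    @Param({"16", "32", "64", "128", "256", "512", "1024", "2048"})
    int size;

    float[] a;
    float[] b;

    @Setup(Level.Trial)
    public void setup() {
        // Fill both input matrices with pseudo-random values once per trial.
        Random random = new Random(42);
        a = new float[size * size];
        b = new float[size * size];
        for (int i = 0; i < size * size; i++) {
            a[i] = random.nextFloat();
            b[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float[] mmBaseline() {
        return baseline(a, b, size);
    }

    @Benchmark
    public float[] mmBlocked() {
        return blocked(a, b, size, 16);
    }

    // baseline(...) and blocked(...) are the two methods shown in the listings above.
}

With the Gradle configuration above, running the jmh task executes these benchmarks and attaches the perfnorm profiler with the configured event list.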

Results

The raw output of the benchmarks should look similar to the following table; since machines and hardware differ, your output will look different!

Benchmark                         size  Mode  Cnt      Score      Units
------------------------------------------------------------------------
mmBaseline                          32  avgt    5   4633.607      ns/op
mmBaseline:CPI                      32  avgt          0.250  clks/insn
mmBaseline:IPC                      32  avgt          4.004  insns/clk
mmBaseline:L1-dcache-load-misses    32  avgt          7.738       #/op
mmBaseline:L1-dcache-loads          32  avgt      16674.312       #/op
mmBaseline:L1-dcache-stores         32  avgt        435.131       #/op
mmBaseline:L1-icache-load-misses    32  avgt          0.996       #/op
mmBaseline:LLC-load-misses          32  avgt          0.035       #/op
mmBaseline:LLC-loads                32  avgt          0.081       #/op
mmBaseline:LLC-store-misses         32  avgt          3.760       #/op
mmBaseline:LLC-stores               32  avgt          3.910       #/op
mmBaseline:branch-misses            32  avgt          3.522       #/op
mmBaseline:branches                 32  avgt       4962.787       #/op
mmBaseline:cycles                   32  avgt      14377.633       #/op
mmBaseline:dTLB-load-misses         32  avgt          0.140       #/op
mmBaseline:dTLB-loads               32  avgt      16575.955       #/op
mmBaseline:dTLB-store-misses        32  avgt          0.071       #/op
mmBaseline:dTLB-stores              32  avgt        433.736       #/op
mmBaseline:iTLB-load-misses         32  avgt          0.014       #/op
mmBaseline:iTLB-loads               32  avgt          0.006       #/op
mmBaseline:instructions             32  avgt      57568.506       #/op

The output corresponds to the profiler parameters in the JMH start-up settings.

By aggregating the most important metrics and visualizing the data, we can have a deeper look at the cache access differences between both algorithms.

The first group of plots aggregates the data strictly in terms of runtime. The first sub-plot shows the overall runtime versus matrix size on a logarithmic scale. The second sub-plot shows the runtime per operation or element. The third sub-plot illustrates the speed-up, quantifying the performance uplift.

Runtime benchmarks on the simple and blocked matrix multiplication algorithm.

The second group of plots focuses on cache access, mainly the L1 cache, and the resulting performance. The square matrix size was increased incrementally from a row and column width of 16 up to 2048, always by a power of two (2⁴ to 2¹¹). (The smallest size, 16, is an implementation limitation of the blocked algorithm.) The second sub-plot shows the L1 cache accesses per element compared to L2, L3, and RAM accesses. The last sub-plot shows the ratio of L1 cache accesses in proportion to all other data lookups.

Deep dive into the caches of both algorithms.

Discussion

So, apart from learning how to install and use Perf in a Java JMH setup, the question remains: what did we learn from this experiment?

First of all, the results show that the simple implementation behaves according to its theoretical O(n^3) complexity, as expected. Cache congestion and latency penalties start to occur at a matrix width of 2⁸ to 2⁹. At this point, data needs to be loaded from L2, L3, or RAM, which increases latency noticeably and decreases computational throughput by stalling the ALU (busy wait), ultimately leading to performance degradation.
(Strictly speaking, modern processors with Hyper-Threading or SMT do not necessarily stall the ALU; the data flow through the ALU is switched onto another hardware thread's pipeline. On processors without HT, an ALU stall can occur. In the given case, the task itself gets stalled because its data is unavailable.)

Diving further into the raw data shows a spike in L3 cache accesses at matrix widths above 512 (2⁹). It is no coincidence that a performance degradation of almost a factor of two can be measured at this point. The results were gathered on an Intel Cascade Lake (Skylake+) platform. Intel's documentation reveals that an L1D cache load into a register takes 4 cycles, a best-case L2 cache load takes 12 cycles, and a best-case L3 cache load takes 44 cycles. (Further reading in the resource section.)
Loading data from the slower L3 cache therefore takes 11 times as long as loading it from the L1 cache, so the data had better be there! These numbers also explain why the higher L1 cache access ratio dramatically improves the blocked algorithm's performance. Yes, cache locality is still an important area of optimization for high-performing algorithms!

To conclude, not only have we established a meaningful toolkit with Java JMH + hsdis.so + Perf to investigate low-level JVM code generation, but we can also interpret the results and understand their consequences.

Conclusion

Although Java runs on the Java Virtual Machine and the Just-in-Time compiler is responsible for emitting the correct machine code for the platform, the same rules apply as for ahead-of-time compiled languages. Cache misses are still very expensive and should be avoided if performance is the most important aspect of a particular algorithmic implementation.

Although not written strictly as a tutorial, this article should provide developers who need to look under the JVM's hood with a toolbox to debug and profile certain effects. This is not a toolkit for daily use, but for those cases where an algorithm must be optimized for the utmost performance. It can also be used to understand what happens at the hardware level for a particular piece of code.

Acknowledgment

  • Christoph Amrein: For rearranging some pieces, adding background information, and improving the overall post quality.
  • Marc Juchli: For his valuable input in improving the readability and fixing some coherency issues.
  • Kirusanth Poopalasingam: For his valuable input in some key passages of the blog post.

Resources:
