Truly Final Optimization in Zing VM

Anna Thomas
Azul Systems
Sep 19, 2018

This blog post is a condensed version of the talk presented at JVM LS 2018[1].

Zing is Azul’s commercial flagship server-side JVM. It has a couple of key features, such as the C4 GC, an essentially pauseless collector, and ReadyNow technology, which helps eliminate warmup issues. And, as of early 2017, we productized Falcon, our LLVM-based top-tier JIT. All of the truly final optimizations are implemented for the Falcon JIT compiler.

Let’s start with what the Java Language Specification (JLS) says about final fields. The JLS has a number of rules for final fields. The main one is that a field declared final is initialized once and never changes under normal circumstances [2]. This lets the compiler do pretty aggressive optimizations, such as moving reads of final fields across synchronization barriers and across unknown calls. A final field can be cached in a register rather than reloaded from memory.

Next, the JVM Specification (JVMS) also covers final fields. First, if a field is final, it must be declared in the current class. Second, the putfield bytecode must occur in the <init> method of the class where the final field is declared, i.e., within the lexical scope of the <init> method [3]. However, VMs tend to have a more relaxed rule: the putfield bytecode can be in any method of that class, not just the <init> method. VMs relax the rule because they need to support real-world use cases such as deserialization. Third, every field has a field_info structure, and the ACC_FINAL flag in that structure is set for a final field.

There are two kinds of final fields. The first is the static final: the class variable itself is declared final, and it must be initialized in the static initializer (<clinit> method) of the defining class. The nice thing about static finals is that they do not need any dependency tracking mechanism. They are trusted by default, except for certain fields in the java.lang.System class [2].

The trickier kind is the instance final field: a non-static variable declared final. A blank final must be assigned in every constructor of the class where it is declared [4]. The reason instance finals are tricky is that they are not always final, even if declared so. Specifically, we can change the value of an instance final after construction in several ways: JNI, Unsafe, reflection, deserialization, and MethodHandles. Reflection and deserialization use Unsafe as the underlying mechanism. The default MethodHandles lookup mechanism does not allow modifying final fields, but the unreflectSetter API is specifically for modifying final fields.
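For instance, a plain reflective write (which bottoms out in Unsafe) is enough to overwrite a declared-final instance field. This is a minimal sketch with hypothetical class and field names, assuming a JDK that still permits reflective writes to ordinary (non-record) instance finals:

```java
import java.lang.reflect.Field;

// Hypothetical class: 'answer' is declared final and assigned in the constructor.
class Sealed {
    final int answer;
    Sealed(int v) { answer = v; }
}

public class FinalIsNotFinal {
    public static int mutate() {
        try {
            Sealed s = new Sealed(42);
            // Reflection can overwrite an instance final after construction,
            // despite the ACC_FINAL flag on the field.
            Field f = Sealed.class.getDeclaredField("answer");
            f.setAccessible(true);
            f.setInt(s, 7);
            return s.answer;  // a fresh read from memory now sees 7
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
```

A JIT that had already constant-folded the field to 42 would have to deoptimize at this point, which is exactly the dependency tracking problem discussed below.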

As we can see, for the most part we can JIT compile code with the truly final optimizations turned on. However, if the field is changed through one of the ways stated above, we need to deoptimize the code that depended on the field being final and recompile it with the truly final optimization turned OFF. In short, truly final optimization is a speculative (or heroic) optimization that needs its own dependency tracking mechanism.

Implementing Dependency Tracking in Zing VM

The first step is identifying finality, and we went with a class-wide truly final tracking mechanism: all final fields in a class are either trusted or untrusted together.

We have a truly final bit for each class, which is set during verification. During verification we parse the bytecodes of each method and check that they satisfy the JLS rule, which is that all non-static final fields must be assigned in a constructor of the class, i.e., within the lexical scope of the <init> method. If this rule is violated, the truly final bit of that class is set to false.

Fig 1: Dependency Tracking Mechanism

Fig 1 gives an idea of the process of dependency recording and invalidation. Let’s say that a method that uses a final field gets JIT compiled. The optimizer first checks that the class declaring the field has the truly final bit set and that the field is actually final. Next, we record the dependency, because the optimization is valid only as long as the dependency holds, and then we perform the optimization. When we register the method, we install the dependencies along with it, and the method gets installed.

At the same time, or at a later time, let’s say a final field gets modified. The first step is to set the truly final bit of the class containing the field to false, so that no later optimization uses a fact that is no longer true. Next, we check whether any installed methods depended on this field being final, and if so we deoptimize them; this is classic deoptimization, as used for things like CHA. Only after deoptimizing these methods do we write the field. There can also be a race between dependency recording and dependency invalidation, so just before installing the dependencies we check that the truly final bit of the class is still true. If it is not, we throw away the compile and start recompilation.
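The recording and invalidation steps above can be sketched roughly as follows; the class and method names here are illustrative, not Zing’s actual internals:

```java
import java.util.*;

// Illustrative sketch of class-wide truly-final dependency tracking.
// Real VMs key dependents by compiled-method handles, not strings.
public class TrulyFinalDeps {
    private final Map<Class<?>, Boolean> trulyFinalBit = new HashMap<>();
    private final Map<Class<?>, Set<String>> dependents = new HashMap<>();
    final List<String> deoptimized = new ArrayList<>();

    boolean isTrulyFinal(Class<?> k) {
        // The bit is set during verification; assume true until invalidated.
        return trulyFinalBit.getOrDefault(k, true);
    }

    // Called by the optimizer before relying on finality of a field in k.
    public boolean recordDependency(Class<?> k, String jitMethod) {
        if (!isTrulyFinal(k)) return false;  // optimization no longer allowed
        dependents.computeIfAbsent(k, x -> new HashSet<>()).add(jitMethod);
        return true;
    }

    // Called from hooks (Unsafe, JNI, reflection) BEFORE the final field
    // is actually written.
    public void invalidate(Class<?> k) {
        trulyFinalBit.put(k, false);  // block future optimizations first
        for (String m : dependents.getOrDefault(k, Collections.emptySet()))
            deoptimized.add(m);       // deopt dependents; then the write proceeds
        dependents.remove(k);
    }
}
```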

We added the deopt checks at the relevant hooks: Unsafe, JNI, and MethodHandles. At those hooks, we traverse the dependent methods to check whether each one had a truly final dependency. If even one method has a truly final dependency, we need to deoptimize that method. The key thing to note here is that the write must wait until the deoptimization finishes.

Truly Final Optimizations

The next part is the truly final optimizations we implemented for Falcon, our LLVM-based JIT compiler. LLVM is a set of compiler libraries that operate on an intermediate representation called LLVM IR [5]. LLVM IR is an SSA-based, typed IR.

In the VM, the C1 compiler acts as our lower-tier compiler and does profiling. Into this mix we add our bytecode front-end, which is essentially a parser that converts bytecode into LLVM IR. The LLVM optimizer lives outside the VM, and we pass the initial IR generated by the bytecode front-end into it.

Fig 2: VM With LLVM Optimizer

The LLVM optimizer tries to optimize the initial IR provided. In the process, it asks the VM certain questions, and the VM responds with answers. The optimizer uses this information to optimize the code further. It generates a final IR that is passed to the X86 backend and converted into assembly, and the method is then installed.

The final field optimizations are constant folding of compile-time constants and redundant load elimination of final fields; the latter works across non-aliasing stores, across synchronization barriers, and across unknown calls.

Let us start with the constant folding example shown in Fig 3. This is pseudo-code with JMH annotations. The TrulyFinal class has a final primitive i assigned the value 10. The benchmark we are interested in is test, which loads i from a static final object tf.

Fig 3: Constant Folding Example
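A minimal Java sketch of the benchmark shape described above (class and field names taken from the text, JMH annotations omitted):

```java
// Sketch of the Fig 3 benchmark shape; JMH annotations omitted.
class TrulyFinal {
    final int i = 10;  // instance final, assigned in <init>
}

public class ConstantFoldBench {
    // Static finals are trusted by default, so tf is a known constant root.
    static final TrulyFinal tf = new TrulyFinal();

    // Benchmark body: both tf and tf.i are constant, so the JIT can
    // fold the whole method down to 'return 10'.
    public static int test() {
        return tf.i;
    }
}
```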

The (pseudo) LLVM IR that the test method gets lowered into is:

%ref = load i8*, i8** @known_value
%addr = getelementptr i8, i8* %ref, i32 16
%i = load i32, i32* (cast<i8* to i32*> %addr)
ret i32 %i

Since LLVM IR is a typed IR, we have types such as i32, which represents a 32-bit integer. Our bytecode parser also tags compile-time constants, i.e., static final variables, as known_value. We access the field 16 bytes off the known_value, load i, and return the result.

The LLVM optimizer sees this known_value and the load following it, so it asks the VM: “Can we get the result of this known value load?” The VM answers “yes, it is 10.” So, the optimizer uses this result and simplifies the final IR to:

ret i32 10

The VM callback is a complete black box to the LLVM optimizer: the VM operates on the arguments passed in and records the truly final dependency we talked about earlier. In this case, the callback is GET_RESULT_OF_KNOWN_VALUE_LOAD(KV_ID, OFFSET, SIZE). In the callback, we retrieve the object corresponding to the KV_ID and locate the field at the specified OFFSET. After checking that the field is final (and recording the truly final dependency), we retrieve the constant value of that field.

The next example we look at is elimination of redundant loads of final fields within a constant region. In Fig 4, we have pseudo-code where the benchmark of interest is the test function. We have a load of an instance final object utf and a load of a final primitive var within it. We then reload the same var, but the two loads are separated by a store to a volatile variable k. Since volatiles introduce memory barriers, we have a strict ordering of loads and stores before and after the memory barrier.

Fig 4: Truly Final loads in Constant Region
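A minimal Java sketch of the benchmark shape described above (names assumed from the text, JMH annotations omitted):

```java
// Sketch of the Fig 4 benchmark shape; names are assumptions from the text.
class UTF {
    final int var = 10;  // truly final primitive
}

public class ConstantRegionBench {
    final UTF utf = new UTF();  // instance final reference
    volatile int k;             // volatile store introduces a memory barrier

    // The two loads of utf.var straddle the volatile store, yet finality
    // lets the optimizer prove the result is always 0.
    public int test() {
        int a = utf.var;
        k = 3;
        int b = utf.var;
        return a - b;
    }
}
```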

This is the test LLVM IR:

%ref = load i8*, i8** %addr_utf
%addr = getelementptr i8, i8* %ref, i32 16
%var = load i32, i32* (cast<i8* to i32*> %addr)
%kref = load i8*, i8** %addr_k
%addr_vol = getelementptr i8, i8* %kref, i32 816
fence
store i32 3, i32* (cast<i8* to i32*> %addr_vol)
%refdup = load i8*, i8** %addr_utf
%addr1 = getelementptr i8, i8* %refdup, i32 16
%tmp = load i32, i32* (cast<i8* to i32*> %addr1)
%res = sub i32 %var, %tmp
ret i32 %res

We have the load of the object utf, which is not a known_value. Then we load var at 16 bytes from that object. In LLVM IR, memory barriers are represented by the fence instruction. We store to the volatile k, then reload the object utf and var again, and subtract the two loads of var. Since both the object and the primitive loaded from it are truly final, we know that the result should be zero. How does the optimizer figure this out?

First, the optimizer checks that the object has already been initialized, and then asks the VM about the caller property, because the region we are interested in here is the caller, i.e., the test function. It asks the VM, “Is the ref field modified in the test function?” We pass in the type of the field, the offset, the size, and the caller. The VM responds, “No, it is not modified,” which means we are now allowed to get rid of the second load of that reference field, refdup. The IR gets reduced to:

%ref = load i8*, i8** %addr_utf
%addr = getelementptr i8, i8* %ref, i32 16
%var = load i32, i32* (cast<i8* to i32*> %addr)
%kref = load i8*, i8** %addr_k
%addr_vol = getelementptr i8, i8* %kref, i32 816
fence
store i32 3, i32* (cast<i8* to i32*> %addr_vol)
%addr1 = getelementptr i8, i8* %ref, i32 16
%tmp = load i32, i32* (cast<i8* to i32*> %addr1)
%res = sub i32 %var, %tmp
ret i32 %res

Then the optimizer recursively asks the VM about the second load of the primitive final var: “Is var modified in the test function?” The VM again responds, “No, it is not modified.” This identifies the second var load as redundant as well. We are finally left with code where all we have is a store to the volatile field and a return of zero, so we were able to get rid of four loads in the IR:

%kref = load i8*, i8** %addr_k
%addr_vol = getelementptr i8, i8* %kref, i32 816
fence
store i32 3, i32* (cast<i8* to i32*> %addr_vol)
ret i32 0

The VM callback for this optimization is IS_MODIFIED_IN_REGION(TYPE, OFFSET, SIZE, REGION); the REGION here is test. In the callback implementation on the VM side, we check whether REGION is within the dynamic scope of a constructor that can modify the class corresponding to TYPE. If it is not, we extract the field at that offset and check whether it is final. We then record the dependency and answer that, because it is a final field and the region is not a constructor, it is not modified in that region.
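The callback’s decision logic can be sketched roughly in Java. The signature below is illustrative only: it is keyed by field name rather than offset, and it omits the dependency recording a real VM would do here:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Illustrative sketch of the IS_MODIFIED_IN_REGION decision logic.
public class IsModifiedInRegion {
    public static class T { final int v = 1; int w; }

    // Returns true if the field of 'type' named 'fieldName' may be modified
    // inside 'region' (a method name); false lets the load be eliminated.
    public static boolean isModifiedInRegion(Class<?> type, String fieldName,
                                             String region) {
        // A constructor of the declaring class is allowed to write its finals.
        if (region.equals("<init>")) return true;
        try {
            Field f = type.getDeclaredField(fieldName);
            // Non-final fields can change anywhere; a final field outside a
            // constructor cannot (after the truly-final dependency is recorded).
            return !Modifier.isFinal(f.getModifiers());
        } catch (NoSuchFieldException e) {
            throw new AssertionError(e);
        }
    }
}
```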

Redundant load elimination across unknown calls follows along the same lines as the constant-region optimization. It uses the same callback, but the region we are interested in is the unknown callee. More details are available in the talk online.

Performance

Let us consider a real JMH microbenchmark and how it performs on the Zing VM with truly final optimization. In Fig 5, we see a final array, sumMeUpFinal, which is summed up in the benchmark sumFinal. Within the loop in sumFinal, there is a volatile load of mask in every iteration. We also have the null check on the final array and the invariant load of the length of the sumMeUpFinal array. On a VM without truly final optimization, the null check and the length load must happen on every iteration of the loop. Since the Zing VM supports truly final optimization, we can hoist the null check and the array-length load out of the loop through loop-invariant code motion and get rid of those instructions within the loop.

Fig 5: JMH Truly Final Example
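A minimal Java sketch of the benchmark shape described above (names from the text; JMH annotations, array size, and contents are assumptions):

```java
// Sketch of the Fig 5 benchmark shape; JMH annotations omitted.
public class SumFinalBench {
    static final int N = 1024;               // assumed size
    final int[] sumMeUpFinal = new int[N];   // truly final array reference
    volatile int mask = -1;                  // volatile load inside the loop

    SumFinalBench() {
        for (int i = 0; i < N; i++) sumMeUpFinal[i] = i;
    }

    // Without truly-final, the null check and array-length load repeat on
    // every iteration (the volatile load blocks hoisting); with it, both
    // are loop-invariant and get hoisted.
    public int sumFinal() {
        int sum = 0;
        for (int i = 0; i < sumMeUpFinal.length; i++)
            sum += sumMeUpFinal[i] & mask;
        return sum;
    }
}
```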

When run on a Skylake Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz, there is a 2.5x gain with Zing 18.04 compared to Zing with the truly final optimizations turned off.

We also ran experiments to see how the truly final optimizations perform on real-world applications. Fig 6 shows the results on Skylake and Broadwell machines over 90 benchmarks (the dots on the graph represent the benchmarks). The benchmarks are SPECjvm, DaCapo, Kafka, and certain application benchmarks. The baseline at 100% represents the truly final optimizations turned off. The green dots represent the gains we see with truly final optimizations turned on. The blue and green dots close to the baseline are essentially run-to-run variation. The key point here is that around 27% of the benchmarks have gains between 2-11% with truly final optimizations turned on.

Fig 6: Performance with Truly Final Optimizations

In summary, final non-static fields are not always final, and because of this we need a dependency tracking and invalidation mechanism. We have optimizations for final fields, implemented for both reference and primitive finals. We see non-trivial gains across microbenchmarks and real-world applications with truly final optimizations, and the gains we see usually come from constant folding and better aliasing.

References

[1] https://www.youtube.com/watch?v=2HfnaXND7-M&frags=pl%2Cwn

[2] https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.5

[3] https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.putfield

[4] https://docs.oracle.com/javase/specs/jls/se7/html/jls-8.html#jls-8.3.1.2

[5] https://llvm.org
