Java Series: Basics Q&A Part 6

Thank God it is the weekend, time to catch up with this series!

Q24. How to generate a Java class at runtime?

Since we’ve already discussed the Java class loader and the parent delegation model, this time we’ll dig a bit deeper into the Java runtime.

The most common way to create a class for use at runtime is to write Java source, compile it into a .class file with javac, and load that file into the JVM. Another way is to spawn a javac subprocess at runtime, compile another Java file, and load the result with a class loader.
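
As a rough sketch of the second approach, the example below compiles a source file at runtime and loads the result. Note the swap: instead of spawning a javac subprocess it uses the in-process javax.tools.JavaCompiler API, which does the same job without a child process. The class name RuntimeGenerated and its source text are made up for this illustration, and the code assumes it runs on a full JDK (on a plain JRE the compiler is not available).

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RuntimeCompileDemo {
    // Hypothetical class name and source, chosen only for this example.
    static final String CLASS_NAME = "RuntimeGenerated";
    static final String SOURCE =
        "public class RuntimeGenerated {\n"
      + "    @Override public String toString() { return \"compiled at runtime\"; }\n"
      + "}\n";

    /** Writes the source to a temp directory, compiles it, loads it and calls toString(). */
    static String compileLoadAndCall() {
        try {
            Path dir = Files.createTempDirectory("runtime-compile");
            Path file = dir.resolve(CLASS_NAME + ".java");
            Files.write(file, SOURCE.getBytes(StandardCharsets.UTF_8));

            // getSystemJavaCompiler() returns null when no compiler is available.
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler; run on a JDK");
            }
            int exitCode = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (exitCode != 0) {
                throw new IllegalStateException("Compilation failed");
            }

            // Load the freshly compiled class from the temp directory.
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Object instance = loader.loadClass(CLASS_NAME).getDeclaredConstructor().newInstance();
                return instance.toString();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compileLoadAndCall());
    }
}
```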

What else can we do?

As we discussed in the JVM section, the JVM only cares whether the byte code it is fed is valid. The byte code doesn’t even have to be something ordinary Java source could produce: javac rejects [boolean a = 10;], but you can write the equivalent directly in byte code and the JVM is totally fine with it, because a boolean on the stack is handled as an integer type.

This kind of byte code manipulation was regarded as magic a long time ago, but it has become a common tool for infrastructure engineers who need to inject extra procedures into application code.

Let’s recap how byte code is turned into a Java class. ClassLoader implements it through the following 2 methods:

protected final Class<?> defineClass(String name, byte[] b, int off, int len, ProtectionDomain protectionDomain);
protected final Class<?> defineClass(String name, java.nio.ByteBuffer b, ProtectionDomain protectionDomain);
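
To make this concrete, here is a minimal sketch that hand-writes the bytes of an empty class and passes them to defineClass. The names DefineClassDemo, BytesClassLoader and HandWritten are hypothetical. The generated class has no constructor or methods, so it can be defined and inspected but not instantiated.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DefineClassDemo {
    /** Exposes the protected defineClass method so the example can call it. */
    static final class BytesClassLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            // A null ProtectionDomain means the default domain is used.
            return defineClass(name, bytes, 0, bytes.length, null);
        }
    }

    /** Hand-writes the class file of: public class <name> extends Object, with no members. */
    static byte[] emptyClassBytes(String internalName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(0xCAFEBABE);                            // magic number
            out.writeShort(0);                                   // minor version
            out.writeShort(52);                                  // major version 52 = Java 8
            out.writeShort(5);                                   // constant pool count = entries + 1
            out.writeByte(7); out.writeShort(2);                 // #1 Class, name at #2
            out.writeByte(1); out.writeUTF(internalName);        // #2 Utf8
            out.writeByte(7); out.writeShort(4);                 // #3 Class, name at #4
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #4 Utf8
            out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                                   // this_class = #1
            out.writeShort(3);                                   // super_class = #3
            out.writeShort(0);                                   // interfaces count
            out.writeShort(0);                                   // fields count
            out.writeShort(0);                                   // methods count
            out.writeShort(0);                                   // attributes count
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static Class<?> defineEmptyClass(String name) {
        return new BytesClassLoader().define(name, emptyClassBytes(name.replace('.', '/')));
    }

    public static void main(String[] args) {
        Class<?> generated = defineEmptyClass("HandWritten");
        System.out.println(generated.getName() + " extends " + generated.getSuperclass().getName());
    }
}
```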

Basically, any valid byte array, whether it comes from local disk or from the network, can be loaded as a class. JDK dynamic proxy is implemented in a similar way: it generates byte code with ProxyGenerator, keeps it as a byte[] and calls Unsafe.defineClass. Generating a class from hard coded byte code is fairly complicated:

private void codeLocalLoadStore(int lvar, int opcode, int opcode_0,
                                DataOutputStream out)
    throws IOException
{
    assert lvar >= 0 && lvar <= 0xFFFF;
    // Emit the opcode in a different format depending on the variable index
    if (lvar <= 3) {
        out.writeByte(opcode_0 + lvar);
    } else if (lvar <= 0xFF) {
        out.writeByte(opcode);
        out.writeByte(lvar & 0xFF);
    } else {
        // Use the wide instruction modifier if the variable index
        // does not fit in an unsigned byte
        out.writeByte(opc_wide);
        out.writeByte(opcode);
        out.writeShort(lvar & 0xFFFF);
    }
}

You’d need a deep understanding of JVM byte code to write that yourself. Luckily, the community already offers a number of libraries for working with byte code, for example ASM, a copy of which is also bundled inside the JDK. ASM uses the visitor pattern and lets you traverse the metadata inside a .class file. An example:

ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
cw.visit(V1_8,                               // Java version
        ACC_PUBLIC,                          // the class is public
        "com/mycorp/HelloProxy",             // package and class name
        null,                                // signature; null means the class is not generic
        "java/lang/Object",                  // superclass
        new String[]{ "com/mycorp/Hello" }); // interfaces to implement

Byte code manipulation is used in a lot of domains: mocking frameworks, ORM frameworks, IoC frameworks, profiler tools, etc. Byte Buddy, mentioned in another post, is also a common tool for working with byte code.
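
The easiest way to see a class generated at runtime is the JDK dynamic proxy mentioned earlier. The sketch below wraps a target object in a proxy that records each call before delegating. The Hello interface is a stand-in named after the one in the ASM snippet, and loggingProxy is a hypothetical helper.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

class DynamicProxyDemo {
    /** Hypothetical interface used only for this example. */
    interface Hello {
        String sayHello(String name);
    }

    /** Returns a runtime-generated implementation of Hello that logs and then delegates. */
    static Hello loggingProxy(Hello target, StringBuilder log) {
        InvocationHandler handler = (proxy, method, args) -> {
            log.append("before ").append(method.getName()).append(';');
            return method.invoke(target, args);
        };
        return (Hello) Proxy.newProxyInstance(
                Hello.class.getClassLoader(), new Class<?>[] { Hello.class }, handler);
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        Hello hello = loggingProxy(name -> "Hello, " + name, log);
        System.out.println(hello.sayHello("Java"));
        System.out.println(log);
        // The proxy class did not exist in any .class file; the JDK generated it at runtime.
        System.out.println(hello.getClass().getName());
    }
}
```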

Q25. What does the JVM memory layout look like, and in which regions can OOM happen?

JVM has the following memory regions:

  1. PC (program counter register). It stores the address of the JVM instruction the thread is currently executing, and is undefined while a native method runs.
  2. JVM stack. Each thread has its own JVM stack, which holds one stack frame per Java method invocation.
  3. JVM heap. Java objects are stored here, and the heap is shared by all Java threads in the JVM. Its maximum size is configured through the -Xmx flag.
  4. Method area. Also shared by all threads; it stores metadata such as class structure, constants and methods.
  5. Runtime constant pool. Part of the method area; it stores Java constant information.
  6. Native method stack. Each thread has one as well.

Below is a diagram of the regions for a JVM:

In reality, we need to pay attention to:

  1. Direct memory lives outside the heap, so the GC does not manage or collect it directly. It is reclaimed only as a side effect of collecting the buffer objects that own it, for example when System.gc() is invoked explicitly.
  2. The JVM itself is a native process and also needs memory for JIT compilation, GC and so on.

Back to the original question, where would OOM happen?

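OOM basically means the JVM doesn’t have enough free memory and the garbage collector can’t reclaim any more either. So before an OutOfMemoryError is thrown, the JVM usually tries to free space first, for example:

  1. Collect objects that are only softly reachable (SoftReference).
  2. Call System.gc(), as java.nio.Bits.reserveMemory() does when direct memory runs short.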

When OOM does happen, it usually happens in one of these places:

  1. Heap memory. This is the most common case and throws “java.lang.OutOfMemoryError: Java heap space”.
  2. The Java stack or native method stack. Infinite recursion, for example, leads to a StackOverflowError, and if the JVM fails to expand a stack it throws an OutOfMemoryError.
  3. PermGen, in older JDK versions. Its size is limited and the JVM does not collect it actively, so if we keep adding new types dynamically or intern too many strings, it can throw “OutOfMemoryError: PermGen space”.
  4. Metaspace. Since the introduction of Metaspace, the pressure on the method area is alleviated, and the error is now normally “OutOfMemoryError: Metaspace”.
  5. Direct memory, which can also throw OOM.
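
The stack case is the only one that is safe to reproduce in a few lines, so here is a small sketch of it. StackDepthDemo is a made-up name. The depth it reports depends on the JVM, the thread stack size (-Xss) and the JIT, so treat the number as illustrative only.

```java
class StackDepthDemo {
    private static int depth;

    /** Recurses without a termination condition, counting the frames it pushes. */
    private static void recurse() {
        depth++;
        recurse();
    }

    /** Recurses until the thread stack is exhausted and returns the depth reached. */
    static int measureDepth() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // By the time we get here the stack has unwound, so it is safe to continue.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("StackOverflowError after " + measureDepth() + " frames");
    }
}
```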

Q26. How to monitor & diagnose JVM memory usage?

There are a few ways to understand JVM memory usage:

  1. Use a visualization tool such as JConsole or VisualVM.
  2. Use jstat or jmap to check heap, method area and other usage data.
  3. Use jmap to get a heap dump, then analyze it further with jhat or Eclipse MAT.
  4. Read the GC log.

Note that off-heap memory can’t be inspected with the tools mentioned above. For that you can try Native Memory Tracking (NMT); note that, as mentioned before, NMT introduces 5% to 10% overhead at runtime.
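
The per-pool numbers that the visual tools display can also be read from inside the process through the standard java.lang.management API. The sketch below lists each memory pool with its used and committed size. MemoryPoolsDemo and describePools are hypothetical names, and the pool names in the output depend on the JVM and the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class MemoryPoolsDemo {
    /** Returns one line per memory pool: kind, name, used and committed size. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            lines.add(String.format("%-8s %s used=%dKB committed=%dKB",
                    kind, pool.getName(), usage.getUsed() / 1024, usage.getCommitted() / 1024));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
    }
}
```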

We already showed the overall structure of the JVM memory space. Now let’s dive deeper into the heap region:

Java heap can be categorized into the following sections:

  1. Young Generation

The Young Generation is where most objects are created and destroyed. In typical Java applications, most objects have a fairly short life span. The Young Generation is divided further into Eden, used for initial object allocation, and Survivor, used to store objects that survived a minor GC. Virtual is the uncommitted space of each generation, used for expanding the different regions.

As seen in the graph, there are two survivor regions. During a minor GC, the JVM uses the currently empty one as the “to” region and copies the objects that survived in Eden, together with the live objects of the other (“from”) survivor, into it to avoid memory fragmentation.

Inside Eden region, Hotspot JVM has something called TLAB (Thread Local Allocation Buffer):

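A TLAB is a private allocation buffer that the JVM carves out of Eden for each thread, so a thread can allocate without contending with other threads. Objects are allocated by advancing the top pointer; when top meets end, the thread asks Eden for more space.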
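
The pointer arithmetic is easier to follow in code. The class below is a toy model of a single thread's buffer, not HotSpot's actual implementation: TlabSketch, its fields and the integer "addresses" are all invented for illustration.

```java
class TlabSketch {
    private final int tlabSize;
    private int edenTop;   // next free offset in the (toy) shared Eden
    private int start;     // first offset of the current thread-private buffer
    private int top;       // next free offset inside the buffer
    private int end;       // one past the last offset of the buffer
    private int refills;   // how many times a new buffer was requested after the first

    TlabSketch(int tlabSize) {
        this.tlabSize = tlabSize;
        takeNewBuffer();
    }

    /** Carves a fresh buffer out of Eden. In a real JVM this step needs synchronization. */
    private void takeNewBuffer() {
        start = edenTop;
        top = start;
        end = start + tlabSize;
        edenTop = end;
    }

    /** Bump-pointer allocation: returns the offset of the new object. */
    int allocate(int size) {
        if (size > tlabSize) {
            throw new IllegalArgumentException("object does not fit in a TLAB");
        }
        if (top + size > end) {   // top would pass end: ask Eden for more space
            takeNewBuffer();
            refills++;
        }
        int address = top;
        top += size;              // no lock needed, the buffer belongs to one thread
        return address;
    }

    int refills() {
        return refills;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(100);
        System.out.println(tlab.allocate(40)); // 0
        System.out.println(tlab.allocate(40)); // 40
        System.out.println(tlab.allocate(40)); // 100, served from a new buffer
    }
}
```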

  2. Old Generation

The Old Generation stores objects with a long life span. Objects are normally copied over from the Survivor region. The exception is an object too large for Eden to hold, which is allocated directly in the Old Generation.

  3. Permanent Generation

Used to store Java metadata, constants and interned strings. It was removed in JDK 8 and replaced by Metaspace!

A few common JVM configurations:

  • -Xmx value : Max heap size
  • -Xms value : Min (initial) heap size
  • -XX:NewRatio=value : Ratio between OG and YG size.

If Xms is smaller than Xmx, the heap does not start at its maximum size; instead, the reserved space is larger than the committed space. As memory demand increases, the JVM gradually expands the heap into the Virtual space mentioned above.
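
You can observe the gap between reserved and committed heap from inside a program with the Runtime API. In this sketch, maxMemory() roughly corresponds to the -Xmx limit and totalMemory() to the heap committed so far; HeapSizeDemo is a made-up name and the printed values depend entirely on your flags and machine.

```java
class HeapSizeDemo {
    /** Returns { max, committed, free } heap sizes in bytes. */
    static long[] snapshot() {
        Runtime runtime = Runtime.getRuntime();
        return new long[] {
            runtime.maxMemory(),    // upper limit the heap may grow to
            runtime.totalMemory(),  // heap currently committed
            runtime.freeMemory()    // unused part of the committed heap
        };
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        long[] heap = snapshot();
        System.out.println("max       = " + heap[0] / mb + "MB");
        System.out.println("committed = " + heap[1] / mb + "MB");
        System.out.println("free      = " + heap[2] / mb + "MB");
    }
}
```

Running it once with -Xms64m -Xmx512m and once with -Xms512m -Xmx512m should show the committed value starting lower in the first case.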

Let’s take a look at a JMC memory usage graph:

In order to use NMT to analyze JVM, we can set up the following JVM flag:

-XX:NativeMemoryTracking=summary

Also, add the following flags to print the NMT statistics when the JVM exits:

-XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics

You’ll see this:

$ java -XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics Foo
Hello, Java!
Native Memory Tracking:
Total: reserved=5704847KB, committed=460923KB
- Java Heap (reserved=4194304KB, committed=262144KB)
(mmap: reserved=4194304KB, committed=262144KB)
-                     Class (reserved=1066111KB, committed=14207KB)
(classes #429)
(malloc=9343KB #156)
(mmap: reserved=1056768KB, committed=4864KB)
-                    Thread (reserved=19535KB, committed=19535KB)
(thread #19)
(stack: reserved=19456KB, committed=19456KB)
(malloc=57KB #105)
(arena=22KB #38)
-                      Code (reserved=249644KB, committed=2580KB)
(malloc=44KB #326)
(mmap: reserved=249600KB, committed=2536KB)
-                        GC (reserved=163627KB, committed=150831KB)
(malloc=10383KB #129)
(mmap: reserved=153244KB, committed=140448KB)
-                  Compiler (reserved=133KB, committed=133KB)
(malloc=2KB #32)
(arena=131KB #3)
-                  Internal (reserved=9452KB, committed=9452KB)
(malloc=9420KB #1418)
(mmap: reserved=32KB, committed=32KB)
-                    Symbol (reserved=1374KB, committed=1374KB)
(malloc=918KB #89)
(arena=456KB #1)
-    Native Memory Tracking (reserved=41KB, committed=41KB)
(malloc=4KB #44)
(tracking overhead=37KB)
-               Arena Chunk (reserved=626KB, committed=626KB)
(malloc=626KB)

The first section is the Java heap and the second section is the space used for class metadata. This can be configured with

-XX:MaxMetaspaceSize=value

Next is Thread. As you can tell, even a simple HelloWorld program has 19 threads, because the GC and the JIT compiler run their own threads.

Then we have the Code section, which covers the CodeCache used to store the code produced by the JIT compiler. The JVM also provides some flags for that:

-XX:InitialCodeCacheSize=value
-XX:ReservedCodeCacheSize=value

The Compiler section is the memory cost of the JIT compiler itself. The Internal section accounts for Direct Buffers and other internal allocations.

Q27. What are the common Java GC choices?

Here are the common ones:

  • Serial GC. The oldest collector. It is single-threaded and enters a Stop-The-World state while collecting. It is the simplest GC implementation and cheap to initialize, which is why it is the default option for the client JVM. It uses the Mark-Compact algorithm for the old generation.
-XX:+UseSerialGC
  • ParNew GC. A Young Generation collector that is effectively a multithreaded version of Serial GC. It is normally used together with CMS GC.
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
  • CMS (Concurrent Mark Sweep) GC. This collector is based on the Mark & Sweep algorithm and aims to reduce pause time. However, Mark & Sweep causes memory fragmentation, so a long-running application can hardly avoid a full GC eventually. Also, because it runs concurrently, it sometimes competes with user threads for CPU resources.
  • Parallel GC. Both Young and Old generation collections run in parallel. It is the default GC choice for the server JVM in JDK 8 and earlier.
-XX:+UseParallelGC

You can tune Parallel GC with the following options, and the JVM adapts accordingly.

-XX:MaxGCPauseMillis=value
-XX:GCTimeRatio=N // fraction of total time spent in GC = 1 / (N+1)

  • G1 GC. G1 takes both throughput and latency into account and has been the default GC since JDK 9.
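
To check which of these collectors a running JVM actually uses, you can ask the standard GarbageCollectorMXBean API. This is a small sketch with a hypothetical class name. The reported names vary by JVM and flags; with G1 on HotSpot you would typically see names such as "G1 Young Generation" and "G1 Old Generation".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ActiveCollectorsDemo {
    /** Returns the names of the collectors registered in this JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated collection time in milliseconds so far.
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Try it with -XX:+UseSerialGC, -XX:+UseParallelGC and the default to compare.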

How does GC work?

Object GC:

  • Reference counting. If the reference count of an object is 0, it can be recycled. Java does not adopt this algorithm because it can’t handle reference cycles.
  • Reachability analysis. This is also called tracing garbage collection. It treats objects and their references as a graph and picks certain active objects as GC roots. If an object can’t be reached from the GC roots by following the reference chain, it will be recycled.
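
The difference between the two approaches shows up as soon as there is a cycle. The sketch below is a toy marking pass over a hand-built object graph; Node, refs and mark are invented names and have nothing to do with HotSpot internals. Two nodes that reference each other keep a non-zero reference count forever, yet tracing from the roots never reaches them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReachabilitySketch {
    /** A toy object with outgoing references. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Marks every node reachable from the given roots; everything else is garbage. */
    static Set<Node> mark(Collection<Node> roots) {
        Set<Node> marked = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (marked.add(node)) {       // first visit: follow its references
                pending.addAll(node.refs);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node a = new Node("a");
        Node b = new Node("b");
        Node c = new Node("c");
        root.refs.add(a);
        b.refs.add(c);   // b and c only reference each other:
        c.refs.add(b);   // a cycle that reference counting could never free

        Set<Node> live = mark(Collections.singletonList(root));
        for (Node n : new Node[] { root, a, b, c }) {
            System.out.println(n.name + (live.contains(n) ? " is live" : " is garbage"));
        }
    }
}
```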

Metaspace GC:

For GC in Metaspace, a class can only be unloaded once its class loader has been collected, which needs attention in scenarios with a lot of dynamically generated classes.

GC algorithms:

  • Copying algorithm. This is used for Young Generation GC. It copies live objects to a survivor region to avoid memory fragmentation. However, copying needs extra memory that is otherwise wasted.
  • Mark & Sweep. It first marks the objects that need to be recycled and then sweeps them. However, this is not very efficient and can’t avoid memory fragmentation.
  • Mark & Compact. Like Mark & Sweep, but it moves the live objects together while cleaning up, which avoids fragmentation.
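
A toy heap makes the fragmentation difference visible. In the sketch below the heap is just an array of slots and the mark phase is assumed to have already produced the live flags; CompactionSketch and its methods are hypothetical and only model the effect of each algorithm.

```java
import java.util.Arrays;

class CompactionSketch {
    /** Mark & Sweep: dead slots are freed in place, leaving holes between live objects. */
    static String[] sweep(String[] heap, boolean[] live) {
        String[] result = heap.clone();
        for (int i = 0; i < result.length; i++) {
            if (!live[i]) {
                result[i] = null;
            }
        }
        return result;
    }

    /** Mark & Compact: live objects slide to the front, free space ends up contiguous. */
    static String[] compact(String[] heap, boolean[] live) {
        String[] result = new String[heap.length];
        int next = 0;
        for (int i = 0; i < heap.length; i++) {
            if (live[i]) {
                result[next++] = heap[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] heap = { "A", "B", "C", "D" };
        boolean[] live = { true, false, true, false };
        System.out.println(Arrays.toString(sweep(heap, live)));   // [A, null, C, null]
        System.out.println(Arrays.toString(compact(heap, live))); // [A, C, null, null]
    }
}
```

After the sweep, a two-slot object would not fit even though two slots are free; after compaction it would.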

GC Procedures:

  1. The application keeps creating objects, which are normally allocated in Eden, and a minor GC is triggered when Eden reaches its threshold. Objects that are still referenced (green boxes) survive and are copied to a Survivor region; the others (yellow boxes) are recycled.

  2. After the minor GC, Eden is empty again. When the next minor GC is triggered, the other survivor region becomes the To region. Live objects in Eden and in S0 are copied to To, and their survivor age is incremented by 1.

  3. Step 2 keeps repeating until some objects reach the age threshold and are promoted to the Tenured region. This threshold can be set via:

-XX:MaxTenuringThreshold=<N>

Then comes the Old Generation GC, for which different algorithms exist. Here is the Mark & Compact algorithm: after removing recyclable objects, it rearranges the remaining objects to avoid memory fragmentation:

We normally call an Old Generation GC a Major GC, and a GC of the whole heap a Full GC.

Q28. How do you tune GC?

GC tuning normally has 3 objectives: footprint, latency and throughput. Most scenarios focus on 1 or 2 of them. A summary of how to tune GC:

  1. Understand the application and its requirements, and decide which objectives we are targeting.
  2. Understand the state of the JVM and GC, locate the problem, and decide whether GC tuning is needed at all.
  3. Choose the type of GC.
  4. Tweak parameters and hardware configuration based on the analysis.
  5. Validate that the targets are hit.

How does G1 GC work?

G1 also has the concept of generations. However, its memory structure is quite different from the earlier collectors: the heap is composed of blocks (regions):

Each block has a size from 1M to 32M (always a power of 2). Some blocks are Eden and some are Survivor. G1 treats objects larger than 50% of a block as Humongous objects and puts them in Humongous blocks.
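
The two rules in this paragraph are simple enough to write down as code. This is only an illustration of the arithmetic, with invented names, and says nothing about how G1 picks the block size for a given heap (it can be set explicitly with -XX:G1HeapRegionSize).

```java
class G1RegionSketch {
    static final long MB = 1024 * 1024;

    /** A block size is valid if it is a power of 2 between 1M and 32M. */
    static boolean isValidRegionSize(long bytes) {
        return bytes >= MB && bytes <= 32 * MB && Long.bitCount(bytes) == 1;
    }

    /** An object is Humongous if it is larger than half of a block. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        if (!isValidRegionSize(regionBytes)) {
            throw new IllegalArgumentException("invalid block size: " + regionBytes);
        }
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        System.out.println(isHumongous(600 * 1024, 1 * MB)); // true: 600K is more than half of 1M
        System.out.println(isHumongous(600 * 1024, 4 * MB)); // false: a larger block avoids it
    }
}
```

This is why applications that allocate many large arrays are sometimes tuned with a bigger block size: fewer allocations cross the 50% line.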

G1 is still a mixed GC algorithm:

  • For the Young Generation, G1 still uses a parallel copying algorithm, which causes a Stop-The-World pause as well.
  • For the Old Generation, G1 uses concurrent marking, and compaction happens together with Young Generation GC.

A state machine of G1:

One key concept of G1 is the Remembered Set, which records and maintains the references between regions (blocks). It ensures that references between different generations are visible:

The Remembered Set typically takes 20% or more of the heap size. Some characteristics of G1:

  • G1 records the object references between Old Generation regions. Since the number of Humongous objects is limited, G1 can quickly tell whether any object in the Old Generation refers to one. If there is no such reference, the only thing that can keep a Humongous object alive is a reference from the Young Generation, and that information can be retrieved during Young Generation GC. As a result, collecting Humongous objects doesn’t have to wait for concurrent marking.
  • During a Young GC, G1 puts newly allocated String objects into a queue and deduplicates the queue concurrently after the Young GC. This can be activated through:
-XX:+UseStringDeduplication

This saves a lot of memory but introduces some overhead for CPU.

  • G1 unloads classes after concurrent marking instead of waiting for a full GC.

A few general pieces of GC tuning advice:

  1. UPGRADE TO LATEST JDK.
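  2. Understand how to collect GC tuning information:
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintAdaptiveSizePolicy // Print G1 ergonomics information
-XX:+PrintReferenceGC // Print reference processing details, to debug references that are not cleaned up
-XX:+ParallelRefProcEnabled // Enable parallel reference processing if reference processing is slow

In JDK 9 and later, GC logging goes through unified logging. Check the available options with: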

java -Xlog:help

  3. If Young GC takes too long, that usually means the Young Generation is too big. Try adjusting:

-XX:G1NewSizePercent
-XX:G1MaxNewSizePercent

  4. If Mixed GC has long latency:

As mentioned above, old regions are included in a Mixed GC, so collecting fewer old regions per Mixed GC reduces the latency. This is controlled by:

-XX:G1MixedGCCountTarget

Q29. What is happen-before in the Java memory model?

Happen-before is the mechanism in the Java Memory Model (JMM) that guarantees the visibility of operations in a concurrent environment. It covers far more than synchronized, volatile and lock operations:

  1. Within a thread, each operation happens-before every later operation in program order.
  2. A write to a volatile variable happens-before every subsequent read of that variable.
  3. The end of an object’s constructor happens-before the start of its finalizer.
  4. All operations in a thread happen-before another thread returns from Thread.join() on that thread.

The reason we say happen-before rather than simply “happens earlier” is that it is about more than timing during execution: it guarantees that the effects of an operation are visible to another thread.
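
Rule 4 is the easiest one to demonstrate. In the sketch below, the worker writes to a plain field (neither volatile nor synchronized), and the caller is still guaranteed to read 42, because everything the worker did happens-before join() returns. JoinVisibilityDemo is a made-up name.

```java
class JoinVisibilityDemo {
    private static int result;   // deliberately not volatile and not guarded by a lock

    static int computeInWorker() {
        result = 0;
        Thread worker = new Thread(() -> result = 42);
        worker.start();          // start() also creates a happens-before edge to the worker
        try {
            worker.join();       // all writes made by the worker are visible after this returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(computeInWorker()); // 42
    }
}
```

Without the join() call, reading result right after start() would be a data race and could return either 0 or 42.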

Why do we need to define the JMM? It simplifies multi-threaded programming and ensures program portability. Early versions of C and C++ had no memory model and relied on the memory ordering model of the chipset. That led to the same program behaving inconsistently on different hardware architectures.

Java wanted to solve the problem and introduced the JMM. However, it is extremely complicated.

The JMM is implemented internally through memory barriers, which prevent certain instruction reorderings and provide memory visibility. Let’s take volatile as an example and see how it is implemented with memory barriers according to the JMM. For a volatile variable:

  • The compiler inserts a write barrier after each write operation.
  • The compiler inserts a read barrier before each read operation.

Memory barriers make sure that modifications to a volatile variable are visible to all threads. In other words, the write barrier forces the CPU to flush its cache.

For example, the condition variable below should be declared volatile; otherwise, a change to condition might stay in the CPU cache for a while and thread A might not see it.

// Thread A
while (condition) {
}
// Thread B
condition = false;
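
Here is a runnable version of the same pattern, as a sketch. The caller plays the role of thread B, and the helper name stopWorker and the 5-second timeout are arbitrary choices for this example. With volatile, the worker is guaranteed to observe the write and leave the loop; if you remove volatile, the loop may spin forever on some JVMs.

```java
class VolatileFlagDemo {
    private static volatile boolean condition = true;

    /** Starts a spinning worker, clears the flag, and reports whether the worker stopped. */
    static boolean stopWorker() {
        condition = true;
        Thread threadA = new Thread(() -> {
            while (condition) {
                // busy-wait until another thread clears the flag
            }
        });
        threadA.setDaemon(true);   // do not keep the JVM alive if the loop never ends
        threadA.start();

        condition = false;         // thread B: the volatile write is visible to thread A
        try {
            threadA.join(5_000);   // wait up to 5 seconds for the worker to notice
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !threadA.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + stopWorker());
    }
}
```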

Q30. What are the problems of running a Java program in Docker or other containers?

Docker’s resource isolation is implemented through cgroups, which the JVM does not recognize before Java 8u131. As a result, there are a couple of issues:

  1. If JVM memory is not configured properly, the JVM might use more memory than the container limit, which leads to an OOM kill.
  2. The JVM is not aware of the CPU limit and thus makes wrong assumptions about GC parallelism.

How to solve this problem?

  1. Upgrade to the latest JDK.
  2. Explicitly set the heap and Metaspace sizes.
  3. Explicitly set the GC and JIT thread counts.
  4. Explicitly set MaxRAM:
-XX:MaxRAM=`cat /sys/fs/cgroup/memory/memory.limit_in_bytes`
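
A quick way to check whether a given JDK sees the container limits is to print what the JVM believes it has. This is a small sketch with a hypothetical class name; run it inside the container and compare the output with the limits you configured.

```java
class ContainerViewDemo {
    /** Number of CPUs the JVM thinks it may use. */
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Maximum heap size the JVM picked, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("CPUs visible to the JVM: " + cpus());
        System.out.println("Max heap chosen by the JVM: " + maxHeapMb() + "MB");
    }
}
```

If the CPU count matches the host rather than the container's quota, or the max heap looks like a fraction of the host's RAM, the JVM is not honoring the cgroup limits and the explicit settings above are needed.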