String Deduplication in Java
This article explains what String Deduplication in Java is and demonstrates it with a small example.
TL;DR
String Deduplication allows multiple Strings with the same content to share the same underlying array (a char[] up to Java 8, a byte[] since Java 9). You can activate it as follows.
-XX:+UseG1GC -XX:+UseStringDeduplication
I’m using OpenJDK 13. As long as you’re using Java version 9 or above, you will be able to follow along and reproduce the results presented here.
First, let’s explain the basics of the String type. A String contains a field called value which holds the actual content (the characters, stored as a byte array since Java 9).
You probably know that it is bad practice to create new String objects with the String constructor instead of using something called a “String literal”.
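The original listing is not reproduced in this text, so the following is only a hypothetical stand-in, assuming a list that mixes Strings built from a character array, a String created with the constructor from a literal, and a plain literal. The class name StringIdentitySketch, the helper buildStrings and the content "hello" are my own assumptions; rename the class if you want to run it with the commands in this article.

```java
import java.util.ArrayList;
import java.util.List;

public class StringIdentitySketch {

    // Builds a list of four equal Strings that were created in different ways.
    static List<String> buildStrings() {
        char[] chars = {'h', 'e', 'l', 'l', 'o'};
        List<String> strings = new ArrayList<>();
        strings.add(new String(chars));   // new object with its own copy of the content
        strings.add(new String(chars));   // another new object with another copy
        strings.add(new String("hello")); // "bad": new object created from a literal
        strings.add("hello");             // literal, taken from the String pool
        return strings;
    }

    public static void main(String[] args) throws Exception {
        List<String> strings = buildStrings();
        System.out.println(strings.get(0) == strings.get(1));      // false
        System.out.println(strings.get(2) == strings.get(3));      // false
        System.out.println(strings.get(0).equals(strings.get(3))); // true

        // Pass any argument to keep the JVM alive so a heap dump can be taken.
        if (args.length > 0) {
            System.out.println("PID " + ProcessHandle.current().pid()
                    + ", press Enter to exit");
            System.in.read();
        }
    }
}
```

All four Strings are equal by content, yet each one is a separate object on the heap. Whether they also own separate underlying arrays is what the heap dump is meant to reveal.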
What’s the problem with the code above? Let’s analyze it. I’m going to use VisualVM, but there are other options available. Run the following commands (I’m using a “-” sign to deactivate String Deduplication for clarity, but it is deactivated by default).
> javac StringDeduplication.java
> java -XX:+UseG1GC -XX:-UseStringDeduplication StringDeduplication
Then open up another console window to extract the heap dump (replace {PID} with your pid).
> jcmd {PID} GC.heap_dump -all ~/Desktop/stringdeduplication-1.hprof
Open this dump in VisualVM, select “Objects”, then “GC Roots”, and navigate to the ArrayList. You will find that the ArrayList contains 4 elements.
Notice here that the first 4 strings are different references, while the last two are the same. Expanding the elements further, you can see that the first two strings also point to two different byte arrays (the value field in String).
The “bad” string contains the same value field because a String literal is passed to the String constructor: the literal comes from the String pool, and the constructor reuses its underlying array. In this case it’s unnecessary to use the String constructor. There are some special cases where it can be useful, but they are not discussed here.
What to do? Here is where String Deduplication comes into the picture. (Note that in the JDK 13 setup used here, String Deduplication is enabled together with the G1 garbage collector; support in other collectors depends on the JDK version.) Run the following commands to start the application.
> javac StringDeduplication.java
> java -XX:+UseG1GC -XX:+UseStringDeduplication -Xlog:gc*=info StringDeduplication
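If you want to confirm from inside the JVM that the flags took effect, a minimal sketch could read them through the HotSpot diagnostic MXBean. This assumes a HotSpot-based JDK where the com.sun.management API is available; the class name DedupFlagCheck and the helper vmOption are hypothetical names chosen for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DedupFlagCheck {

    // Returns the current value of a HotSpot VM option as text, e.g. "true" or "false".
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseG1GC = " + vmOption("UseG1GC"));
        System.out.println("UseStringDeduplication = " + vmOption("UseStringDeduplication"));
    }
}
```

Started with the same flags as the application, this should report both options as true. Started without them, UseStringDeduplication is expected to be false, since it is deactivated by default.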
Then open up another console window to force a GC and then extract the heap dump (replace {PID} with your pid).
> jcmd {PID} GC.run
> jcmd {PID} GC.heap_dump -all ~/Desktop/stringdeduplication-2.hprof
We run a GC first because that is when String Deduplication happens, which can result in slightly longer pause times. In return, this should make other phases more efficient, as fewer objects need to be moved around.
Now open up VisualVM and note the difference in the underlying value field of the first two strings.
Voilà, they now share the same underlying value field.
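To make the effect concrete, here is a toy model of what deduplication does to the object graph. It is only a sketch under my own assumptions and not how the JVM implements it: ToyString stands in for java.lang.String, and deduplicate is a hypothetical helper that points equal strings at one shared array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyDeduplication {

    // Toy stand-in for java.lang.String: an object that points at a byte array.
    static final class ToyString {
        byte[] value;

        ToyString(String content) {
            this.value = content.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Makes all ToyStrings with equal content point at a single shared array.
    // Returns the number of strings whose array reference was replaced.
    static int deduplicate(List<ToyString> strings) {
        Map<String, byte[]> canonical = new HashMap<>();
        int replaced = 0;
        for (ToyString s : strings) {
            String key = new String(s.value, StandardCharsets.UTF_8);
            byte[] shared = canonical.putIfAbsent(key, s.value);
            if (shared != null && shared != s.value) {
                s.value = shared; // the old array becomes garbage
                replaced++;
            }
        }
        return replaced;
    }

    public static void main(String[] args) {
        List<ToyString> strings = new ArrayList<>();
        strings.add(new ToyString("hello"));
        strings.add(new ToyString("hello"));
        strings.add(new ToyString("world"));

        System.out.println(strings.get(0).value == strings.get(1).value); // false
        System.out.println(deduplicate(strings));                         // 1
        System.out.println(strings.get(0).value == strings.get(1).value); // true
        System.out.println(strings.get(0) == strings.get(1));             // false
    }
}
```

The last line is the important part: the objects themselves stay distinct, so reference comparisons in application code behave as before. Only the underlying arrays are shared.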
References
[1] VisualVM
https://visualvm.github.io/
[2] Java® Development Kit Version 13 Tool Specifications
https://docs.oracle.com/en/java/javase/13/docs/specs/man/index.html
[3] The java Command
https://docs.oracle.com/en/java/javase/13/docs/specs/man/java.html
[4] Java HotSpot VM Options
https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html
[5] The jcmd Command
https://docs.oracle.com/en/java/javase/13/docs/specs/man/jcmd.html