JAVA 9 — A more space-efficient String

A simple proposal: save space by changing how Strings are stored internally.

If you do nothing else but run your application on JDK 9, you will see its memory footprint reduced … probably. The change lives inside the JDK's String implementation, so no recompilation or code change is needed.

Before Java 9

Strings are stored and manipulated as character arrays (char[]). Each char takes two bytes (16 bits) and holds one UTF-16 code unit.

Data analysis (http://cr.openjdk.java.net/~shade/density/state-of-string-density-v1.txt and http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf) across many different applications made it clear that a significant chunk of memory was being devoted to Strings. A closer look showed that most of these Strings contained only Latin-1 characters. Since a Latin-1 character fits in a single byte, quite a lot of memory can be reclaimed by changing the internal representation of Strings.
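To make the potential saving concrete, here is a minimal sketch that compares the bytes of character data a String needs at two bytes per char with what it would need if Latin-1-only Strings used one byte per char. The class and method names (StringPayload, fitsLatin1, utf16Bytes, compactBytes) are my own and purely illustrative, and object and array headers are ignored.

```java
final class StringPayload {
    // True when every UTF-16 code unit of s lies in the Latin-1 range (U+0000..U+00FF).
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    // Bytes of character data at two bytes per char (the pre-Java 9 layout).
    static int utf16Bytes(String s) {
        return s.length() * Character.BYTES;
    }

    // Bytes of character data if Latin-1-only strings use one byte per char.
    static int compactBytes(String s) {
        return fitsLatin1(s) ? s.length() : utf16Bytes(s);
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String mixed = "hello \u20AC"; // the euro sign is outside Latin-1
        System.out.println(utf16Bytes(ascii) + " -> " + compactBytes(ascii)); // 10 -> 5
        System.out.println(utf16Bytes(mixed) + " -> " + compactBytes(mixed)); // 14 -> 14
    }
}
```

A single character outside Latin-1 is enough to keep the whole String at two bytes per char, which is why the saving depends so much on the data an application handles.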

Background information

UTF is a family of multi-byte encoding schemes. UTF-8, in particular, is a variable-length encoding scheme, meaning it uses anywhere between one and four bytes to store each character.
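The one-to-four-byte range is easy to observe from Java. The sketch below encodes four sample characters with the standard String.getBytes(Charset) call; the class name Utf8Lengths and the helper utf8Length are just illustrative.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (U+0041, Latin capital A)
        System.out.println(utf8Length("\u00E9"));       // 2 (U+00E9, e with acute accent)
        System.out.println(utf8Length("\u20AC"));       // 3 (U+20AC, euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, an emoji)
    }
}
```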

ISO-8859 is a family of single-byte encoding schemes. ISO-8859-1, in particular, is also known as Latin-1 and is suited to many Western European languages.

Unicode is the attempt to unify all encoding standards. It supports 1 114 112 code points. This is an enormous number of characters; in fact, it is so big that private-use 'sections' are set aside for specialised uses, such as reproducing Klingon characters (http://www.klingonwiki.net/En/Unicode).
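Both the size of the code space and the private-use areas can be checked with java.lang.Character. This is a small sketch; the class CodePoints and its methods are hypothetical helpers, and U+E000 is simply the first private-use code point, not a claim about where any particular script lives.

```java
final class CodePoints {
    // Total number of Unicode code points: U+0000 through U+10FFFF inclusive.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // True when Unicode classifies the code point as private use.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());             // 1114112
        System.out.println(isPrivateUse(0xE000));          // true
        System.out.println(isPrivateUse('A'));             // false
        // Code points above U+FFFF need two chars (a surrogate pair) in UTF-16.
        System.out.println(Character.charCount(0x1F600));  // 2
    }
}
```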

Fun fact: Unicode defines a replacement character, � (U+FFFD), which can stand in for a character that cannot be decoded correctly.
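You can see the replacement character appear by decoding bytes that are not valid UTF-8. The String(byte[], Charset) constructor substitutes the charset's replacement string for malformed input, which for UTF-8 is U+FFFD. The class name ReplacementDemo and the sample bytes are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {
    // Decodes bytes as UTF-8; malformed input is replaced rather than rejected.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] broken = { 'o', 'k', (byte) 0xFF }; // 0xFF never occurs in valid UTF-8
        String text = decodeUtf8(broken);
        System.out.println(text.length());              // 3
        System.out.println(text.charAt(2) == '\uFFFD'); // true
    }
}
```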

If you are still confused, try this excellent blog post on the subject: http://kunststube.net/encoding/

With Java 9

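Java 9 replaces the two-bytes-per-character char array with a byte array, and introduces an encoding-flag field declaring which encoding that array uses. Strings containing only Latin-1 characters are stored with one byte per character; all other Strings are still stored as UTF-16, with two bytes per character.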
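To illustrate the idea, here is a rough sketch of a byte array plus an encoding flag. It is not the JDK's actual String code: the class CompactText, its field names and its constants are assumptions made for this example, and the real implementation is considerably more involved.

```java
// Hypothetical illustration of "byte array + encoding flag"; not the real java.lang.String.
final class CompactText {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char

    private final byte[] value; // the encoded characters
    private final byte coder;   // which encoding value uses

    CompactText(String s) {
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        coder = latin1 ? LATIN1 : UTF16;
        value = new byte[s.length() << coder];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (latin1) {
                value[i] = (byte) c;
            } else {
                value[2 * i] = (byte) (c >> 8); // high byte first (big-endian)
                value[2 * i + 1] = (byte) c;    // then low byte
            }
        }
    }

    byte coder() { return coder; }

    int storedBytes() { return value.length; }

    // Number of chars: every char occupies 1 << coder bytes.
    int length() { return value.length >> coder; }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        CompactText a = new CompactText("hello");
        CompactText b = new CompactText("hello \u20AC");
        System.out.println(a.length() + " chars in " + a.storedBytes() + " bytes"); // 5 chars in 5 bytes
        System.out.println(b.length() + " chars in " + b.storedBytes() + " bytes"); // 7 chars in 14 bytes
    }
}
```

Callers only ever see length() and charAt(), so the choice of encoding stays an internal detail. That is the property that lets the representation change without touching the public API.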

The beauty of it all is that no API has been hurt during this very impactful change!

As always, I’ve based this interpretation on the JEP that led to the change: http://openjdk.java.net/jeps/254.

Please let me know your thoughts, and whether you have noticed any difference in memory consumption.