JDK 18 and the UTF-8 as default charset

An overview about the change of the default charset in JDK 18

Andrea Binello
Mar 27, 2022

The new JDK 18, released a few days ago (on March 22, 2022), has introduced an important change to the “default” charset. This article discusses this new behavior and provides some code to check the new rules.

What is the default charset?

A character set (“charset” for short, see Character encoding on Wikipedia) is always involved (either explicitly or implicitly) when there is a conversion between a sequence of bytes and a sequence of characters (char in Java). Several classes in the standard Java library perform this kind of conversion, such as FileReader, FileWriter and PrintStream from the java.io package, but also, for example, Scanner and Formatter from the java.util package.

When you write Java code like the following:

FileWriter fw = new FileWriter("data.txt");
Scanner sc = new Scanner(new File("data.txt"));

you are implicitly using the default charset, which is returned by the static method Charset.defaultCharset(), available since JDK 5.

Note that you can also use one of the other constructors that take an explicit charset.
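For example, since JDK 11 FileWriter and FileReader have constructors that accept a Charset, and Scanner has accepted one since JDK 10. A minimal sketch (the file name data.txt is just an example):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        // Passing the charset explicitly makes the code independent
        // of the platform's default charset (FileWriter(File, Charset) since JDK 11)
        try (FileWriter fw = new FileWriter("data.txt", StandardCharsets.UTF_8)) {
            fw.write("\u00E0\u00E8\u00EC\u00F2\u00F9"); // "àèìòù"
        }
        // Scanner(File, Charset) exists since JDK 10
        try (Scanner sc = new Scanner(new File("data.txt"), StandardCharsets.UTF_8)) {
            System.out.println(sc.nextLine()); // prints "àèìòù" on any platform
        }
    }
}
```

With explicit charsets, the same file round-trips correctly regardless of the default charset of either machine.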

Before JDK 18 the default charset depended heavily on the operating system. It is UTF-8 on many (but not all) Linux machines, and it is also UTF-8 on macOS, except in the POSIX C locale. On Windows machines the default charset can be quite different: for example Windows-1252 (especially in Western Europe) or Windows-31j (Japanese).

This clearly can pose some compatibility problems. If you use new FileWriter("data.txt") on a Windows machine and then that file is moved to a Linux machine where you use new FileReader("data.txt"), you will almost certainly have encoding problems if the file contains non-ASCII characters like “à” or “ô”.
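This mismatch can be simulated inside a single JVM by encoding with one charset and decoding with another (a sketch using explicit charsets rather than the platform defaults; the class name is made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    public static void main(String[] args) {
        String original = "\u00E0\u00F4"; // "àô", written with Unicode escapes
        // Simulate a file written on a machine whose default charset is Windows-1252...
        byte[] bytes = original.getBytes(Charset.forName("windows-1252"));
        // ...and read back on a machine whose default charset is UTF-8
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // The single Windows-1252 bytes are malformed UTF-8 sequences,
        // so the decoder substitutes U+FFFD (the replacement character)
        System.out.println(original.equals(decoded)); // false
    }
}
```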

UTF-8 as default from JDK 18

Starting with JDK 18 the default charset is always UTF-8, unless it is explicitly configured otherwise. Thus, for example, a new FileWriter("data.txt") on JDK 18 uses UTF-8 by default on every platform. This, too, can pose compatibility problems.

If you want to explore this issue, you can try the following sample code:
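A minimal program along these lines could look as follows (a sketch reconstructed from the output listings shown in this article; Console.charset() requires JDK 17 or newer, and System.console() can be null when the streams are redirected):

```java
import java.nio.charset.Charset;

public class CharsetInfo {
    public static void main(String[] args) {
        System.out.println("Java Runtime version " + Runtime.version());
        System.out.println("---------------------------------------------------------");
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
        // native.encoding exists since JDK 17; the sun.* properties are
        // unspecified/unsupported internals (see JEP 400)
        printProperty("file.encoding");
        printProperty("native.encoding");
        printProperty("sun.jnu.encoding");
        printProperty("sun.stdout.encoding");
        printProperty("sun.stderr.encoding");
        // Console.charset() exists since JDK 17
        System.out.println("System.console().charset() = "
                + (System.console() != null ? System.console().charset() : "(no console)"));
    }

    private static void printProperty(String name) {
        System.out.println("System.getProperty(\"" + name + "\") = " + System.getProperty(name));
    }
}
```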

Before looking at the results, there are some notes:

  1. The native.encoding system property exists since JDK 17. It contains the charset chosen according to the operating system and JDK rules, regardless of the value of the default charset (whether explicitly configured or not).
  2. The other three system properties (sun.jnu.encoding, sun.stdout.encoding and sun.stderr.encoding) are used internally by the JDK. They are unspecified/unsupported, but they are at least explained in JEP 400 (see the link at the end of the article).
  3. The charset() method of the java.io.Console class exists since JDK 17. If you want to try the above code on a JDK older than 17, comment out that line.

Result on an Oracle JDK 17.0.2 (Windows 10 OS)

> java CharsetInfo

Java Runtime version 17.0.2+8-LTS-86
---------------------------------------------------------
Charset.defaultCharset() = windows-1252
System.getProperty("file.encoding") = Cp1252
System.getProperty("native.encoding") = Cp1252
System.getProperty("sun.jnu.encoding") = Cp1252
System.getProperty("sun.stdout.encoding") = cp850
System.getProperty("sun.stderr.encoding") = cp850
System.console().charset() = IBM850

Note that both the file.encoding and native.encoding system properties reflect the same encoding as the default charset (Cp1252 is an alias for windows-1252). Also note that the charsets of the console and of standard output/error can be quite different and have nothing to do with the default charset.
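The alias relationship can be checked programmatically: Charset.forName() accepts either name, and name() returns the canonical java.nio name:

```java
import java.nio.charset.Charset;

public class CharsetAlias {
    public static void main(String[] args) {
        // "Cp1252" is a historical java.io/java.lang name;
        // the canonical java.nio name is "windows-1252"
        Charset cs = Charset.forName("Cp1252");
        System.out.println(cs.name()); // windows-1252
        // Charset.equals() compares canonical names, so both lookups match
        System.out.println(cs.equals(Charset.forName("windows-1252"))); // true
    }
}
```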

Result on an Oracle JDK 18 (Windows 10 OS)

> java CharsetInfo

Java Runtime version 18+36-2087
---------------------------------------------------------
Charset.defaultCharset() = UTF-8
System.getProperty("file.encoding") = UTF-8
System.getProperty("native.encoding") = Cp1252
System.getProperty("sun.jnu.encoding") = Cp1252
System.getProperty("sun.stdout.encoding") = cp850
System.getProperty("sun.stderr.encoding") = cp850
System.console().charset() = IBM850

Note here that on JDK 18 the default charset is now UTF-8, and this is also reflected in the file.encoding system property. The native.encoding system property, on the other hand, still reflects the “native” encoding determined by the operating system, which remains Cp1252 (windows-1252).

If you use JDK 18 and want to revert to the “old” behavior, there is a simple and documented way: set the file.encoding system property to COMPAT when you launch the application:

> java -Dfile.encoding=COMPAT CharsetInfo

Java Runtime version 18+36-2087
---------------------------------------------------------
Charset.defaultCharset() = windows-1252
System.getProperty("file.encoding") = Cp1252
System.getProperty("native.encoding") = Cp1252
System.getProperty("sun.jnu.encoding") = Cp1252
System.getProperty("sun.stdout.encoding") = cp850
System.getProperty("sun.stderr.encoding") = cp850
System.console().charset() = IBM850

In this way the behavior is exactly the same as on JDK 17 and earlier versions.

Source encoding from JDK 18

JDK 18 has changed the default charset not only at runtime, but also at compile time! Java has always allowed you to write a .java source file in any charset/encoding you prefer (e.g. ISO-8859-1, Windows-1252, UTF-16, etc.), on condition that this charset either matches the default charset or is explicitly specified with the -encoding option of the javac command.

If you want to check this behavior (especially on a Windows OS), you can try the following small program:

import javax.swing.JOptionPane;

public class SourceCharsetTest {
    public static void main(String[] args) {
        JOptionPane.showMessageDialog(null, "àèìòù");
    }
}

Note: make sure the file is saved as UTF-8 (for example with the Notepad++ editor, setting the encoding to UTF-8).

Result on an Oracle JDK 17.0.2 (Windows 10 OS)

C:\Samples>javac -version
javac 17.0.2
C:\Samples>java -version
java version "17.0.2" 2022-01-18 LTS
Java(TM) SE Runtime Environment (build 17.0.2+8-LTS-86)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.2+8-LTS-86, mixed mode, sharing)
C:\Samples>javac SourceCharsetTest.java

C:\Samples>java SourceCharsetTest

The dialog box shows a “strange” message because the source file is encoded in UTF-8 while the default charset on my machine is Windows-1252. The Java compiler has therefore misinterpreted the string literal, reading 10 characters instead of 5 (note: characters like “à”, “è”, “ì”, “ò” and “ù” take 2 bytes each in UTF-8).
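The arithmetic can be checked directly: each of the five accented characters occupies two bytes in UTF-8 but only one byte in Windows-1252 (the literal below is written with Unicode escapes to avoid any source-encoding ambiguity):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCount {
    public static void main(String[] args) {
        String s = "\u00E0\u00E8\u00EC\u00F2\u00F9"; // "àèìòù"
        System.out.println(s.length());                                         // 5 characters
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);          // 10 bytes in UTF-8
        System.out.println(s.getBytes(Charset.forName("windows-1252")).length); // 5 bytes in Windows-1252
    }
}
```

Reading those 10 UTF-8 bytes as Windows-1252 is exactly what produced the 10-character “strange” message in the dialog.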

Result on an Oracle JDK 18 (Windows 10 OS)

C:\Samples>javac -version
javac 18
C:\Samples>java -version
java version "18" 2022-03-22
Java(TM) SE Runtime Environment (build 18+36-2087)
Java HotSpot(TM) 64-Bit Server VM (build 18+36-2087, mixed mode, sharing)
C:\Samples>javac SourceCharsetTest.java

C:\Samples>java SourceCharsetTest

In JDK 18 the default encoding for source files is now UTF-8. This perfectly matches the encoding of the source file SourceCharsetTest.java, which I saved as UTF-8. In this case the Java compiler has correctly interpreted the string literal as composed of 5 characters (not 10).

This change in JDK 18 may, again, pose compatibility problems at the source level. However, if you have always used the -encoding option, or IDEs/tools where you explicitly set the source encoding, then you should have no problems migrating to JDK 18.

For example, with Apache Maven it’s very common to set the following property in the pom.xml:

<properties>
    .........
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

With this configuration the Maven build is always independent of the default charset.

If you want more precise and detailed information, you can read JEP 400: https://openjdk.java.net/jeps/400


Andrea Binello

I am an Italian senior Java back-end developer, SCJP5/SCWCD5 certified. I have an Italian blog, https://andbin.it, mainly about Java programming.