Introduction to character encoding

Wan Xiao
11 min read · Jun 12, 2022


Almost every programmer, except perhaps those in English-speaking countries, runs into garbled characters sooner or later. We are often taught: always use UTF-8. Then all the gibberish disappears and the world returns to peace.

The full name of UTF-8 is 8-bit Unicode Transformation Format. We also often hear the term Unicode; what is the relationship between Unicode and UTF-8? What is the difference between UTF-16 and UTF-8? Are Chinese characters really encoded in 2 bytes? What is the Modified UTF-8 in JNI? What is the surrogate area of Unicode?

Character encoding is a fundamental topic, but almost all programming books only mention it briefly, so we may still be troubled by garbled characters. Here I will briefly share my understanding of character encoding. After this, if you run into the same problems again, at least you won't be confused.

ASCII

American Standard Code for Information Interchange is perhaps the most well-known character encoding standard. One byte encodes one character, and only 7 bits are used, encoding 128 characters. Later, EASCII (Extended ASCII) was released to use all 8 bits, encoding a total of 256 characters, a few more exotic characters than ASCII. ASCII is probably the first encoding standard programmers learn, and I won't introduce it further here.

Traditional character encoding scheme

When ASCII was proposed, it did not take the characters of other languages such as Chinese, Japanese and Korean into account. With only one byte, only 256 characters can be encoded, so the characters of many countries cannot be represented. As computers spread, various countries proposed their own character encoding standards. These standards are similar; the basic idea is to stay compatible with ASCII and use two bytes to encode the additional characters, such as GB2312 for simplified Chinese characters and BIG5 for traditional Chinese characters. But these standards share an obvious problem: each one only looks after its own language. That is, they are compatible with ASCII and can encode the characters they support, but they cannot encode characters of other languages. Therefore, these encoding standards can only handle bilingual text (Latin alphabet + native language) and do not support multilingual environments (multiple languages mixed). For instance, GB2312 does not support Arabic characters, so if your text contains Chinese, English and Arabic, GB2312 cannot encode all of it.
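As a small illustration, the Java sketch below asks the GB2312 encoder whether it can encode a given string (assuming the JRE exposes a charset under the name "GB2312", which is common but not guaranteed everywhere):

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Assumption: the running JRE provides a charset named "GB2312".
CharsetEncoder gb2312 = Charset.forName("GB2312").newEncoder();
System.out.println(gb2312.canEncode("中文 English")); // true: Chinese + Latin letters
System.out.println(gb2312.canEncode("مرحبا"));        // false: Arabic is not in GB2312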

Unicode was created to overcome the limitations of these traditional encoding standards. Before talking about Unicode, we first need to understand the modern encoding model.

Modern encoding model

Rather than mapping characters directly to octets, modern encoding models divide the concept of character encoding into several layers.

Abstract Character Repertoire

That is, the set of all abstract characters supported by a system. For example, “S” is a character; the abstract character repertoire does not care about its font style or size, it is just a character. Some characters are not even printable, such as “\n”.

A character repertoire can be closed, meaning no new characters may be added (as with ASCII), while the Unicode character repertoire is open, allowing new characters to be added.

Coded Character Set

A coded character set maps each character in the character repertoire to a coordinate or a non-negative integer. The character repertoire plus this mapping is called a coded character set; Unicode in the narrow sense is a coded character set. Since characters in the repertoire are mapped to something else, the concept of an encoding space arises: the encoding space is simply the space holding all characters of the repertoire, and it can be described by integers, coordinates, or rows, columns and planes. A position in the encoding space is called a code point.

So a coded character set maps all characters in the repertoire to code points, and each code point represents one character.
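As a minimal sketch in Java (whose string API exposes Unicode code points directly), String.codePointAt returns the code point a character is mapped to, independent of how it will later be stored:

// Code points are just non-negative integers assigned by the coded character set.
System.out.printf("U+%04X%n", "S".codePointAt(0));  // U+0053
System.out.printf("U+%04X%n", "中".codePointAt(0));  // U+4E2D
System.out.printf("U+%04X%n", "😀".codePointAt(0));  // U+1F600, outside the BMP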

Character Encoding Form

A character encoding form maps each code point to one or more code units, to facilitate storage in a system that represents numbers as bit sequences of fixed length. For example, a system that stores numeric information in 16-bit units can directly represent code points 0 to 65,535 in each unit, but larger code points (65,536 up to 1,114,111) can only be represented by using multiple 16-bit units.

For fixed-length encodings such as UCS-2, the mapping is the identity. For variable-length encodings such as UTF-8, the mapping is more complicated: some code points are mapped to a single code unit, and others to a sequence of multiple code units. The simplest character encoding form simply chooses a unit large enough that every value in the coded character set can be encoded directly (one code point corresponds to one code unit). This is reasonable for coded character sets that fit into 8-bit code units, and reasonable enough for those that fit into 16 bits (such as earlier versions of Unicode). As the coded character set grows (the current Unicode character set needs at least 21 bits to be fully represented), this direct representation becomes increasingly wasteful, and it is difficult to make existing computer systems support larger code units.
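A quick Java sketch of the distinction (Java strings use UTF-16 code units, so length() counts code units, not code points):

import java.nio.charset.StandardCharsets;

String s = "😀"; // one code point outside the BMP
System.out.println(s.codePointCount(0, s.length())); // 1 code point
System.out.println(s.length());                      // 2 UTF-16 code units
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 code units (bytes)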

Character Encoding Scheme

A character encoding scheme maps code units to sequences of octets, to facilitate storage on an octet-based file system or transmission over an octet-based network.
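For example (a sketch using the JDK charset API), the single 16-bit code unit of “中” (U+4E2D) can be serialized into octets in either byte order:

import java.nio.charset.StandardCharsets;

byte[] be = "中".getBytes(StandardCharsets.UTF_16BE); // 4E 2D
byte[] le = "中".getBytes(StandardCharsets.UTF_16LE); // 2D 4E
for (byte b : be) System.out.printf("%02X ", b);
System.out.println();
for (byte b : le) System.out.printf("%02X ", b);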

Transfer Encoding Syntax

This layer processes the octet sequences produced by the character encoding scheme. It generally serves two purposes: one is to map the values of the octet sequences into a more restricted range to satisfy the limitations of the transmission environment, for example Base64, which turns arbitrary 8-bit data into a 7-bit-safe form for Email transmission; the other is to compress the octet sequences, for example with the Lempel–Ziv–Welch algorithm or run-length encoding.

This layer only exists for transfer, so strictly speaking it may not be part of the character encoding model.
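A small sketch of that layering with the JDK's Base64 codec: the string is first turned into UTF-8 octets (character encoding scheme), then those octets are wrapped in Base64 (transfer encoding syntax):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

byte[] utf8 = "你好, world".getBytes(StandardCharsets.UTF_8); // encoding scheme: UTF-8 octets
String wire = Base64.getEncoder().encodeToString(utf8);       // transfer syntax: Base64 text
String back = new String(Base64.getDecoder().decode(wire), StandardCharsets.UTF_8);
System.out.println(wire);
System.out.println(back); // 你好, world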

Unicode

When Unicode was first proposed, it was believed that 2 bytes would be enough to encode all modern characters, but in practice many more characters were added, so the current Unicode encoding space is 0x0~0x10FFFF, which needs at least 21 bits to describe, slightly less than 3 bytes.

I occasionally see the statement “a Unicode character requires 4 bytes of storage”. In fact, Unicode, as a coded character set, has no notion of how many bytes a character takes to store. Its job is to map characters to code point values distributed in 0x0~0x10FFFF. How many bytes a character occupies in storage has nothing to do with the Unicode character set, only with the actual character encoding form and scheme used. For example, we can say that a character in UCS-4 requires 4 bytes of storage.

Unicode divides the encoding space into 17 planes, from 0 to 16. The 0th plane is called the Basic Multilingual Plane (BMP), and the other 16 planes can be collectively called the Supplementary Plane (SP).

Unicode encoding space and plane

Although the BMP (Basic Multilingual Plane) of Unicode has only 2¹⁶ = 65536 code points, which may sound like few, most characters in everyday use in fact fall in the BMP, including characters from Indic scripts, Arabic, Chinese, Japanese and Korean.
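Since each plane holds 0x10000 code points, the plane number is simply the code point divided by 0x10000; a small Java sketch:

int bmp = "中".codePointAt(0);  // U+4E2D
int sp  = "😀".codePointAt(0);  // U+1F600
System.out.println(bmp >>> 16); // 0: Basic Multilingual Plane
System.out.println(sp >>> 16);  // 1: first supplementary plane
System.out.println(Character.isBmpCodePoint(sp)); // false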

Unicode defines two families of mappings: the UTF (Unicode Transformation Format) encodings and the UCS (Universal Coded Character Set) encodings.
The most commonly used are UTF-8 and UTF-16; UCS-2 is the predecessor of UTF-16, and UTF-32 and UCS-4 are functionally equivalent.

UCS-2

Let us start with a simple character encoding: UCS-2. It is really a character encoding form. The UCS-2 code unit has a fixed length of 16 bits; it directly maps the code points in the BMP to numerically equal 16-bit code units. This also means that UCS-2 cannot encode characters in the supplementary planes.

UCS-4 / UTF-32

Since UCS-2 cannot encode all Unicode characters, the easiest fix is to expand the code unit to 32 bits and map Unicode code points directly. The encoding form of UTF-32 is short and sweet, and it consumes the most space. I don't know who uses it.

UTF-16

When you save a text file with Notepad on Windows, you can choose the encoding; one of the options is called Unicode, which is actually UTF-16 (little-endian, with a BOM).

The code unit of UTF-16 is 16 bits, the same size as UCS-2. People doing Android/Java development may think UTF-16 is far away from them. In fact, it is not: String in Java is represented inside the JVM using UTF-16. The evidence is that the char type in Java is 2 bytes long. I used to wonder why a char can hold a Chinese character and why char is 2 bytes; now we know that when we call String.charAt, we get a UTF-16 code unit. Since the code unit of UTF-16 is 16 bits, just like UCS-2, why can it encode the entire Unicode character set rather than just the BMP?

There are two points.

  1. UTF-16 is a variable-length encoding form.
  2. There is a special area in the BMP: the surrogate area. The Unicode standard specifies that U+D800~U+DFFF are not mapped to any character. UTF-16 uses surrogate pairs built from this area to address characters outside the BMP.

For BMP (U+0000 ~ U+FFFF), the encoding form of UTF-16 is the same as that of UCS-2, which directly maps code points to a 16-bit code unit.

For the SP (U+010000 ~ U+10FFFF), the code units are calculated as follows:

First subtract 0x010000, leaving a 20-bit value (0x00000~0xFFFFF).

Take the high 10 bits (range 0x000~0x3FF) and add 0xD800 to get the first 16-bit code unit, in the range 0xD800~0xDBFF, known as the leading surrogate.

Take the low 10 bits (also in the range 0x000~0x3FF) and add 0xDC00 to get the second 16-bit code unit, in the range 0xDC00~0xDFFF, known as the trailing surrogate.

A surrogate pair, consisting of a leading surrogate followed by a trailing surrogate, represents one code point in the SP.

Notice that the ranges of the leading and trailing surrogates together are exactly the surrogate area U+D800~U+DFFF of the BMP. At the same time, the code points of valid BMP characters, leading surrogates and trailing surrogates do not overlap, so when we see a 16-bit code unit in UTF-16 encoded data, we can tell exactly which kind it is just by its value.

If you don’t understand what the algorithm is doing, it can be understood like this:

There are a total of 0x10FFFF - 0x010000 + 0x1 = 0x100000 = 2²⁰ code points in SP.

To map these 2²⁰ code points onto a coordinate pair, both the abscissa and the ordinate need 2¹⁰ = 1024 values.

The leading surrogate range 0xD800~0xDBFF contains exactly 1024 code points, and the trailing surrogate range 0xDC00~0xDFFF also contains exactly 1024 code points. In this way, a code point in the SP can be encoded as a coordinate (leading surrogate, trailing surrogate).
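A short Java sketch of the calculation above for U+1F600, cross-checked against the JDK's own surrogate helpers:

int cp = 0x1F600;                           // a code point in the supplementary planes
int v  = cp - 0x10000;                      // 20 bits remain
char lead  = (char) (0xD800 + (v >>> 10));  // high 10 bits -> leading surrogate
char trail = (char) (0xDC00 + (v & 0x3FF)); // low 10 bits -> trailing surrogate
System.out.printf("%04X %04X%n", (int) lead, (int) trail);  // D83D DE00
System.out.println(lead == Character.highSurrogate(cp));    // true
System.out.println(trail == Character.lowSurrogate(cp));    // true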

Therefore, Android/Java developers should be careful when using String.charAt: you may get a leading or trailing surrogate instead of a complete BMP character.
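For example (a sketch; String.codePoints is available since Java 8), iterating by char splits a supplementary character into its surrogates, while iterating by code point does not:

String s = "a😀b";
System.out.println(s.length());                  // 4 code units, although there are only 3 characters
System.out.printf("%04X%n", (int) s.charAt(1));  // D83D: a leading surrogate, not a character
s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp)); // U+0061 U+1F600 U+0062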

In UTF-16, not only Chinese characters but all characters in the BMP are represented by two bytes (16 bits).

In JNI, the jstring-related methods whose names do not contain UTF basically use UTF-16, for example:

const jchar * GetStringChars(JNIEnv *env, jstring string, jboolean *isCopy);
Gets the UTF-16 code units of a jstring.

jstring NewString(JNIEnv *env, const jchar *unicodeChars, jsize len);
Constructs a jstring from a UTF-16 encoded jchar array.

UTF-8

The UTF-8 code unit is 8 bits, and UTF-8 is a variable-length encoding. It is also very simple; just refer to the following table:

+====================+=====================================+
| Code point range   | UTF-8 byte sequence                 |
+====================+=====================================+
| U+0000 ~ U+007F    | 0xxxxxxx                            |
+--------------------+-------------------------------------+
| U+0080 ~ U+07FF    | 110xxxxx 10xxxxxx                   |
+--------------------+-------------------------------------+
| U+0800 ~ U+FFFF    | 1110xxxx 10xxxxxx 10xxxxxx          |
+--------------------+-------------------------------------+
| U+10000 ~ U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
+--------------------+-------------------------------------+

In UTF-8, the Chinese characters we commonly use are encoded as three bytes, because most of them fall in U+0800~U+FFFF.
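A quick sketch with the JDK charset API, matching the table above:

import java.nio.charset.StandardCharsets;

System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 byte  (U+0041)
System.out.println("©".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes (U+00A9)
System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // 3 bytes (U+4E2D)
System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 bytes (U+1F600)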

BOM

BOM (byte order mark) is also a common term. For example, our source files may be required to be saved as UTF-8 without BOM; otherwise compilation may fail, or other strange things may happen.

BOM is the name of the Unicode character at code point U+FEFF.

For encoding forms whose code units are not 8 bits, such as UTF-16, UCS-2 and UTF-32 / UCS-4, a byte order question inevitably arises when the encoded data is stored or transmitted: should the most significant byte of a code unit be stored at the smallest memory address or at the largest?

The BOM appears at the beginning of a byte stream and identifies both the byte order and the Unicode character encoding form of the stream. Each encoding form encodes U+FEFF in its own way and byte order and places the result at the start of the byte stream to mark the byte order.

For example, suppose we know a byte stream is encoded in UTF-16 but the byte order is unknown, and the first two bytes are 0xFF 0xFE. Since U+FFFE is not mapped to any character in Unicode, these two bytes must be the encoded BOM U+FEFF, so we can conclude that the byte stream uses little endian, that is, UTF-16 LE.

For UTF-8, since it uses 8-bit code units, there is no byte order problem. Adding a BOM to the beginning of UTF-8 files is not recommended, because it may confuse some tools, so UTF-8 without BOM has become mainstream.

Byte order marks for different encodings:

+==============================================+==============+
| Encoding (BE: Big-endian; LE: Little-endian) | Hex form     |
+==============================================+==============+
| UTF-8                                        | EF BB BF     |
+----------------------------------------------+--------------+
| UTF-16 BE                                    | FE FF        |
+----------------------------------------------+--------------+
| UTF-16 LE                                    | FF FE        |
+----------------------------------------------+--------------+
| UTF-32 BE                                    | 00 00 FE FF  |
+----------------------------------------------+--------------+
| UTF-32 LE                                    | FF FE 00 00  |
+----------------------------------------------+--------------+
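As a rough sketch (real-world detection needs more care), the leading bytes of a stream can be matched against the table above:

// Simplified BOM sniffing over the first few bytes of a stream.
// Note: UTF-32 LE must be checked before UTF-16 LE, since FF FE 00 00 also starts with FF FE.
String sniffBom(byte[] b) {
    if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) return "UTF-8 with BOM";
    if (b.length >= 4 && b[0] == 0 && b[1] == 0 && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF) return "UTF-32 BE";
    if (b.length >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE && b[2] == 0 && b[3] == 0) return "UTF-32 LE";
    if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) return "UTF-16 BE";
    if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) return "UTF-16 LE";
    return "no BOM (possibly UTF-8 without BOM)";
}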

Modified UTF-8

If your Android app uses C++ to process strings, you may encounter a crash like "input is not valid Modified UTF-8". The root cause of this crash is inconsistent encoding: we usually use standard UTF-8, for example in Java code, while the UTF-related methods in JNI use Modified UTF-8. Although both have UTF-8 in their names, they are essentially two different encodings, even though they are partially compatible.

In JNI, the string-related methods whose names contain UTF basically use MUTF-8 (Modified UTF-8). For example:

jstring NewStringUTF(JNIEnv *env, const char *bytes);
Constructs a jstring from an MUTF-8 byte stream.

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
Gets the MUTF-8 encoded bytes of a jstring.

But in Java, all string methods related to UTF-8 use standard UTF-8:

String.getBytes("UTF-8")
Gets the standard UTF-8 encoded byte array of the string.

new String(bytes, "UTF-8")
Constructs a String from a standard UTF-8 byte stream.

However, the UTF-related methods in DataInput and DataOutput use MUTF-8.

Many developers do not know the difference between UTF-8 in the Java layer and the JNI layer, so once they encounter problems caused by inconsistent encoding, they will be very confused.

There are two differences between MUTF-8 and standard UTF-8.

  1. The null character (“\0”) is encoded as two bytes in MUTF-8: 0xC0 0x80 (i.e. 11000000 10000000). In standard UTF-8 it is encoded as a single byte 0x00 (i.e. 00000000).
  2. Characters in the SP are first encoded as a surrogate pair, as in UTF-16, and then each of the two surrogates is encoded with the 3-byte UTF-8 pattern.

Characters in BMP, except “\0”, are encoded in the same way in MUTF-8 and UTF-8.

Since the SP range is U+10000~U+10FFFF, standard UTF-8 encodes an SP code point as 4 bytes, while a code point in the surrogate area U+D800~U+DFFF takes 3 bytes under the UTF-8 byte patterns. Therefore MUTF-8 needs 3 + 3 = 6 bytes to encode an SP code point, 2 bytes more than standard UTF-8. Strings in Android dex files are encoded in MUTF-8.
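A small Java sketch of both differences: DataOutputStream.writeUTF emits modified UTF-8 (preceded by a 2-byte length), while String.getBytes emits standard UTF-8:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

String s = "\0😀"; // a null character plus a supplementary-plane character
ByteArrayOutputStream buf = new ByteArrayOutputStream();
new DataOutputStream(buf).writeUTF(s); // modified UTF-8 (in a full program, handle the checked IOException)
// MUTF-8 payload: "\0" -> C0 80 (2 bytes), U+1F600 -> two 3-byte surrogates (6 bytes)
System.out.println(buf.size() - 2); // 8, after subtracting the 2-byte length prefix
// Standard UTF-8: "\0" -> 00 (1 byte), U+1F600 -> 4 bytes
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 5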

References

https://en.wikipedia.org/wiki/ASCII
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/Character_encoding
