UTF-8 Explained — It’s Not an 8-Bit Encoding, Nor 32-Bit Unicode

Brian NQC
4 min read · Jul 18, 2023


Image source: https://www.perl.com/article/building-a-utf-8-encoder-in-perl/

UTF-8, a widely utilized character encoding, is unfortunately subject to various misconceptions, even among experienced developers. Here are two common misunderstandings:

  • Some developers mistakenly believe that UTF-8/16/32 encodes characters using 8/16/32 bits respectively.
  • Another misconception is that, because UTF-8 is a Unicode Transformation Format and Unicode defines far more characters than fit in a single byte, it must spend 32 bits on every character.

While understanding UTF-8 may not be crucial for writing good code day to day, it becomes essential when working with strings in depth. Let’s consider these two examples:

// Golang
vn := "Việt Nam"
fmt.Printf("%c", vn[2]) // Want: ệ but got: á
fmt.Printf("%c", vn[3]) // Want: t but got: »
// Java
final String iceCreams = "🍦🍧🍨";
System.out.println(iceCreams.charAt(0)); // Want: 🍦 but got: ?
System.out.println(iceCreams.charAt(1)); // Want: 🍧 but got: ?
System.out.println(iceCreams.charAt(2)); // Want: 🍨 but got: ?
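One way to get the intended characters, shown here as a minimal Go sketch, is to convert the string into a slice of runes (code points) and index that instead of the underlying bytes:

package main

import "fmt"

func main() {
	// Index runes (code points), not bytes.
	vn := []rune("Việt Nam")
	fmt.Printf("%c\n", vn[2]) // ệ
	fmt.Printf("%c\n", vn[3]) // t
}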

To understand these issues, we need to know how UTF-8 works. But first, let’s refresh our minds with the basics.

Encoding 101: ASCII and Unicode

I’m sure you’re familiar with ASCII, typically introduced in IT101. Being an American Standard, ASCII primarily focuses on the letters, digits, and punctuation marks commonly used in American English. Its concise and effective encoding requires just one byte per character, so the sentence “I’m a Vietnamese person.” takes only 24 bytes. But what if I want to write that same sentence in my mother tongue, “Tôi là người Việt Nam”?

Supporting non-English characters necessitates a larger character set, leading to the creation of Unicode. Unicode encompasses characters from all writing systems worldwide, including accented letters and many others, assigning each character a unique number called its Unicode code point. As of July 2023, Unicode 15 defines nearly 150,000 characters. Representing that many code points requires log2(150,000) ≈ 17.2 bits, already more than 2 bytes can hold, and the Unicode code space actually extends up to U+10FFFF, which needs 21 bits. Consequently, the natural data type to hold a single code point is int32, and that is precisely what UTF-32 uses.
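As a quick illustration (a small Go sketch; Go’s rune type is simply an alias for int32), each character maps to a single code point that fits comfortably in 32 bits:

package main

import "fmt"

func main() {
	// rune is an alias for int32: one value per Unicode code point.
	for _, r := range "Aệ中🏀" {
		fmt.Printf("%c = U+%04X (%d)\n", r, r, r)
	}
	// Prints, among others, 🏀 = U+1F3C0 (127936), which needs more than 16 bits.
}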

UTF-8 — Efficient Unicode

Using a fixed length of 32 bits for every character, as UTF-32 does, is perfectly functional. However, it comes at the cost of increased memory and storage consumption.

String s = "Unicode is amazing, Unicode thật tuyệt vời";
System.out.println(s.length()); // 42 chars
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 48 bytes
System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 168 bytes

Furthermore, despite the vast number of characters that exist, some see far more frequent use than others. For instance, allocating 4 bytes to represent characters like 𒀀𒁞𒁭 in Sumerian cuneiform makes sense. However, for efficiency, it is more practical to use just 1 byte to represent ASCII characters such as A, B, C and 2 or 3 bytes for characters like â, ă, ê, 世, 界. This very consideration served as the driving force behind the creation of UTF-8.

UTF-8 — Variable-length encoding

UTF-8 employs a variable-length encoding scheme, utilizing 1 to 4 bytes to represent each Unicode character. Notably, commonly used characters benefit from shorter encodings. ASCII characters, in particular, require only 1 byte, ensuring that UTF-8 and ASCII encodings are identical for strings like s := "UTF-8 and ASCII encodings for this English sentence are identical."

For most characters in modern languages such as Tiếng Việt, 中文, ภาษาไทย, and عربي, UTF-8 typically uses 2 or 3 bytes. It judiciously reserves the use of 4 bytes for less frequent characters, optimizing memory usage without compromising on comprehensive character representation.
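A small Go sketch using the standard unicode/utf8 package shows these variable widths in action (the characters chosen here are just illustrative):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// UTF-8 spends 1 byte on ASCII and up to 4 bytes on rarer code points.
	for _, r := range []rune{'A', 'â', 'ệ', '中', '🍦'} {
		fmt.Printf("%c takes %d byte(s)\n", r, utf8.RuneLen(r))
	}
	// A takes 1, â takes 2, ệ takes 3, 中 takes 3, 🍦 takes 4.
}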

Decoding UTF-8

With UTF-8 being a variable-length encoding, determining the number of bytes used for each character during decoding follows a straightforward rule:

  • If a byte’s high-order bit is 0, that byte encodes a character in 7 bits, i.e. a standard ASCII character.
  • If the high-order bits are 1s, the number of leading 1s in the first byte indicates how many bytes the character occupies. For example, 110xxxxx, 1110xxxx, and 11110xxx mark characters taking 2, 3, and 4 bytes, respectively. Every subsequent byte of a multi-byte character is a continuation byte and always starts with 10xxxxxx.

Let’s illustrate this with an example. Encoded byte stream:

01000001_11000011_10000010_11100100_10111000_10101101_11110000_10011111_10001111_10000000

  • 01000001 → Starts with 0 → ASCII → “A”
  • 11000011_10000010 → Starts with 110 → 2 bytes → “Â”
  • 11100100_10111000_10101101 → Starts with 1110 → 3 bytes → “中”
  • 11110000_10011111_10001111_10000000 → Starts with 11110 → 4 bytes → “🏀”
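To see the rule applied programmatically, here is a small Go sketch that feeds exactly those bytes to the standard unicode/utf8 decoder:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// The byte stream from the example above: A, Â, 中, 🏀.
	b := []byte{0x41, 0xC3, 0x82, 0xE4, 0xB8, 0xAD, 0xF0, 0x9F, 0x8F, 0x80}
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		fmt.Printf("%c decoded from %d byte(s)\n", r, size)
		b = b[size:]
	}
}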

This encoding rule gives UTF-8 an important property: no character’s encoding can appear inside the encoding of another character, or inside any sequence of them. For instance, an ASCII byte of the form 0xxxxxxx can never occur in the middle of a multi-byte sequence like 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx, or 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, because continuation bytes always start with 10. This makes decoding reliable and unambiguous.

Disadvantage

UTF-8 comes with several advantages, including space-efficient representation of Unicode characters and seamless compatibility with ASCII. Its main drawback is the lack of direct indexing: because characters have variable widths, accessing the Nth character of a UTF-8 encoded string cannot be done in O(1) time. When a scenario demands frequent direct indexing, it may be worth re-encoding the string into a fixed-length form such as UTF-32, trading memory for faster random access.

Some languages also provide convenient character-level operations directly on UTF-8 data; Go, for example, iterates over a string rune by rune with range. Keep in mind, though, that different languages use different internal encodings (Java’s String API, for instance, works in UTF-16 code units, which is why the emoji example near the top of this article prints “?”). When working with complex string encoding scenarios, carefully reading the documentation is the best way to avoid unexpected “weird behaviors.”

zodiacs := "🐁🐃🐅🐈🐉🐍🐎🐐🐒🐓🐕🐖"
for i, r := range zodiacs {
	fmt.Printf("%d\t%q\t%d\n", i, r, r)
}

/* Output

0 '🐁' 128001
4 '🐃' 128003
8 '🐅' 128005
12 '🐈' 128008
16 '🐉' 128009
20 '🐍' 128013
24 '🐎' 128014
28 '🐐' 128016
32 '🐒' 128018
36 '🐓' 128019
40 '🐕' 128021
44 '🐖' 128022

*/
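If a program really does need frequent random access by character index, one option in Go (a sketch, not the only approach) is to pay the O(n) conversion to []rune once and then index in O(1):

package main

import "fmt"

func main() {
	// One up-front O(n) conversion to code points...
	zodiacs := []rune("🐁🐃🐅🐈🐉🐍🐎🐐🐒🐓🐕🐖")
	// ...then O(1) access by character index.
	fmt.Printf("%c\n", zodiacs[4]) // 🐉
}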

If you find this article helpful, please motivate me with one clap. You can also check out my other articles at https://medium.com/@briannqc and connect with me on LinkedIn. Thanks a lot for reading!


Brian NQC

Follow me for content about Golang, Java, Software Architecture and more