Everything that you need to know about Unicode and UTF-8

Dhruvil Dave
8 min read · May 26, 2022


It looks the same! But is it the same?

Here is a question: will `kāśī == kāśī` return True or False? Stop and think about it for a moment.

Why ask whether two strings that look the same are the same?

Well, they both look the same, so the answer must be True. And indeed it is.

x = "kāśī"
y = "kāśī"
print(x == y)

True

But there’s a catch

x = "kāśī"
y = "kāśī"
print(x == y)

False

Python shell comparing two strings

So what just happened? Examining both the strings, we get an interesting insight.

x = "kāśī"
print(len(x))

4

y = "kāśī"
print(len(y))

7

Let’s make things more concrete by considering just a single glyph.

x = "ā"
print(len(x))

1

y = "ā"
print(len(y))

2

The two strings render identically yet have different lengths. How is that possible?

To answer this question, we need to dive into character encodings, Unicode, and UTF-8.

Strings to Bytes and Bytes to String

We need to convert the strings to bytes to see the real difference. Python `str` objects have an `encode` method to convert `str` to `bytes`. Similarly, `bytes` objects have a `decode` method to convert `bytes` back to `str`.

print(x.encode())

b'\xc4\x81'

print(y.encode())

b'a\xcc\x84'

So here’s where the fun begins. The two strings differ at the byte level. So when we loop over the strings, we find that the actual characters in them differ.

for i in x:
print(i)

ā

for i in y:
print(i)

a
̄

But when we decode the bytes back to str, we get the strings as-is.

print(b"\xc4\x81".decode())

ā

print(b"a\xcc\x84".decode())

ā

Okay, this is confusing and doesn’t quite make sense yet. Let’s jot down the confusing points:

  1. Both strings look the same.
  2. Both strings have different underlying representations.
  3. Why are we looking at the bytes instead of directly working with str?
  4. Why do we get different bytes for the same string?
  5. How can two strings be different but look the same?
  6. Why did we get True earlier but False later?

Character Encodings

The answer to all the above questions lies in what we call character encoding. Fundamentally, computers deal with numbers. More specifically, they deal only with the binary digits 0 and 1, known as bits, grouped together into bytes of 8 bits. So we are left with the seemingly simple task of assigning a unique number to each character, essentially creating a one-to-one mapping between characters and numbers.

ASCII (American Standard Code for Information Interchange) was standardized by the American Standards Association (the predecessor of ANSI). It encodes 128 characters using 7-bit numbers from 0x00 to 0x7F. “A” is assigned 65 (0x41); “0” is assigned 48 (0x30). In ASCII, each byte represents one character, so a string of length n requires n bytes of space. Both the str.encode and bytes.decode methods take an encoding parameter to select which codec is used.

help(str.encode)

# Help on method_descriptor:
#
# encode(self, /, encoding='utf-8', errors='strict')
#     Encode the string using the codec registered for encoding.
#
#     encoding
#         The encoding in which to encode the string.
#     errors
#         The error handling scheme to use for encoding errors.
#         The default is 'strict' meaning that encoding errors raise a
#         UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
#         'xmlcharrefreplace' as well as any other name registered with
#         codecs.register_error that can handle UnicodeEncodeErrors.

help(bytes.decode)

# Help on method_descriptor:
#
# decode(self, /, encoding='utf-8', errors='strict')
#     Decode the bytes using the codec registered for encoding.
#
#     encoding
#         The encoding with which to decode the bytes.
#     errors
#         The error handling scheme to use for the handling of decoding errors.
#         The default is 'strict' meaning that decoding errors raise a
#         UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
#         as well as any other name registered with codecs.register_error that
#         can handle UnicodeDecodeErrors.

print(" ".join((f"{hex(i)}" for i in "PYTHON".encode())))

0x50 0x59 0x54 0x48 0x4f 0x4e

print(" ".join((f"{i:08b}" for i in "PYTHON".encode())))

01010000 01011001 01010100 01001000 01001111 01001110

This works very elegantly, but only as long as no characters beyond the English alphabet are needed. There was no way to represent the characters of other languages in that system, so other countries developed their own encodings, and the mappings are more or less arbitrary. A German system could assign “Ä” to 65 (0x41) and it would work as expected on German machines.

A stream of bytes means nothing in isolation; it requires an encoding to be parsed correctly. Byte 65 under ASCII can be something completely different from byte 65 under CP1258. Matters worsen when languages like Chinese are considered, which have over 70,000 characters and cannot possibly be stored using 1 byte each.

print(0xe3)

227

print(b"\xe3".decode("cp720")) # Arabic

ع

print(b"\xe3".decode("cp1255")) # Hebrew

ד

print(b"\xe3".decode("cp1258")) # Vietnamese

ă

print(b"\xe3".decode("latin_1")) # Western Europe

ã

The full list of codecs supported by Python is listed here: https://docs.python.org/3/library/codecs.html

Unicode

The problems listed above were solved by the creation of the Unicode standard, which over the years has gone on to incorporate characters from a huge number of writing systems. To be precise, Unicode encodes the scripts of languages rather than the languages themselves, making it agnostic to the underlying language.

In order to really appreciate the genius and simplicity of the system, we need to take a closer look.

The first task is to gather all the characters and assign them unique numbers sequentially. This assigned number is known as the Unicode code point. The code point of a character can be queried with the ord function. The interesting thing to note here is that Unicode maintains compatibility with ASCII, since ASCII was used almost ubiquitously in English-speaking countries; thus the ASCII value is equal to the Unicode code point for those characters. ASCII ends at 127, while Unicode has room for 1,114,112 code points (up to 0x10FFFF).

help(ord)

# Help on built-in function ord in module builtins:
#
# ord(c, /)
#     Return the Unicode code point for a one-character string.

print(ord("A"))

65

print(ord("ã"))

227

print(ord("ñ"))

241

Similarly, chr is the inverse function of ord and it returns the character associated with the particular code point.

help(chr)

# Help on built-in function chr in module builtins:
#
# chr(i, /)
#     Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.

print(chr(3746))

print(chr(22472))

print(chr(65))

A

print(chr(2325))

क

This solves a major problem of portability: systems that support Unicode no longer have to worry about wrong interpretation and parsing, because each code point is uniquely associated with one character, leaving no ambiguity. Note specifically that a code point is just the unique integer assigned to a character; it says nothing about how the bytes are arranged in memory. Unicode supports a huge variety of characters, including accents, signs, and emojis.

So in the above case, “ā” can be input in two ways:

  1. A single precomposed character “ā” (U+0101)
  2. The character “a” followed by a combining macron (U+0304), which together render as “ā”

The two are different strings, but fonts render them identically. Such a user-perceived character, formed by a sequence of one or more code points, is loosely called a glyph (Unicode’s own term is a grapheme cluster). Thus kāśī can be formed from several different combinations of code points; they all render the same but are completely different strings. To make things more concrete, here is an example in the devanāgarī script: "की" appears to be a single character, but it is in fact formed by combining the two characters "क" (2325) and “ी” (2368).
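The devanāgarī example can be checked directly in Python; a minimal sketch using the code points 2325 and 2368 quoted above:

```python
# Combine क (U+0915) with the dependent vowel sign ी (U+0940)
ka = chr(2325)        # क DEVANAGARI LETTER KA
ii = chr(2368)        # ी DEVANAGARI VOWEL SIGN II
combined = ka + ii
print(combined)       # renders as की
print(len(combined))  # 2 -- two code points behind one visible glyph
```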

So the answer to the question “Is kāśī == kāśī?” is neither True nor False but “Maybe!”

There is a way to solve this problem: always normalize the text first with the unicodedata.normalize function. There are four available normal forms. The ones that end in C (NFC, NFKC) produce a composed form, while the ones that end in D (NFD, NFKD) produce a decomposed form; the K variants additionally apply compatibility decomposition. For comparisons it doesn’t matter which one we use, as long as it is used uniformly throughout the entire process.

import unicodedata

help(unicodedata.normalize)

# Help on built-in function normalize in module unicodedata:
#
# normalize(form, unistr, /)
#     Return the normal form 'form' for the Unicode string unistr.
#
#     Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

x = "kāśī"
y = "kāśī"
print(x == y)

False

print(unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y))

True

print(unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y))

True

print(len(unicodedata.normalize("NFC", y)))

4

print(len(unicodedata.normalize("NFD", x)))

7
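To see what the normal forms actually contain, unicodedata.name can spell out every code point; a small sketch using the composed “ā” (U+0101):

```python
import unicodedata

# List the code points behind each normal form of the composed "ā" (U+0101)
for form in ("NFC", "NFD"):
    s = unicodedata.normalize(form, "\u0101")
    for ch in s:
        print(form, f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC U+0101 LATIN SMALL LETTER A WITH MACRON
# NFD U+0061 LATIN SMALL LETTER A
# NFD U+0304 COMBINING MACRON
```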

The answer is “Maybe!”

UTF-8

The real engineering starts here. Once we assign a number to a character, how do we represent it in bytes? In ASCII, every character takes up exactly 1 byte. Since Unicode assigns code points as high as 0x10FFFF, a fixed-width scheme would need 4 bytes (32 bits) per character.

That would be an absolute waste of space: suddenly every file is 4 times its original size, and the data is full of long runs of zeros. For example, the ASCII representation of “A” is 01000001; under a fixed 32-bit scheme it becomes 00000000 00000000 00000000 01000001. Worse, many systems treat an ASCII NUL byte (00000000) as the end of a transmission. Also, just as with ASCII, we cannot simply concatenate the raw binary representations of the code points: since a code point can be arbitrarily small or large, there would be no way to know the boundary of a character, as the system would not know how many bytes to read before parsing them into a character.
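To see the waste concretely, we can compare a fixed 32-bit encoding against UTF-8; a sketch using Python’s built-in utf-32-be codec (big-endian, no byte-order mark):

```python
text = "PYTHON"

# UTF-8 needs one byte per ASCII character
print(len(text.encode("utf-8")))      # 6

# A fixed 32-bit encoding needs four bytes per character...
print(len(text.encode("utf-32-be")))  # 24

# ...three of which are NUL bytes for ASCII text
print(text.encode("utf-32-be")[:4])   # b'\x00\x00\x00P'
```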

So a system was needed that stores the binary value of each code point in the minimum number of bytes required, that encodes how much further to read without any extra mechanism (which would waste both space and time), that stays compatible with ASCII, and that never produces a NUL byte unless one is explicitly required. To tackle all these issues, a new encoding was created, known as the Unicode Transformation Format, 8-bit, or simply UTF-8.

The UTF-8 encoding works as follows:

  • If the code point fits in 7 bits, a single byte is used and its first bit is set to 0 (exactly as in ASCII).
  • If the character needs more than one byte, the first byte starts with as many 1 bits as the total number of bytes in the sequence, followed by a 0. All the remaining bytes start with the bits 10.
  • All the remaining bits are filled with the binary representation of the code point, left-padded with the necessary number of 0 bits.
Table of Unicode code points to bytes

This simple yet ingenious trick means that a reader only has to inspect a byte’s leading bits to know whether it is a header byte or a continuation byte, and scan to the next header byte to find a character boundary. Nowhere in this sequence will a NUL byte ever appear unless explicitly sent. The encoding is backward compatible with ASCII, so all ASCII text is automatically valid UTF-8. Examples of how code points are converted to bytes are listed in the UTF-8 article on Wikipedia. If we know the hexadecimal code point, we can get the character.
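As a sketch of the scheme above, the payload bits of a multi-byte character can be stitched back together by hand; here for the 3-byte character “序” (U+5E8F):

```python
# Recover the code point of "序" from its UTF-8 bytes by hand
encoded = "序".encode("utf-8")    # b'\xe5\xba\x8f'
payload = ""
for b in encoded:
    bits = f"{b:08b}"
    if bits.startswith("1110"):   # header byte of a 3-byte sequence: keep 4 payload bits
        payload += bits[4:]
    elif bits.startswith("10"):   # continuation byte: keep 6 payload bits
        payload += bits[2:]
print(hex(int(payload, 2)))       # 0x5e8f
```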

print("\u00fe") # U+FE

þ

print("\u0915") # U+915

क

print("\u5e8f") # U+5E8F

序

print("\U0001f925") # U+1f925

🤥

One can get bytes from str and get str from bytes.

print("序".encode("utf-8"))

b'\xe5\xba\x8f'

print("पूर्व्यांश".encode("utf-8"))

b'\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x82\xe0\xa4\xb6'

print(
b"\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x82\xe0\xa4\xb6".decode(
"utf-8"
)
)

पूर्व्यांश

print("🌹".encode("utf-8"))

b'\xf0\x9f\x8c\xb9'

print(b"\xf0\x9f\x8c\xb9".decode("utf-8"))

🌹

print("Python3".encode("ascii").decode("utf-8"))

Python3

print("Python3".encode("utf-8").decode("ascii"))

Python3
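Note that the compatibility is one-way: ASCII bytes are always valid UTF-8, but UTF-8 bytes containing non-ASCII characters cannot be decoded as ASCII. A quick sketch:

```python
# Non-ASCII UTF-8 bytes fail to decode as ASCII
try:
    "序".encode("utf-8").decode("ascii")
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True -- b'\xe5\xba\x8f' contains bytes above 0x7F
```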

Published at Fri May 27 12:38:07 AM IST 2022 by https://www.kaggle.com/dhruvildave
