Everything that you need to know about Unicode and UTF-8
It looks the same! But is it the same?
Here is a question: will `kāśī == kāśī` return True or False? Stop and think about it for a moment.
Well, they both look the same, so the answer must be True. And indeed it is.
x = "kāśī"
y = "kāśī"
print(x == y)
True
But there’s a catch
x = "kāśī"
y = "kāśī"
print(x == y)
False
So what just happened? Examining both the strings, we get an interesting insight.
x = "kāśī"
print(len(x))
4
y = "kāśī"
print(len(y))
7
Let’s make things more concrete by considering just a single glyph.
x = "ā"
print(len(x))
1
y = "ā"
print(len(y))
2
The two strings have different lengths. But how is that possible?
To answer this question, we need to dive into character encodings, Unicode, and UTF-8.
Strings to Bytes and Bytes to String
We need to convert the strings to bytes to see the real difference. Python `str` has the `encode` method to convert `str` to `bytes`. Similarly, `bytes` has the `decode` method to convert `bytes` to `str`.
print(x.encode())
b'\xc4\x81'
print(y.encode())
b'a\xcc\x84'
So here’s where the fun begins. The underlying bytes of the two strings differ. So when we loop over the strings, we find that the actual characters in the strings differ.
for i in x:
print(i)
ā
for i in y:
print(i)
a
̄
But when we decode the `bytes` back to `str`, we get the strings as-is.
print(b"\xc4\x81".decode())
ā
print(b"a\xcc\x84".decode())
ā
Okay, this isn’t very clear yet. Let’s jot down the confusing points:
- Both strings look the same.
- Both strings have different underlying representations.
- Why are we looking at the `bytes` instead of directly working with `str`?
- Why do we get different bytes for the same string?
- How can two strings be different but look the same?
- Why did we get `True` earlier but `False` later?
Character Encodings
The answer to all the above questions lies in what we call character encoding. Fundamentally, computers deal with numbers. More specifically, they deal just with binary numbers 0 and 1 known as bits grouped together into a byte of 8 bits. So we are left with the simple task of assigning a unique number to each character essentially creating a one-to-one mapping between each character and a number.
ASCII (American Standard Code for Information Interchange), standardized by ANSI (American National Standards Institute), encodes 128 characters using a 7-bit number from 0x00 to 0x7F. “A” is assigned 65 (0x41); “0” is assigned 48 (0x30). In ASCII, each byte represents one character, so a string of length n requires n bytes of space. Both `str.encode` and `bytes.decode` take an `encoding` parameter to select which mapping to use.
help(str.encode)
# Help on method_descriptor:
# encode(self, /, encoding='utf-8', errors='strict')
# Encode the string using the codec registered for encoding.
#
# encoding
# The encoding in which to encode the string.
# errors
# The error handling scheme to use for encoding errors.
# The default is 'strict' meaning that encoding errors raise a
# UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
# 'xmlcharrefreplace' as well as any other name registered with
# codecs.register_error that can handle UnicodeEncodeErrors.

help(bytes.decode)
# Help on method_descriptor:
# decode(self, /, encoding='utf-8', errors='strict')
# Decode the bytes using the codec registered for encoding.
# encoding
# The encoding with which to decode the bytes.
# errors
# The error handling scheme to use for the handling of decoding # errors.
# The default is 'strict' meaning that decoding errors raise a
# UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
# as well as any other name registered with codecs.register_error that
# can handle UnicodeDecodeErrors.

print(" ".join((f"{hex(i)}" for i in "PYTHON".encode())))
0x50 0x59 0x54 0x48 0x4f 0x4e
print(" ".join((f"{i:08b}" for i in "PYTHON".encode())))
01010000 01011001 01010100 01001000 01001111 01001110
This works very elegantly, but only as long as there are no characters beyond the English alphabet. There was no way of representing characters of other languages in that system, so other countries developed their own encodings. The mappings are more or less arbitrary: a German system could assign “Ä” to 65 (0x41) and it would work as expected on that machine.
A stream of bytes means nothing in isolation; it requires an encoding to be parsed correctly. The byte 0xE3 under one code page is a completely different character under another. Matters worsen with languages like Chinese, which has over 70,000 characters and cannot be stored using 1 byte per character.
print(0xe3)
227
print(b"\xe3".decode("cp720")) # Arabic
ع
print(b"\xe3".decode("cp1255")) # Hebrew
ד
print(b"\xe3".decode("cp1258")) # Vietnamese
ă
print(b"\xe3".decode("latin_1")) # Western Europe
ã
The full list of codecs supported by Python is listed here: https://docs.python.org/3/library/codecs.html
Unicode
The problems listed above were solved by the formation of the Unicode standard which would go on to incorporate characters from a lot of languages over the years. To be precise, Unicode encodes scripts of languages rather than languages themselves thus making it agnostic to the underlying language.
In order to really appreciate the genius and simplicity of the system, we need to take a closer look.
The first task is to gather all the characters and assign unique numbers sequentially. The number assigned is known as a Unicode code point. The code point of a character can be queried with the `ord` function. The interesting thing to note here is that Unicode maintains compatibility with ASCII, since ASCII was used almost ubiquitously in English-speaking countries; thus the ASCII value is equal to the Unicode code point for those characters. ASCII ends at 127, while Unicode has room for over a million code points.
help(ord)
# Help on built-in function ord in module builtins:
# ord(c, /)
# Return the Unicode code point for a one-character string.

print(ord("A"))
65
print(ord("ã"))
227
print(ord("ñ"))
241
Similarly, `chr` is the inverse function of `ord`: it returns the character associated with a particular code point.
help(chr)
# Help on built-in function chr in module builtins:
# chr(i, /)
# Return a Unicode string of one character with ordinal i; 0 <= i <= 0x10ffff.

print(chr(3746))
ຢ
print(chr(22472))
埈
print(chr(65))
A
print(chr(2325))
क
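Since `chr` and `ord` are inverses of each other, a quick sanity check (characters chosen arbitrarily):

```python
# chr(ord(c)) should reproduce any one-character string c
for ch in ("A", "ñ", "क", "序", "🤥"):
    assert chr(ord(ch)) == ch
print("round trip ok")
```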
This solves a major problem of portability: systems that support Unicode no longer have to worry about wrong interpretation and parsing, since each code point is uniquely associated with a character and there are no ambiguities. Note specifically that a code point is just a unique integer assigned to a character; it says nothing about how the bytes are arranged in memory. Unicode supports a huge variety of characters, including accents, signs, and emojis.
So in the above case, “ā” can be input in two ways:
- A single character “ā”
- A single character “a” followed by an overbar character to form “ā”
Both of them are different strings, but fonts render them identically. Such an abstract character, formed by a sequence of one or more code points, is properly called a grapheme (the rendered shape is the glyph). Thus `kāśī` can be formed with several different combinations of code points; they all render the same but are completely different strings. To make things more concrete, here is an example in the devanāgarī script: "की" appears to be a single character, but it is in fact formed by the combination of two characters, "क" (2325) and “ी” (2368).
So the answer to the question ‘Is `kāśī == kāśī`?’ is neither True nor False but "Maybe!"
There is a way to solve this problem: always first use the `unicodedata.normalize` function to normalize the text. There are four available forms: NFC, NFD, NFKC, and NFKD. The ones that end in C produce a composed form, while the ones that end in D produce a decomposed form. It doesn’t matter which one we use as long as it is used uniformly throughout the entire process.
import unicodedata

help(unicodedata.normalize)
# Help on built-in function normalize in module unicodedata:
# normalize(form, unistr, /)
# Return the normal form 'form' for the Unicode string unistr.
# Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

x = "kāśī"
y = "kāśī"
print(x == y)
False
print(unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y))
True
print(unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y))
True
print(len(unicodedata.normalize("NFC", y)))
4
print(len(unicodedata.normalize("NFD", x)))
7
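Of the four forms, the ones with K (“compatibility”) go one step further and also fold compatibility characters such as ligatures; for example, the single code point “ﬁ” (U+FB01):

```python
import unicodedata

lig = "\ufb01"  # the single-code-point ligature 'fi'
print(unicodedata.normalize("NFC", lig) == lig)  # True: NFC leaves it alone
print(unicodedata.normalize("NFKC", lig))        # folded to the two letters 'fi'
```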
UTF-8
The real engineering starts here. Once we assign a number to a character, how do we represent it in bytes? Going back to ASCII, a 7-bit encoding stored in a single byte, every character takes up 1 byte. Since Unicode stores code points as high as a million, we require around 4 bytes. One easy way would be to encode everything using 32 bits, but this would be an absolute waste of space: suddenly, all files are 4 times their original size, and there would be long runs of zero bytes. For example, the ASCII representation of “A” is 01000001; in a fixed 32-bit scheme it would be 00000000 00000000 00000000 01000001. Many systems treat an ASCII NUL (00000000) as the end of transmission. Also, just like with ASCII, we cannot directly use the raw binary of the code point, because a number can be arbitrarily small or large and there would be no way to know the boundary of a character: the system would not know how many bytes to read before parsing them into a character.
So a system was needed that would store the binary values of code points using the minimum number of bytes required, and would also encode how much further to read without needing any extra mechanism, which would be a waste of both space and time. At the same time, it had to be compatible with ASCII encoding. Along with this, there should never be a series of NUL bytes unless explicitly required. To tackle all these issues, a new encoding was created, known as the Unicode Transformation Format - 8-bit, or simply UTF-8.
The UTF-8 encoding works as follows:
- If the character needs only one byte, the first bit is set to 0.
- If the character needs more than one byte, the first byte starts with as many 1 bits as the total number of bytes used to encode the code point, followed by a 0. All the remaining bytes start with the bits 10.
- The remaining free bits are filled with the binary representation of the code point, padded with the necessary number of leading 0s.
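The rules above can be sketched as a toy encoder (a simplification for illustration only: it skips the validation that real codecs perform, such as rejecting surrogate code points):

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack a code point into bytes following the UTF-8 rules above."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | code_point >> 6,
                      0x80 | code_point & 0x3F])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | code_point >> 12,
                      0x80 | code_point >> 6 & 0x3F,
                      0x80 | code_point & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | code_point >> 18,
                  0x80 | code_point >> 12 & 0x3F,
                  0x80 | code_point >> 6 & 0x3F,
                  0x80 | code_point & 0x3F])

print(utf8_encode(ord("A")))   # matches "A".encode()
print(utf8_encode(ord("序")))  # matches "序".encode()
```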
This simple yet ingenious trick means a reader can tell from any single byte whether it is a leading byte or a continuation byte, so finding the next character boundary only requires scanning for the next leading byte. Nowhere in this sequence will we ever have a NUL byte unless one is sent explicitly. The encoding is backward compatible with ASCII, so all ASCII text is automatically valid UTF-8. Examples of how code points are converted to bytes are listed in the Wikipedia article on UTF-8. If one knows the hexadecimal code point, one can get the character directly:
print("\u00fe") # U+FE
þ
print("\u0915") # U+915
क
print("\u5e8f") # U+5E8F
序
print("\U0001f925") # U+1f925
🤥
One can get `bytes` from `str` and `str` from `bytes`.
print("序".encode("utf-8"))
b'\xe5\xba\x8f'
print("पूर्व्यांश".encode("utf-8"))
b'\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x82\xe0\xa4\xb6'
print(
b"\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x82\xe0\xa4\xb6".decode(
"utf-8"
)
)
पूर्व्यांश
print("🌹".encode("utf-8"))
b'\xf0\x9f\x8c\xb9'
print(b"\xf0\x9f\x8c\xb9".decode("utf-8"))
🌹
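The self-synchronizing property described earlier can be checked directly: any byte of the form 10xxxxxx is a continuation byte, so the character boundaries are exactly the positions whose byte does not match that pattern (a small sketch):

```python
def char_boundaries(data: bytes) -> list[int]:
    # a byte starting with the bits 10 continues a character;
    # every other byte begins one
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

print(char_boundaries("序a🌹".encode("utf-8")))  # [0, 3, 4]
```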
print("Python3".encode("ascii").decode("utf-8"))
Python3
print("Python3".encode("utf-8").decode("ascii"))
Python3
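The reverse direction above only works because “Python3” is pure ASCII; any byte above 0x7F fails to decode as ASCII:

```python
try:
    "序".encode("utf-8").decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)
```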
Published at Fri May 27 12:38:07 AM IST 2022 by https://www.kaggle.com/dhruvildave