An Introduction to Text Encoding

All about ASCII, Unicode and UTF-8

Kah Ho
May 10, 2024

What is Text Encoding?

Text encoding is the conversion of human-readable plain text into a format that digital applications can process more easily. Whether it is the messages on your smartphone or the words on any website you visit, before being displayed on your screen they existed as binary data, in other words, 1s and 0s. The way your browser or computer converts these 1s and 0s into human-readable text is decided by the encoding standard in use, and the same binary data can be interpreted as different words under different encodings. Modern text encoding standards have allowed languages other than English to be processed and displayed by your devices, and the innovation of emojis would not have been possible without them.

What was once the industry standard for text encoding has been replaced several times by newer standards, driven by an increasingly globalized world with more languages and system requirements to accommodate. Still, the most prominent standards to have garnered widespread adoption in modern history are ASCII, Unicode and subsequently UTF-8, which has become the dominant encoding used today.

ASCII

ASCII, short for the American Standard Code for Information Interchange, first published in 1963, was the first text encoding standard to be widely adopted as the industry standard. ASCII has since undergone several revisions, but at its core each ASCII character occupies 8 bits (also known as a byte) of storage. Given the constraint of 8 bits, the range of binary sequences runs from 0000 0000 to 1111 1111, or 0 to 255, so you might expect a maximum of 256 representable characters. However, ASCII is in fact a 7-bit code and does not use the most significant bit, so its range only goes from 0000 0000 to 0111 1111, or 0 to 127.
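
As a quick illustration in plain Python (no extra libraries needed), ord() returns a character's code, and every ASCII code fits comfortably within 7 bits, so the top bit of the byte is always 0:

# Each ASCII character maps to a code from 0 to 127; shown as an
# 8-bit byte, the most significant bit is always 0.
for ch in "A", "a", "0", "~":
    print(ch, ord(ch), format(ord(ch), "08b"))
# A 65 01000001
# a 97 01100001
# 0 48 00110000
# ~ 126 01111110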

The reason the most significant bit is not used is that at the inception of ASCII, devices such as teleprinters handled characters of at most 7 bits, so limiting the code to 7 bits allowed for better compatibility between digital devices. You may then wonder why characters are nevertheless stored in 8-bit bytes. This is attributed to a deliberate design consideration: an 8-bit byte can hold two binary-coded decimal digits, allowing for more efficient numeric encoding, and the additional bit also opens up the possibility of a parity bit used for error checking.
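
Parity is not part of the ASCII code itself, but as a rough sketch of the idea, the spare eighth bit can be set so that every transmitted byte carries an even number of 1-bits, letting the receiver spot any single flipped bit:

# A minimal sketch of even parity: pack a 7-bit ASCII code into 8 bits,
# using the most significant bit so the total count of 1-bits is even.
def with_even_parity(code: int) -> int:
    assert 0 <= code <= 0x7F, "expects a 7-bit ASCII code"
    parity = bin(code).count("1") % 2   # 1 if the 7-bit code has an odd number of 1s
    return (parity << 7) | code

print(format(with_even_parity(ord("a")), "08b"))  # 11100001 (four 1-bits in total)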

However, even among these 128 codes supported by ASCII, only 95 are printable characters, as the codes from 0 to 31 inclusive, together with code 127, do not represent text. Rather, these 33 codes represent control characters used by peripheral devices such as keyboards or printers to perform specific actions. For instance, as shown in the table below, the ASCII character with code 8 (0000 1000) represents the control character ‘BS’, the familiar backspace. ASCII code 32 (0010 0000), denoted ‘SP’, represents the space character and is the first printable character in the ASCII sequence, while code 126 (0111 1110), ‘~’, the tilde, is the last printable character. Code 127 (0111 1111), ‘DEL’, is also a control character, not to be mistaken for a printable one.
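
These boundaries are easy to confirm in Python, whose isprintable() check happens to agree with the ASCII table for these codes:

# Control versus printable codes at the edges of the ASCII table.
print(repr(chr(8)))    # '\x08' -> BS, backspace (control)
print(repr(chr(32)))   # ' '    -> SP, the first printable character
print(repr(chr(126)))  # '~'    -> the last printable character
print(repr(chr(127)))  # '\x7f' -> DEL (control)

# Codes 32 through 126 inclusive: 95 printable characters in total.
print(sum(chr(i).isprintable() for i in range(128)))  # 95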

ASCII Chart (Wikipedia)

Unicode

As you may have realized, the ASCII standard only covers unaccented English letters, digits and basic punctuation, not other languages. This became the main driving force for the inception and widespread adoption of the next text encoding standard: the Unicode Standard. Similar to ASCII, the Unicode system assigns a unique code point to each character, but the characters are no longer limited to the range of 0 to 127, which allows characters from other languages to be part of this ecosystem of digital text processing.

To represent Unicode characters, instead of using decimal values directly like the ASCII system, a hexadecimal (or hex) string is used. For some context, hexadecimal uses the characters 0–9 followed by A–F to represent the values 0 to 15, where 0 maps to 0 and F maps to 15. Each hex digit represents 4 bits of data, so 2 hex digits (1 byte) are sufficient to represent any ASCII character.
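
A quick illustration of the hex-to-bits relationship, again in plain Python:

# One hex digit carries 4 bits, so two hex digits cover a full byte.
print(int("F", 16))              # 15, the largest single hex digit
print(format(0xF, "04b"))        # 1111
print(format(ord("a"), "02X"))   # 61 -> two hex digits for an ASCII character
print(format(0x61, "08b"))       # 01100001 -> the same value as 8 bits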

Unicode characters are written in the format ‘U+0123’, where the leading ‘U+’ denotes a Unicode code point, and the subsequent 4 or more hex digits give its value. For instance, ‘U+A041’ would equate to a code point with the binary value ‘1010 0000 0100 0001’ (remember, hex digit A is equivalent to the decimal value 10). By design, the first 128 code points of the Unicode system map to the same 128 code points of the ASCII system, so the lowercase character ‘a’, with a code point of 97 in both, is represented as binary ‘0110 0001’ in ASCII and written as ‘U+0061’ in Unicode.

Due to Unicode’s more expansive design, 1,114,112 code points (U+0000 through U+10FFFF) can now be mapped to specific characters, as opposed to ASCII’s 128, allowing characters from virtually every language to be represented. For instance, if I were to represent the Chinese character for ‘me’, which is ‘我’, as Unicode, it would use ‘U+6211’.
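
Python exposes code points directly through ord() and chr(), which makes these mappings easy to verify:

# ord() gives a character's code point, chr() converts back.
print(hex(ord("a")))    # 0x61   -> written as U+0061
print(hex(ord("我")))   # 0x6211 -> written as U+6211
print(chr(0x6211))      # 我

# The Unicode code space runs from U+0000 to U+10FFFF:
print(0x10FFFF + 1)     # 1114112 code points in total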

UTF-8

Next came the Unicode Transformation Format - 8-bit, or UTF-8. You may now wonder why a new standard is even necessary if Unicode already solves the issue of multilingualism. Well, Unicode only assigns code points to characters; it does not specify how those code points should be stored or transmitted as bytes. Left to themselves, different systems could serialize the same code points in different ways and, by implication, use different amounts of memory to store the same information. Following the UTF-8 standard means representing each Unicode code point as a sequence of 8-bit units. Similarly, UTF-16 and UTF-32 standards also exist, so what makes UTF-8 specifically the most popular standard to follow?

The biggest reason for its mass adoption can be attributed to backwards compatibility and memory efficiency. The UTF-8 standard works in units of 1 byte, and its first 128 codes are byte-for-byte identical to ASCII, so any valid ASCII text is already valid UTF-8; this makes it very compatible with systems that still use ASCII for text encoding. In terms of memory, common characters need only 1 byte each, as opposed to a minimum of 2 and 4 bytes under UTF-16 and UTF-32 respectively, which means memory is used more efficiently and redundant bytes are avoided in most cases.
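
Python's str.encode makes the size difference easy to see; this snippet compares the three encoding forms on a short ASCII string (the -be variants simply fix the byte order and omit the byte-order mark):

# The same text under the three common Unicode encoding forms.
text = "hello"
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(enc)
    print(enc, len(data), data.hex(" "))
# utf-8      5  68 65 6c 6c 6f  <- identical to the ASCII bytes
# utf-16-be 10  00 68 00 65 00 6c 00 6c 00 6f
# utf-32-be 20  00 00 00 68 00 00 00 65 ...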

Now, what if we use Unicode characters whose code points lie outside the single-byte range that UTF-8 covers by default? Would this still be possible? The answer is yes. A character such as this laughing-with-tears emoji 😂 has the code point U+1F602, which has 5 hex digits instead of the usual 4. For cases such as this, UTF-8 simply uses a longer byte sequence, growing in increments of 1 byte up to a total of 4 bytes for such characters. Drawing a comparison, the UTF-16 standard encodes characters with code points outside the 16-bit range as a pair of 2-byte units (a surrogate pair), since it can only grow in increments of 2 bytes.
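
The same comparison can be made for the emoji itself; with nothing beyond the standard string methods, the 4-byte UTF-8 sequence and the UTF-16 surrogate pair are both visible:

# U+1F602 is outside the single-byte UTF-8 range and outside the 16-bit
# range of UTF-16, so both encodings need 4 bytes here.
emoji = "\U0001F602"                       # 😂
print(emoji.encode("utf-8").hex(" "))      # f0 9f 98 82 (four 1-byte units)
print(emoji.encode("utf-16-be").hex(" "))  # d8 3d de 02 (one surrogate pair)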

With most of the internet’s web pages and modern applications using English, characters that can be represented with just 1 byte occur far more often than characters that need 2 bytes or more. As such, most digital systems adopt the UTF-8 standard because it offers near-optimal memory efficiency in the common case, whilst still providing support for other languages when necessary.

Limitations of Modern Text Encoding Standards

Despite the shifts in text encoding standards to meet ever-changing demands, modern text encoding still faces limitations. Most prominently, if different applications do not conform to the same text encoding standard, you will likely face rendering issues. If you are an avid internet user, you may have come across error characters such as ‘□’ (U+25A1) or ‘�’ (U+FFFD). These are Unicode characters that stand in for missing, unsupported, or invalid characters. Such errors can come from corrupted files or browser compatibility issues, but the most common cause is a mismatch in text encoding standards, where text written out under one encoding is read back under another. The main way to mitigate these errors is for developers to be careful when building multi-component applications and ensure that the same standard is used throughout.
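
As a small illustration of what such a mismatch looks like in practice (the word ‘café’ here is just an arbitrary example):

# Writing with one encoding and reading with another garbles the text
# or produces the U+FFFD replacement character.
data = "café".encode("utf-8")          # the é becomes the two bytes c3 a9

print(data.decode("latin-1"))          # cafÃ© <- wrong decoder, garbled output
print(data[:-1].decode("utf-8", errors="replace"))  # caf� <- truncated sequence becomes U+FFFD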

The other flaw with UTF-8 systems is an inherent bias towards English-based applications. English characters can all be represented in a single byte, but the same cannot be said for other languages, especially those with accented characters such as é (as you may recognize from Pokémon). Unicode can represent accented characters either as a single precomposed code point (U+00E9 for é) or by combining a base character with a modifier, e (U+0065) followed by ◌́ (U+0301); either way, the character costs more than one byte in UTF-8. Because of this, texts composed of non-English characters usually take up more memory even when their character counts are the same, so a bias does exist for applications that use English. It is also because of this that, even for applications developed for non-English-speaking users, some developers still consciously decide to use English instead.
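
The difference between the two forms can be inspected with Python's standard unicodedata module; both strings below render as the same character, yet they differ in code points and in UTF-8 bytes:

import unicodedata

# é as a single precomposed code point (NFC) versus a base character
# plus a combining accent (NFD).
nfc = unicodedata.normalize("NFC", "é")   # 1 code point: U+00E9
nfd = unicodedata.normalize("NFD", "é")   # 2 code points: U+0065 U+0301

print(len(nfc), nfc.encode("utf-8").hex(" "))  # 1 c3 a9
print(len(nfd), nfd.encode("utf-8").hex(" "))  # 2 65 cc 81
print(nfc == nfd)  # False, even though both display as é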

However, despite these flaws, the standardization of text encoding has come a long way since the inception of ASCII, and much has been done in this space to promote inclusivity for users globally. As computing hardware and digital systems continue to improve, it is very much possible that the industry will begin to shift towards newer standards that meet changing demands, perhaps ones that place a greater emphasis on robustness or information security, as unlikely as that may seem today.
