Character Encoding: ASCII, Unicode, UTF-8

Yakup Cengiz
7 min read · Sep 6, 2023

Imagine a scenario where Bob wants to send a message to Alice, but their computers use different methods to represent letters. Bob’s computer uses the number 1 to represent the letter “A”, while Alice’s computer uses the number 2. This inconsistency means that if Bob sends Alice the message “Hello, Alice!”, Alice’s computer will not be able to understand it. The solution to this problem lies in using a character encoding standard. A character encoding standard is a way of representing letters, numbers, and other characters as numbers. This article will explain character encoding in terms of three basic concepts: ASCII, Unicode and UTF-8.

[Image: Bob and Alice communicating, illustrating the topic of the article]

ASCII and Code Points

ASCII (American Standard Code for Information Interchange) is one of the earliest character encoding standards in computing history. Developed in the 1960s, it uses 7 bits to represent characters, resulting in 128 possible combinations. These 128 code points map to various characters, including letters, numbers, punctuation, and control characters. However, only 95 of these code points represent printable characters, limiting ASCII’s scope. The basic idea of ASCII was to provide a common ground for character encoding in early computer systems.

In character encoding, a code point is a numeric value assigned to a specific character. In ASCII, each character corresponds to a specific code point. For example, ‘H’ is represented by code point 72, ‘E’ by code point 69, and so on. ASCII’s limited character set was sufficient for early computers but not for the world’s various writing systems.
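As a quick check, Python's built-in `ord()` and `chr()` convert between characters and their code points, matching the ASCII values above:

```python
# ord() maps a character to its code point; chr() goes the other way.
for ch in "HE":
    print(ch, ord(ch))   # H 72, E 69

print(chr(72))           # H
```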

The Rise of Unicode

As information technology expanded globally, a more comprehensive character encoding standard was needed. This led to the development of Unicode, a text encoding standard maintained by the Unicode Consortium. Unicode defines a large number of code points, officially denoted by hexadecimal numbers starting with ‘U+’. For example, the Unicode code point for ‘H’ is written as ‘U+0048’. Unicode can represent characters, scripts and symbols in a wide variety of languages, making it a global standard for character encoding.
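Python's `ord()` returns a character's code point directly, and formatting it in hexadecimal reproduces the 'U+' notation. The small `code_point` helper below is purely illustrative, not part of any standard library:

```python
def code_point(ch):
    # Format a character's code point in the standard 'U+XXXX' notation.
    return f"U+{ord(ch):04X}"

print(code_point("H"))   # U+0048
print(code_point("α"))   # U+03B1
print(code_point("😀"))  # U+1F600
```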


Code Pages

A code page is a character encoding that associates a set of printable characters and control characters with unique numbers, each number typically representing the value of a single byte. While ASCII and Unicode are the best-known character encoding standards, code pages act as mediators that bridge the gap between these standards and older systems.

Code pages provide a specific mapping of characters to numeric values, allowing computers to correctly interpret and display text within the limits of encoding schemes. Essentially, code pages act as translators, ensuring that characters are correctly rendered on systems that do not fully support Unicode.

Here are a few commonly used code pages:

Code Page 1252 (Windows-1252): This code page, also known as “Windows-1252” or “Western European,” is a superset of ISO 8859-1 (Latin-1) that adds printable characters, such as curly quotation marks and the euro sign, in the 0x80–0x9F range. It covers many Western European languages, such as English, French, Spanish, and German.

Code Page 1250 (Windows-1250): This code page, known as “Windows-1250” or “Central European,” is used for encoding characters in the Central European and Eastern European languages, including Polish, Czech, Hungarian, and Croatian.

Code Page 1254 (Windows-1254): “Windows-1254,” also called “Turkish,” is used for encoding characters in the Turkish language.
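To see code pages in action, here is a small sketch using Python's built-in `cp1252` and `cp1254` codecs. The Turkish word "Işık" (chosen here purely as an example) encodes cleanly under Windows-1254 but fails under Windows-1252, which has no 'ş' or dotless 'ı':

```python
text = "Işık"  # Turkish word with characters outside Western European code pages

# Windows-1254 (Turkish) maps every character in this word to a single byte.
print(text.encode("cp1254").hex())  # 49fefd6b

# Windows-1252 (Western European) cannot represent 'ş', so encoding fails.
try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("cp1252 cannot encode:", err.object[err.start])
```

The same text thus produces different bytes (or no bytes at all) depending on the code page, which is exactly the interoperability problem Unicode set out to solve.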

UTF-8: A Universal Solution

Several encoding forms were devised to serialize Unicode code points into bytes, such as UCS-2 and UTF-16. However, the most widely adopted solution in recent years is UTF-8, which stands for Unicode Transformation Format, 8-bit.

How UTF-8 Works

UTF-8 (Unicode Transformation Format 8-bit) is one of the most widely used encoding schemes for representing Unicode characters. UTF-8 uses a variable-length encoding approach that allows it to efficiently represent characters from different scripts and languages using 1, 2, 3 or 4 bytes.
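This variable-length property is easy to observe with Python's `str.encode`, whose default encoding is UTF-8; the length of the encoded result shows how many bytes each character needs:

```python
# Each character below needs a different number of UTF-8 bytes.
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")  # 1, 2, 3, 4 respectively
```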

To understand how UTF-8 works, it is important to understand the concept of leading and trailing bytes. The encoding of a multi-byte character starts with a leading byte whose high-order bits indicate the length of the sequence: the number of high-order bits set to 1 equals the total number of bytes used for that character. A leading byte beginning with 110, for example, starts a 2-byte sequence, so one trailing byte follows.

For example, for a 2-byte character, the leading byte starts with 110xxxxx, where 'x' represents bits derived from the character's code point. The trailing bytes start with 10xxxxxx, indicating that they carry the remaining bits of the code point.

When encoding a character into UTF-8:

  • For a 1-byte character (U+0000 to U+007F), it uses a single byte where the high-order bit is 0, followed by the 7 bits of the character’s code point.
  • For a 2-byte character (U+0080 to U+07FF), it uses a leading byte with 110xxxxx and a trailing byte with 10xxxxxx.
  • For a 3-byte character (U+0800 to U+FFFF), it uses three bytes with 1110xxxx, 10xxxxxx, and 10xxxxxx.
  • For a 4-byte character (U+10000 to U+10FFFF), it uses four bytes with 11110xxx, 10xxxxxx, 10xxxxxx, and 10xxxxxx.

In each sequence, ‘x’ represents bits derived from the character’s code point. The initial bytes (those starting with ‘110’, ‘1110’, or ‘11110’) indicate how many bytes are used to encode the character, and the subsequent bytes (those starting with ‘10’) hold the actual bits from the code point.

To illustrate the concept of UTF-8 encoding, here’s a Python code snippet that converts Unicode characters into their corresponding UTF-8 representations:

def unicode_to_utf8(code_point):
    # 1-byte sequence: U+0000 to U+007F
    # A single byte with high-order bit 0, holding the 7 bits of the code point
    if 0x0000 <= code_point <= 0x007F:
        utf8_data = bytes([code_point])

    # 2-byte sequence: U+0080 to U+07FF
    elif 0x0080 <= code_point <= 0x07FF:
        leading_byte = 0xC0 | ((code_point >> 6) & 0x1F)  # 110xxxxx
        trailing_byte = 0x80 | (code_point & 0x3F)        # 10xxxxxx
        utf8_data = bytes([leading_byte, trailing_byte])

    # 3-byte sequence: U+0800 to U+FFFF
    elif 0x0800 <= code_point <= 0xFFFF:
        leading_byte = 0xE0 | ((code_point >> 12) & 0x0F)   # 1110xxxx
        trailing_byte1 = 0x80 | ((code_point >> 6) & 0x3F)  # 10xxxxxx
        trailing_byte2 = 0x80 | (code_point & 0x3F)         # 10xxxxxx
        utf8_data = bytes([leading_byte, trailing_byte1, trailing_byte2])

    # 4-byte sequence: U+10000 to U+10FFFF
    elif 0x10000 <= code_point <= 0x10FFFF:
        leading_byte = 0xF0 | ((code_point >> 18) & 0x07)    # 11110xxx
        trailing_byte1 = 0x80 | ((code_point >> 12) & 0x3F)  # 10xxxxxx
        trailing_byte2 = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx
        trailing_byte3 = 0x80 | (code_point & 0x3F)          # 10xxxxxx
        utf8_data = bytes([leading_byte, trailing_byte1,
                           trailing_byte2, trailing_byte3])

    else:
        return None  # Outside the valid Unicode range

    return utf8_data

# Example: Convert Unicode code points to UTF-8
unicode_chars = [0x65, 0x03B1, 0x144C, 0x1F600]  # Some sample code points
for code_point in unicode_chars:
    utf8_data = unicode_to_utf8(code_point)
    if utf8_data:
        print(f"Unicode U+{code_point:04X} to UTF-8 Encoding: {utf8_data.hex()}")
    else:
        print(f"Character U+{code_point:04X} is outside the valid Unicode range.")

The provided code converts Unicode characters to UTF-8 encoding, as demonstrated in the output below:

Unicode U+0065 to UTF-8 Encoding: 65
Unicode U+03B1 to UTF-8 Encoding: ceb1
Unicode U+144C to UTF-8 Encoding: e1918c
Unicode U+1F600 to UTF-8 Encoding: f09f9880
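These results can be cross-checked against Python's built-in UTF-8 codec, which should produce identical byte sequences:

```python
# Encode each sample code point with the standard library codec for comparison.
for cp in [0x65, 0x03B1, 0x144C, 0x1F600]:
    print(f"U+{cp:04X} -> {chr(cp).encode('utf-8').hex()}")
# U+0065 -> 65
# U+03B1 -> ceb1
# U+144C -> e1918c
# U+1F600 -> f09f9880
```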

To further illustrate the concept of UTF-8 encoding and its relationship with Unicode, let’s examine a 4-byte UTF-8 sequence.

Consider the byte sequence ‘f0 9f 98 80’: the four bytes ‘f0,’ ‘9f,’ ‘98,’ and ‘80,’ each shown in hexadecimal, together make up the UTF-8 encoding of a single character. Understanding how characters are encoded into bytes is fundamental when working with character encoding in Python.

That character is the Unicode code point ‘U+1F600’ (the grinning face emoji). Comparing its UTF-8 and UTF-32 byte representations:

  • The UTF-8 representation ‘f09f9880’ shows how this code point is encoded using UTF-8, with each byte in hexadecimal form.
  • The UTF-32 representation ‘0001F600’ is simply the full 32-bit value of the Unicode code point itself.
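Both representations can be reproduced with Python's built-in codecs (using the big-endian `utf-32-be` variant here so that no byte-order mark is emitted):

```python
# Decode the 4-byte UTF-8 sequence back to a character...
data = bytes.fromhex("f09f9880")
ch = data.decode("utf-8")
print(ch, f"U+{ord(ch):04X}")        # 😀 U+1F600

# ...and re-encode it as UTF-32 to see the raw 32-bit code point.
print(ch.encode("utf-32-be").hex())  # 0001f600
```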
