Why Encoding Matters: A Primer on Bits and Bytes

🐻TheDev
6 min read · Oct 27, 2016


UTF-8 accounted for 87.7% of all Web pages in October 2016.

Looking at this chart leads to an obvious conclusion: encoding doesn’t matter; everyone uses UTF-8, and so should you.

NOT TRUE!!! Encoding does matter.

To grasp the full significance of encoding, though, we have to go back to the old days: the pre-UTF-8 era.

Remember when you tried to talk to somebody over email or IM and you couldn’t figure out why the person received your messages just fine but replied with garbled text?

Those were the dark ages of encoding. Communication could easily break down, and it would be hard for the average user to figure out what exactly had gone wrong. This problem resulted from using different encodings to read byte sequences. To understand the crux of this issue, let’s look at two key terms:

  • Bit: a basic unit of information, commonly represented as either 0 or 1
  • Byte: a fixed-length sequence of bits, most commonly consisting of 8 bits

Thus, a byte can represent 2⁸ = 256 distinct values, ranging from 0 to 255 (or 00000000 to 11111111 in binary). The bits in a byte are given indices: the right-most bit (also referred to as the least significant) gets index 0, and the left-most bit (the most significant) gets index 7. This is still the case today, btw!

Check out this simple chart that shows you decimals in hexadecimal and binary form.
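
If you want to see that mapping for yourself, here's a quick Swift sketch (the sample values are just for illustration) that prints a few numbers in decimal, hexadecimal, and binary, and then reads single bits of a byte by their indices:

for value in [0, 10, 170, 255] {
    print(value, String(value, radix: 16, uppercase: true), String(value, radix: 2))
}
// 0 0 0
// 10 A 1010
// 170 AA 10101010
// 255 FF 11111111

// Reading individual bits by index, from the least significant (index 0)
// to the most significant (index 7):
let byte: UInt8 = 0b1000_0001
let bit0 = byte & 1          // 1 (the right-most bit)
let bit7 = (byte >> 7) & 1   // 1 (the left-most bit)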

Writing text as raw binary (or hexadecimal) numbers was cumbersome and inefficient, to say the least. What was needed was an encoding standard: an agreed-upon mapping from characters to byte values. This is where things got weird, because, as you’ve probably guessed, the meaning of a raw byte sequence changed depending on the encoding used.

Enter ASCII

Most modern character-encoding schemes are based on ASCII (although they support many additional characters).

ASCII is a character encoding standard that uses 7 bits to represent 128 specified characters based on the English alphabet. Codes 0 through 31 (plus 127) were reserved for control characters that dictated how the data should be interpreted and represented; they were designed for printing control, data structuring, and transmission control. The printable characters, including all unaccented English letters, occupy codes 32 through 126.

The key here is that ASCII was designed for unaccented English characters. Even though this encoding uses only 7 bits (leaving an entire bit of every byte unused), it cannot easily represent many other languages.
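
You can see that 7-bit limit directly in code. Here's a small Swift sketch that prints each character's code value and whether it fits into ASCII:

for scalar in "née".unicodeScalars {
    // isASCII is true only for values 0...127, i.e. values that fit in 7 bits
    print(scalar, scalar.value, scalar.isASCII)
}
// n 110 true
// é 233 false
// e 101 true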

For example, East Asian writing systems such as Chinese have thousands of characters, which were never going to fit into 8 bits. Many other alphabets use accents such as ˆ, ´, ¨, and so on. Therefore, just as the United States developed ASCII, other countries created their own encodings. This led to an incongruent mix of code standards, each shaped by the needs and nuances of a particular language, and moving text between them required transcoding.

Essentially, different countries had different encoding standards, and software tailored to one standard could not reliably exchange text with software tailored to another.

The same bits, under different encoding standards, became different characters.
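
Here's a quick Swift sketch of that breakdown (Windows-1251 stands in here as a once-common Cyrillic encoding): the very same byte decodes to two completely different letters.

import Foundation

let data = Data([0xC0])  // one single byte

// Interpreted as Latin-1 (ISO/IEC 8859-1), 0xC0 is "À".
let asLatin1 = String(data: data, encoding: .isoLatin1)       // Optional("À")

// Interpreted as the Cyrillic code page Windows-1251, it is "А".
let asCyrillic = String(data: data, encoding: .windowsCP1251) // Optional("А")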

Then came ISO/IEC 8859, a family of 8-bit encodings that were backward-compatible with 7-bit ASCII. Its first part, ISO/IEC 8859-1 (Latin-1), consisted of 191 characters from the Latin script and included various accents that allowed for complete coverage of Afrikaans, Corsican, Faroese, Norwegian, and many other languages.

Nevertheless, every part of ISO/IEC 8859 was a single-byte, fixed-length encoding limited to 256 code values (Latin-1, for instance, corresponds exactly to the first 256 Unicode code points), which meant that some languages still remained incompatible.

For example, consider ISO/IEC 8859-5, which was designed for Cyrillic alphabets:

ISO/IEC 8859–5 designed to cover languages that use a Cyrillic alphabet.

And GB 18030 — the official character set of the People’s Republic of China:

Information technology — Chinese coded character set defines the required language and character support necessary for software in China.

Although Cyrillic alphabets were covered by ISO/IEC 8859-5, no part of ISO/IEC 8859 could support the Chinese writing system and its thousands of characters.

Enter Unicode

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.

The latest version of Unicode (9.0) contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts, as well as multiple symbol sets. Two key terms to know:

  • Unicode code point: a single value in the Unicode code space with a possible value from 0 to 0x10FFFF (17 × 65,536 = 1,114,112 code points)
  • Private Use Area (PUA): a range of code points that is intentionally left undefined so that third parties may define their own characters without conflicting with Unicode Consortium assignments
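
To make "code point" a bit more concrete, here's a quick Swift sketch that prints the code point of every Unicode scalar in a string:

for scalar in "ñ🙂".unicodeScalars {
    // Valid code points range from U+0000 up to U+10FFFF.
    print(scalar, "U+" + String(scalar.value, radix: 16, uppercase: true))
}
// ñ U+F1
// 🙂 U+1F642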

Unicode can be implemented by different character encodings; the most commonly used ones are:

  • UTF-8: an 8-bit, variable-width encoding that maximizes compatibility with ASCII
  • UTF-16: a 16-bit, variable-width encoding (uses a minimum of 2 bytes and is therefore incompatible with ASCII)

The first 256 Unicode code points match the characters found in Latin-1 (ISO/IEC 8859–1). Moreover, UTF-8 is backward-compatible with 7-bit ASCII as it can use 1 byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding.
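
A quick Swift sketch makes the difference visible: ASCII characters keep their one-byte values in UTF-8, everything else grows, and UTF-16 never drops below two bytes per code unit.

print(Array("A".utf8))    // [65] - the same value as in ASCII, one byte
print(Array("ñ".utf8))    // [195, 177] - two bytes
print(Array("👩".utf8))   // [240, 159, 145, 169] - four bytes

print(Array("A".utf16))   // [65] - one 16-bit code unit, i.e. two bytes
print(Array("👩".utf16))  // [55357, 56425] - a surrogate pair, four bytes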

So why does encoding matter…?

String is a fundamental type in many programming languages. Encoding is the way characters are stored in memory. Adding two and two together —

Encoding matters because it affects the way strings, which are collections of characters, are stored in memory.

Let’s talk about Unicode equivalence: the notion that some sequences of code points represent essentially the same character. This feature allows for compatibility with preexisting standard character sets. There are two distinct notions of Unicode equivalence, but we’ll be focusing on one, canonical equivalence, which holds when two code point sequences have the same appearance and meaning when printed or displayed.

U+006E followed by U+0303 is canonically equivalent to U+00F1; that is, the letter n followed by a combining tilde (˜) is equivalent to the single character ñ.
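
Swift happens to compare strings using canonical equivalence, so you can verify this in a couple of lines:

let composed = "\u{00F1}"     // ñ as a single, precomposed code point
let decomposed = "n\u{0303}"  // the letter n followed by a combining tilde

composed == decomposed             // true - canonically equivalent
composed.unicodeScalars.count      // 1
decomposed.unicodeScalars.count    // 2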

Another example is emoji, where multi-person groupings are composed from several individual member emoji joined with the zero-width joiner (U+200D).

"👩🏾".characters.count // 2
"👨‍👨‍👧‍👧".characters.count // 4
"👩\u{200D}👩\u{200D}👦\u{200D}👦" == "👩‍👩‍👦‍👦" // true

As you can see above, in Swift 3.0 the character counts of the two emoji don’t match what you see on screen (if you’re wondering why 👩🏾 has a count of 2, it’s because one code point represents the person while another is the skin-tone modifier, and Swift 3 counts them as separate characters).

This introduces the idea that working with canonically equivalent sequences of code points can lead to unexpected outcomes, or bugs. While this example is undeniably niche, it highlights the importance of encoding in programming today.

Now, for an even more nuanced example, consider a different scenario. Imagine you have a relatively low cap on an object’s size in bytes. A QR code, for example: this type of 2D barcode can store a maximum of 2,953 bytes in its 8-bit (binary) mode. Would it matter which encoding we use?

Yes. Remember, ASCII is a fixed-width encoding that always uses one byte per character, while Unicode encodings store characters as byte sequences in several different ways. Thus, the same text will occupy a different number of bytes depending on the encoding used.
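
As a rough Swift sketch, here's how the same short message measures up against a byte budget under different encodings (the sample strings are just for illustration):

let russian = "Привет, мир!"  // "Hello, world!" in Russian
let english = "Hello, world!"

russian.utf8.count        // 21 bytes - Cyrillic letters take 2 bytes each in UTF-8
russian.utf16.count * 2   // 24 bytes - every UTF-16 code unit is 2 bytes
english.utf8.count        // 13 bytes - pure ASCII text stays at 1 byte per character

With a hard cap like the 2,953 bytes of a QR code, those differences decide how much text actually fits.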

To sum it all up, bits, bytes, and canonical equivalence are three concepts that every programmer should know.

Additional Resources

Tom Scott passionately breaks down encoding.
Binary is so 01100011 01101111 01101111 01101100.
