Encoding, the Doctolib Crash Course

Thibault “Adædra” Hamel
Doctolib
Jul 12, 2019

If there’s one problem that never gets old and still plagues software development today, it’s encodings. It’s not always obvious, but encodings affect all the software you use and, more importantly, all the software you write. They can break your code in very subtle ways and cause hard-to-reproduce bugs and major headaches. Worse, their behaviour can depend on your location, historical factors, your operating system, and other things you may not even be aware of.

Even today in 2019, just after the 30th birthday of the Web, this problem is still a nuisance in software.

Encodings all come from a simple idea: for a computer to be able to understand text, it must be represented as numbers, since computers only work on numeric values. Therefore, every character is assigned a numerical value. That seems simple enough, but when you have to take into account all the different writing systems, alphabets, and other things that count as text besides letters, things quickly get out of hand. Multiple solutions have been applied throughout history, and they all add up to the confusing mess we’ve inherited today. For an example, see the WHATWG encoding list, which describes which encodings a web browser should support.

Encodings are hard, even if you’re Twitter. This is supposed to be an `æ`.

I’m not telling you this article will magically fix all your encoding problems: it can’t. What I’m going to do, however, is give you some insights into what is happening, a little bit of history, and a few things I’ve learned during my time as a developer that may be useful to you too, the next time you have an encoding issue.

Note that encodings are a very large and complex issue and I cannot cover everything in a single blog article. Also, this is mainly given from the point of view of a Western European, so examples may focus on encodings used in that part of the world.

This article contains some hexadecimal and binary representations of numbers. If you are not familiar with these notations, you may want to read up on them first.

Why Encoding

Computers today are very powerful calculators capable of doing mathematical operations very efficiently. But they have a very dumb flaw: they only know how to work with numbers. Text is completely alien to them; they don’t really understand it.

To let computers work with our way of writing, we must translate our text into numbers intelligible to them. Now, we can give text in the form of a series of numbers, and computers will happily crunch them.
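To make this concrete, here is a quick Python sketch showing a short string as the numbers a computer actually sees (the values happen to come from the ASCII table described below):

>>> [ord(c) for c in "Hi!"]
[72, 105, 33]
>>> [hex(n) for n in "Hi!".encode("ascii")]
['0x48', '0x69', '0x21']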

Actually, this problem of transmitting text in another format was known before the advent of computers. The most famous example is probably Morse code, used to transmit text over a wire, which technically already used a binary encoding for characters. Other examples are the maritime flag code and the semaphore code, which also encode an alphabet onto another medium, in this instance flags.

Single-byte encodings

ASCII

We’re going to skip over the very first text encodings. The beginnings of IT were a bit chaotic, and many solutions were developed by different actors. Instead, we’re going to jump directly to the most enduring of them, ASCII, still relevant nowadays as it serves as the base for many other encodings.

ASCII encodes a set of 128 characters using 7 bits. These characters are all letters from the English alphabet, both uppercase and lowercase, plus numbers, punctuation signs, and a set of “control characters”, which are characters with a special meaning for the machine. For example, the character with the value 0x0A (10) is named `LF` or “Line Feed” and is used to represent a jump to the next line. (Line jumps are actually a bit more complicated but that’s another story altogether.)

The 7-bit format was chosen as it was the most efficient way of transmitting data at the time: seven bits were enough for the whole character set. When represented with 8-bit words, characters are simply padded with an extra 0.
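You can check this property yourself; in this small Python sketch, every byte of an ASCII-encoded text stays below 0x80, i.e. the 8th bit is always 0:

>>> data = "Hello, world!".encode("ascii")
>>> all(byte < 0x80 for byte in data)  # 0x80 = 0b10000000, the 8th bit
True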

Many ASCII tables are available for you to read. On UNIX-derived systems like Linux or macOS, you can type `man ascii` to see such a table. However, the best format for the ASCII table is a 4-column one, like the table below, where we can see that uppercase and lowercase letters are just one column apart:

│ 00 ┆ NUL │ 20 ┆ SP  │ 40 ┆ @   │ 60 ┆ `   │
│ 01 ┆ SOH │ 21 ┆ !   │ 41 ┆ A   │ 61 ┆ a   │
│ 02 ┆ STX │ 22 ┆ "   │ 42 ┆ B   │ 62 ┆ b   │
│ 03 ┆ ETX │ 23 ┆ #   │ 43 ┆ C   │ 63 ┆ c   │
│ 04 ┆ EOT │ 24 ┆ $   │ 44 ┆ D   │ 64 ┆ d   │
│ 05 ┆ ENQ │ 25 ┆ %   │ 45 ┆ E   │ 65 ┆ e   │
│ 06 ┆ ACK │ 26 ┆ &   │ 46 ┆ F   │ 66 ┆ f   │
│ 07 ┆ BEL │ 27 ┆ '   │ 47 ┆ G   │ 67 ┆ g   │
│ 08 ┆ BS  │ 28 ┆ (   │ 48 ┆ H   │ 68 ┆ h   │
│ 09 ┆ HT  │ 29 ┆ )   │ 49 ┆ I   │ 69 ┆ i   │
│ 0A ┆ LF  │ 2A ┆ *   │ 4A ┆ J   │ 6A ┆ j   │
│ 0B ┆ VT  │ 2B ┆ +   │ 4B ┆ K   │ 6B ┆ k   │
│ 0C ┆ FF  │ 2C ┆ ,   │ 4C ┆ L   │ 6C ┆ l   │
│ 0D ┆ CR  │ 2D ┆ -   │ 4D ┆ M   │ 6D ┆ m   │
│ 0E ┆ SO  │ 2E ┆ .   │ 4E ┆ N   │ 6E ┆ n   │
│ 0F ┆ SI  │ 2F ┆ /   │ 4F ┆ O   │ 6F ┆ o   │
│ 10 ┆ DLE │ 30 ┆ 0   │ 50 ┆ P   │ 70 ┆ p   │
│ 11 ┆ DC1 │ 31 ┆ 1   │ 51 ┆ Q   │ 71 ┆ q   │
│ 12 ┆ DC2 │ 32 ┆ 2   │ 52 ┆ R   │ 72 ┆ r   │
│ 13 ┆ DC3 │ 33 ┆ 3   │ 53 ┆ S   │ 73 ┆ s   │
│ 14 ┆ DC4 │ 34 ┆ 4   │ 54 ┆ T   │ 74 ┆ t   │
│ 15 ┆ NAK │ 35 ┆ 5   │ 55 ┆ U   │ 75 ┆ u   │
│ 16 ┆ SYN │ 36 ┆ 6   │ 56 ┆ V   │ 76 ┆ v   │
│ 17 ┆ ETB │ 37 ┆ 7   │ 57 ┆ W   │ 77 ┆ w   │
│ 18 ┆ CAN │ 38 ┆ 8   │ 58 ┆ X   │ 78 ┆ x   │
│ 19 ┆ EM  │ 39 ┆ 9   │ 59 ┆ Y   │ 79 ┆ y   │
│ 1A ┆ SUB │ 3A ┆ :   │ 5A ┆ Z   │ 7A ┆ z   │
│ 1B ┆ ESC │ 3B ┆ ;   │ 5B ┆ [   │ 7B ┆ {   │
│ 1C ┆ FS  │ 3C ┆ <   │ 5C ┆ \   │ 7C ┆ |   │
│ 1D ┆ GS  │ 3D ┆ =   │ 5D ┆ ]   │ 7D ┆ }   │
│ 1E ┆ RS  │ 3E ┆ >   │ 5E ┆ ^   │ 7E ┆ ~   │
│ 1F ┆ US  │ 3F ┆ ?   │ 5F ┆ _   │ 7F ┆ DEL │

You may have noticed that you can go from one column to the next by adding or removing 32 (0x20): this corresponds to flipping the 6th bit on or off. For example, `K` is 0x4B (75), and if you add 0x20 you obtain 0x6B (107), `k`.
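In Python, that bit flip is a one-liner (a quick sketch; XOR with 0x20 toggles the 6th bit):

>>> chr(ord("K") ^ 0x20)
'k'
>>> chr(ord("k") ^ 0x20)
'K'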

You can also see that all control characters except `DEL` are in the first column.

Note: An interesting fact is that these control characters have an alternate notation, a `^` followed by a letter, corresponding to the Control key plus that letter. To find the associated letter, just use the 3rd column of the table.

For example, `LF`, the UNIX line ending, can also be represented as `^J`. If you type `Control`+`J` on a UNIX shell, it will have the same effect as typing Return. Also, if you `cat -v` a file with Windows line endings, you’ll see `^M` at the end of each line, which corresponds to `CR`, used in Windows line endings in pair with `LF`.
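A small Python sketch of this correspondence: the Control key effectively masks the letter’s value down to the control range (keeping only the low 5 bits), so `^J` lands on 0x0A:

>>> chr(ord("J") & 0x1F)  # Control+J
'\n'
>>> chr(ord("M") & 0x1F) == "\r"  # Control+M, the ^M shown by cat -v
True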

So while characters may seem to be placed a bit randomly, they definitely are not. The ASCII table had several revisions during its history, shuffling or changing some characters around, but it is now a fixed standard.

However, while ASCII is good for English, it does not suit other languages, as it lacks the accented characters that many of them use. Countries developed their own 7-bit encodings, replacing some US-ASCII characters with local accented letters, most often characters in the 3rd and 4th columns, which preserves the lowercase-to-uppercase rule. While this may have suited regular users, it was a problem for developers, who regularly used the replaced characters, like curly brackets ({}), and it often led to wrongly displayed text when the stored text’s encoding did not match the document viewer’s.

DOS code pages

Instead of changing characters of the ASCII standard, another alternative was to extend it. Using just one extra bit, the number of possible characters goes from 128 to 256, which leaves plenty of space for other characters. And as computers at the time had moved to architectures with 8-bit bytes, an 8-bit encoding was a natural fit.

I’m not going to go into detail on every code page (that would be way too long), but I will present some of the major cases.

In the IBM world, character sets are known as “code pages”, each referred to by an assigned number. The first interesting code page is CP437, also known as “Extended ASCII”. It was the standard encoding for IBM computers of that time, and it is still used today by some computers during the initialisation phase, before any other font is loaded.

IBM’s CP437 extended ASCII by using the extra 8th bit, and by assigning visible glyphs to the control characters. It added support for:

  • Box drawing characters (allowing some graphical effects in text mode)
  • Accented letters for western and northern Europe
  • Greek
  • Some other miscellaneous characters

The CP437 encoding, despite including some accented characters, was not designed for use in western and northern Europe. In those countries, another code page was used: CP850. It replaces some of the box drawing characters with additional accented letters, like ø, to provide better support for European languages. However, programs that used those box drawing characters would render incorrectly on computers using the CP850 page.

CP437 on the left, CP850 on the right.
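You can observe the divergence directly; in this Python sketch, the very same byte is a box drawing character under CP437 but a letter under CP850:

>>> bytes([0xD5]).decode("cp437")
'╒'
>>> bytes([0xD5]).decode("cp850")
'ı'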

CP850 was primarily used in Western and Northern Europe as well as Canada (the US mainly using CP437). These code pages are still partially used by some Windows programs with DOS-era ties and by some data formats from that time.

Many other regions had their own code pages to support their local alphabets’ quirks, like CP737 for Greece or CP852 for central Europe.

Mac OS Roman

Apple, with Mac OS, took a completely different direction and developed its own encoding, called “Mac Roman”.

As with DOS code pages, it is based on ASCII and only defines characters for the added bit. It lacks box drawing characters but includes additional mathematical symbols and accented letters, as well as the Apple logo.

ISO 8859

The multiplication of encodings led the ISO organisation, responsible for developing international standards, to publish a new standard called ISO/IEC 8859.

The ISO 8859 standard had the mission of providing standardised character encodings (character tables) for various languages. Given the large number of characters that need encoding, one table was not enough, and the standard nowadays defines 15 encodings numbered from 1 to 16 (the 12th was abandoned), officially designated ISO/IEC 8859-1 to ISO/IEC 8859-16. Of these 15 tables, 10 are based on the Latin alphabet.

For example, ISO 8859-1 is defined for western European languages, and ISO 8859-15 is an evolution of this table that swaps in some more useful characters (e.g. Œ replaced ¼) and adds the Euro (€) sign. ISO 8859-5 and ISO 8859-7 provide, respectively, the Cyrillic and Greek alphabets.
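As a consequence, the same byte can mean different things from one table to the next. A Python sketch (Python names these codecs iso-8859-1 and iso-8859-15):

>>> bytes([0xA4]).decode("iso-8859-1")
'¤'
>>> bytes([0xA4]).decode("iso-8859-15")
'€'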

These encodings leave the positions 00 to 1F and 7F to 9F undefined (in ASCII, the first block, as well as 7F, is defined as control characters). However, the ISO-8859 encodings (yes, a different series, note the added dash) keep the ASCII control characters and add a new series of control characters in the 80 to 9F zone.

ISO 8859-1 and its sibling ISO 8859-15 (often referred to as latin1 and latin9, respectively) are widespread in Western Europe.

Windows-1252

Even with ISO’s normalisation of encodings, there are still some exceptions. Under Windows, ISO 8859-1 was disregarded in favour of a slightly different encoding, usually called Windows-1252.

Windows-1252 extends ISO 8859-1 by placing characters in the second empty area, between values 80 and 9F. Some of the added characters, like Ÿ or €, were later also added to ISO 8859-15, but at different positions, making the two encodings incompatible.
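The Euro sign shows the incompatibility well; in this Python sketch, the same character lands on different bytes in the two encodings:

>>> "€".encode("cp1252")
b'\x80'
>>> "€".encode("iso-8859-15")
b'\xa4'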

Outside of the Windows world, this encoding is not used much; the ISO 8859 encodings are preferred.

Multi-byte encodings

Using full 8-bit characters extended the range to 256 available characters, but that is still far from enough to cover all languages. As written above, the ISO 8859 standard only covers a small subset of the world’s writing systems, and even with that restricted subset it required 15 different tables to do the job (even if some overlap).

Even if you limit your needs to a single writing script, an 8-bit space is sometimes not enough. For example, East Asian languages usually contain a very large number of characters, and sometimes even multiple writing systems. Encoding that many glyphs required the development of encoding systems using multiple bytes per character.

The BMP block, discussed below in the article. Note the proportion of East Asian and CJK (Chinese-Japanese-Korean) common characters compared to other scripts.

I’m not going to cover those encodings (that would require an article of its own, given how vast the subject is). East Asian countries developed several encodings to cover their use cases, usually nationally, which makes interoperability between Asian languages difficult.

In the West, the answer to this problem of multiple competing encodings was the Universal Coded Character Set, or UCS, also known as the ISO 10646 standard.

This new standard assigns each character a numeric value in a single, larger space, allowing for a little more than 1 million different signs. Latin, Cyrillic, Greek and other alphabets all live in the same table. Of course, this requires going beyond the 256-character limit, and today the UCS standard defines more than 136,000 different signs. (UCS defines more than just characters: some entries are special, like joining characters and invisible spaces, hence the term “signs”.)

The new standard is split into different zones that group character kinds together. One of these zones is the Basic Multilingual Plane (BMP), a 65,536-sign long zone containing all the characters needed to encode modern languages, including East Asian ones.

UCS-2 & UCS-4

The UCS standard simply defines a numeric value for each known character; separate character encodings then represent these values as bytes. One of the simplest is UCS-2, which encodes each code point as a 16-bit value over two bytes.

UCS-2 can encode up to 65,536 different characters, meaning it can single-handedly cover the entire Basic Multilingual Plane. However, it suffers from two shortcomings. The first is that going beyond the BMP requires composing with other encodings through escape sequences: escape sequences (ISO 2022) are used in many East Asian encodings to switch between local alphabets, and the same principle can be used to switch out of and back into UCS-2, handling the missing characters in another encoding.

The other problem is that each character is now encoded over two bytes. For standard English text, half of the bytes will be null bytes, and the required storage space is artificially doubled for characters that would have taken a single byte in their “native” encoding. Not very efficient.
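Python has no UCS-2 codec, but UTF-16 restricted to the BMP is byte-identical to UCS-2, so this sketch shows the padding problem:

>>> "Hi".encode("utf-16-be")
b'\x00H\x00i'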

UCS-4 operates on the same principle but over four bytes, and is able to represent each and every character defined in the ISO 10646 standard, and more. However, this exacerbates the inefficiency UCS-2 already suffers from, with each character now taking up to 4 times the size it would have in its local encoding.

UTF-16

To deal with the 65,536-character upper limit of UCS-2, a new standard was developed: UTF-16. In this standard, each character normally takes up 2 bytes of storage but can extend to 4 bytes for characters outside the BMP. By doing so, UTF-16 breaks a convention all the encodings shown so far followed: that all characters have the same size.

A | 0x00 0x41
Д | 0x04 0x14
🃑 | 0xD8 0x3C 0xDC 0xD1

Note: The last character is an ace of clubs. Yes, Unicode defines playing cards as characters. Domino and Mahjong tiles, too.

The benefit of this evolution is that you keep wasted space to a minimum. You use the full 4 bytes only when you need exotic characters outside the BMP, which should be a rare occurrence.
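A Python sketch confirming the variable width: a BMP character takes 2 bytes, while a character beyond the BMP takes 4 (a surrogate pair):

>>> "Д".encode("utf-16-be").hex()
'0414'
>>> "🃑".encode("utf-16-be").hex()
'd83cdcd1'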

Using multiple bytes per character unveiled another problem: byte ordering. Depending on the architecture, your computer stores a number’s least significant byte first (little-endian) or its most significant byte first (big-endian). Text moving from one architecture to the other would have to be translated from one byte order to the other.

Bytes                         │ UTF-16BE │ UTF-16LE
00 48 00 65 00 6C 00 6C 00 6F │ Hello    │ 䠀攀氀氀漀

For this, there are two solutions: either indicate the byte order as part of the encoding name (differentiating UTF-16BE and UTF-16LE), or indicate the byte order inside the document with a marker. That’s the role of the BOM (Byte Order Mark), a special character that should be placed at the beginning of a UTF-16 document to indicate its byte order.

Note that the BOM is often associated with Unicode in general, as it is sometimes inserted (needlessly) in UTF-8 documents.
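Here is a Python sketch of the BOM at work; the generic utf-16 codec prepends one automatically (the output below assumes a little-endian machine):

>>> "Hi".encode("utf-16").hex()
'fffe48006900'
>>> bytes.fromhex("fffe48006900").decode("utf-16")  # ff fe announces little-endian
'Hi'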

UTF-8

Another standard took hold, called UTF-8. In the same fashion as UTF-16, UTF-8 characters are of variable width, ranging from 1 to 4 bytes.

A | 0x41
Д | 0xD0 0x94
곴 | 0xEA 0xB3 0xB4
🃑 | 0xF0 0x9F 0x83 0x91

UTF-8 has an advantage over UTF-16: all characters from the standard ASCII table are still coded over a single byte, maintaining compatibility with ASCII. This means that a valid ASCII text is also valid UTF-8, representing the same text. It also means that a text consisting only of ASCII characters takes the same space in both, unlike with UTF-16.
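A Python sketch of this compatibility; the ASCII and UTF-8 bytes are identical, while UTF-16 doubles the size (and adds a 2-byte BOM with the generic codec):

>>> "Hello".encode("ascii") == "Hello".encode("utf-8")
True
>>> len("Hello".encode("utf-8")), len("Hello".encode("utf-16"))
(5, 12)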

Beyond these characters, multiple bytes are needed to encode a value; characters in the BMP need 2 to 3 bytes. With this structure, UTF-8 can encode every code point up to 0x10FFFF (1,114,111), making it capable of representing all characters, current and future, defined in the Unicode standard.
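The structure is visible in the bits themselves: the lead byte announces the length of the sequence, and every continuation byte starts with 10. A Python sketch with é (U+00E9), a 2-byte character:

>>> [f"{byte:08b}" for byte in "é".encode("utf-8")]
['11000011', '10101001']

The first byte matches the 2-byte lead pattern 110xxxxx; the second matches the continuation pattern 10xxxxxx.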

UTF-8 also brings other improvements over UTF-16, notably that it is easy to find a character boundary in a stream of text. Despite the variable character width, you can jump to a random position in a text and quickly find where the next valid character starts. Likewise, if the text is corrupted and a chunk is missing, it is possible to find the next valid character and resume processing. In UTF-16, losing an odd number of bytes would leave the rest of the text completely invalid and unrecoverable.
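A minimal sketch of that resynchronisation logic (my own helper, not a standard function): continuation bytes always match the bit pattern 10xxxxxx, so skipping them finds the next character boundary.

def next_boundary(data: bytes, pos: int) -> int:
    # Skip continuation bytes (10xxxxxx) until a lead byte or an ASCII byte.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

For example, next_boundary("é!".encode("utf-8"), 1) returns 2: position 1 falls inside the 2-byte é, and position 2 is the start of the !.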

Since its inception, UTF-8 has quickly grown to become the most popular encoding worldwide. Its ability to encode the whole set of Unicode characters, as well as its advantages over other Unicode character encodings like UTF-16, made it a worthy replacement for legacy encodings. Despite the numbering, UTF-8 proved itself better than UTF-16 in general use.

In 2008, Google reported that UTF-8 had become the most popular encoding for HTML files, and it continues to gain ground as the default encoding across operating systems, software, and countries.

Conclusion

As I said earlier, I have just scratched the surface: there are a lot of other character encodings, some generic, some specific, some proprietary or limited to a handful of systems or software. Nothing prevents you from doing things the hard way and creating your own encoding, if you want to add to the mess.

However, thanks to the Unicode standard and UTF-8, things are getting easier to manage. We now have a default encoding that covers all use cases, and as it becomes more and more widespread, we can more safely assume a document is UTF-8 and fall back to a local encoding if some byte sequences are invalid (whereas legacy encodings often have no invalid byte sequences at all, which makes them much harder to detect).
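In code, that strategy is a simple try-then-fall-back. Here is a minimal Python sketch; the cp1252 fallback is my assumption for Western European data, so substitute whatever legacy encoding is plausible for yours:

def decode_guess(data: bytes) -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; legacy single-byte codecs accept almost any byte.
        return data.decode("cp1252", errors="replace")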

Unicode itself is a mighty beast, and will need another article coming after this one to cover it.

In another future article, I will write about the current state of affairs with respect to encodings, along with some handy tools and practices for working with multiple encodings and debugging weird text encoding issues.
