Things you should know about Unicode

You might already have heard of some of these terms (ASCII, UTF-8, UTF-16, Unicode) without knowing exactly what they mean.

Unicode characters might seem magical

Before diving into all this complexity, I will briefly explain the history of encoding throughout the ages…


Quick story

It all started in the 1960s with ASCII, the American Standard Code for Information Interchange. ASCII is a 7-bit encoding able to encode 128 characters, 95 of which are printable.

Source: Wikipedia FR — ASCII

As computers spread and more and more languages needed to be encoded, ASCII was extended into a variety of so-called Extended ASCII charsets. There is not one but many Extended ASCII variants. The idea behind them was to use the eighth bit, left unused by ASCII, to encode 128 additional characters while keeping the first 128 unchanged.

So far so good: users in America and Europe were able to encode all their characters simply by picking the right charset. In practice, though, this raised issues when exchanging messages across regions.

For instance, if you shared a document written in Latin-9 and featuring the euro symbol (€) with someone reading it as Latin-1, they would see another character instead: ¤.
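This mismatch is easy to reproduce with the TextDecoder API available in Node.js and modern browsers. A small sketch, assuming a runtime with full encoding data; note that the WHATWG label 'latin1' actually maps to windows-1252, which decodes 0xA4 the same way Latin-1 does:

```javascript
// The very same byte, 0xA4, decodes to different characters
// depending on which charset the reader assumes.
const bytes = new Uint8Array([0xA4]);

console.log(new TextDecoder('iso-8859-15').decode(bytes)); // '€' (Latin-9)
console.log(new TextDecoder('latin1').decode(bytes));      // '¤' (Latin-1)
```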


Unicode to the rescue

While Extended ASCII looked good and useful, many issues were still unsolved:

  • not all languages can be encoded using 8 bits
  • characters are ambiguous: character 0xA4 is either “unspecified currency” — in Latin-1 — or “euro” — in Latin-9

With those drawbacks in mind, Unicode 1.0.0 was released in 1991. It introduced a strict distinction between:

  • code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange
  • code point: A value, or position, for a character, in any coded character set
  • grapheme: What a user thinks of as a character

Source: Glossary — http://www.unicode.org/glossary/

Unicode defines a range of code points going from 0x0000 to 0x10FFFF with some of them still undefined and others reserved. It can define up to 1,114,112 code points.

In addition to bringing a universal and unambiguous definition of a given code point, Unicode also defines its name, rules to combine it with others and gives a visual reference of it.

In terms of implementation, Unicode can be represented by different character encodings. The standard itself defines three of them (UTF-8, UTF-16 and UTF-32) but others are also commonly used in the industry.

For instance, the EURO SIGN has the code point 0x20AC in Unicode, but the way to represent it differs from one encoding to another:

  • One code unit / One byte: 0xA4 in Latin-9
  • Three code units / Three bytes: 0xE2 0x82 0xAC in UTF-8
  • One code unit / Two bytes: 0x20AC in UTF-16
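You can check the UTF-8 bytes yourself from JavaScript, where TextEncoder always produces UTF-8 and codePointAt exposes the code point. A quick sketch:

```javascript
const euro = '\u20AC'; // EURO SIGN

// The code point is the same regardless of the encoding...
console.log(euro.codePointAt(0).toString(16)); // '20ac'

// ...but UTF-8 stores it as three code units of one byte each.
const utf8Bytes = new TextEncoder().encode(euro);
console.log([...utf8Bytes].map(b => b.toString(16))); // ['e2', '82', 'ac']
```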

Major encodings

Unicode would be nothing without ways to encode it. It only defines the correspondence between a code point (which can be seen as an index or row number) and a name, a visual representation and so on, not how to store it in memory.

This is why Unicode needs encodings. An encoding is a mapping between in-memory code units and the code points they represent, and vice versa.

In the list below I explicitly chose to include the size of one code unit for each encoding. It is important to underline that, depending on the encoding, a code point can be represented by multiple code units (often called a “character” in programming languages).

UTF-8

  • Code unit is 1 byte long,
  • From 1 to 4 bytes per code point,
  • No byte ordering issue — BOM explained later,
  • Backward compatible with ASCII — characters in the range 0x00 to 0x7F are encoded as in ASCII and on a single byte,
  • The standard for communication between services
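The variable-length rule of UTF-8 can be summed up in a few lines. This is an illustrative helper, not part of any standard API:

```javascript
// How many bytes UTF-8 needs for a given code point.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;  // ASCII range, byte values unchanged
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;                         // up to 0x10FFFF
}

console.log(utf8ByteLength(0x41));    // 1, 'A'
console.log(utf8ByteLength(0x20AC));  // 3, '€'
console.log(utf8ByteLength(0x1F431)); // 4, CAT FACE
```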

UCS-2

  • Covers only a subset of Unicode: only the BMP plane (Basic Multilingual Plane), from 0x0000 to 0xFFFF,
  • Code unit is 2 bytes long,
  • Always 2 bytes long,
  • Some forbidden characters, called surrogates, from 0xD800 to 0xDFFF,
  • Byte ordering required

UTF-16

  • Code unit is 2 bytes long,
  • Successor of UCS-2,
  • Backward compatible with UCS-2 for the BMP plane,
  • Either 2 or 4 bytes per code point,
  • Byte ordering required
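Code points above the BMP are encoded in UTF-16 with a pair of surrogate code units. The arithmetic can be sketched as follows; this is an illustrative helper, not a standard API:

```javascript
// Encode a code point above 0xFFFF as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;   // 20 significant bits remain
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F431);     // CAT FACE
console.log(high.toString(16), low.toString(16)); // 'd83d' 'dc31'
console.log(String.fromCharCode(high, low));      // '🐱'
```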

UTF-32

  • Code unit is 4 bytes long,
  • Always 4 bytes long,
  • Byte ordering required

The encodings above are the most frequent ones in Western countries. Nonetheless, it is worth mentioning the existence of GB 18030, a very common encoding in China.

More on encodings at https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings


Byte Order Mark

But what is this unexplained “byte ordering” / “byte order mark”? And why is it so important?

The Byte Order Mark, or BOM, is a marker added at the beginning of a file to specify the endianness used when storing the document. Basically, when storing a data structure longer than one byte, there are two ways to do it:

  • Big endian notation of “1” (int32) is 0x00 0x00 0x00 0x01
  • Little endian notation of “1” (int32) is 0x01 0x00 0x00 0x00

As a consequence, any encoding whose code units are longer than one byte has to come with a BOM in order to help the reader understand in which order the bytes were written.

Indeed, in UTF-16 both 0x00 0x01 and 0x01 0x00 are valid code units and there is no way, except the BOM, to tell which one was intended.

The second usage of the BOM is to help the reader figure out which encoding has been used. For that reason, even though it is unnecessary for UTF-8 (code units are one byte long, so byte order does not matter), adding it tells the reader that the content is UTF-8.

The table below mixes encoding and byte ordering for three code points defined by Unicode:

Some examples of how characters can be encoded

As stated when describing the second usage of the BOM, its value differs from one encoding to another. Here is the list of those markers:

  • UTF-8: 0xEF 0xBB 0xBF
  • UTF-16 (big): 0xFE 0xFF
  • UTF-16 (little): 0xFF 0xFE
  • UTF-32 (big): 0x00 0x00 0xFE 0xFF
  • UTF-32 (little): 0xFF 0xFE 0x00 0x00
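A naive sniffer over the first bytes of a buffer could look like this. An illustrative sketch; note that the UTF-32 little-endian mark must be tested before the UTF-16 one, since the latter is a prefix of the former:

```javascript
// Guess the encoding of a buffer from its Byte Order Mark, if any.
function sniffBOM(bytes) {
  const [b0, b1, b2, b3] = bytes;
  if (b0 === 0xEF && b1 === 0xBB && b2 === 0xBF) return 'UTF-8';
  if (b0 === 0x00 && b1 === 0x00 && b2 === 0xFE && b3 === 0xFF) return 'UTF-32 (big)';
  if (b0 === 0xFF && b1 === 0xFE && b2 === 0x00 && b3 === 0x00) return 'UTF-32 (little)';
  if (b0 === 0xFE && b1 === 0xFF) return 'UTF-16 (big)';
  if (b0 === 0xFF && b1 === 0xFE) return 'UTF-16 (little)';
  return 'no BOM';
}

console.log(sniffBOM(new Uint8Array([0xFF, 0xFE, 0xAC, 0x20]))); // 'UTF-16 (little)'
```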

Character in common languages

While the topic of encodings and Unicode has now been at least partially covered (we will come back to it just after), an important question remains: what do I actually use every day when running code? When sending emails?

It is worth noting that most data transmission, whether file storage or internet communication, uses the UTF-8 encoding.

Most modern languages (C#, Java, JavaScript, Scala) chose UTF-16 to encode code points, while C/C++ encode characters on a single byte. In this part, we will focus on the UTF-16 choice, but it also applies to any encoding that may require multiple code units for a single code point.

One of the problems related to this choice is that none of those languages properly handles characters outside of the first 65,536 code points.

For instance, in JavaScript the CAT FACE code point, defined at 0x1F431, is considered to have a length of two and can be split. Basically, most string methods do not handle this code point as expected:
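A quick session in Node.js or a browser console shows the mismatch: length, indexing and slicing all count UTF-16 code units, while the spread operator iterates by code points.

```javascript
const cat = '\u{1F431}'; // CAT FACE, a single code point above the BMP

console.log(cat.length);                      // 2, counted in code units
console.log(cat.charCodeAt(0).toString(16));  // 'd83d', a lone high surrogate
console.log(cat.slice(0, 1));                 // half of the pair, usually rendered as '�'
console.log([...cat].length);                 // 1, iterating by code points
console.log(cat.codePointAt(0).toString(16)); // '1f431'
```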

Most of the time the choice has been to operate at the code unit level rather than the code point level. This leads to strange issues, with � characters popping up everywhere because of bad code point handling.


Additional traps you need to worry about

Before closing this story, and because many traps remain, this last section gives a quick insight into some of them.

Lower/Upper case

The lower or upper case value of a character might need more storage than the original value. For instance, the upper case value of the character ‘ß’ is ‘SS’ which is two characters long.

Moreover, the lower or upper case value of a character depends on the locale. The same character can have different upper case values depending on the locale used by the computer.
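Both traps show up directly in JavaScript: toUpperCase can change the length of a string, and toLocaleUpperCase gives locale-dependent results. A sketch, assuming a runtime shipped with full locale data (as in a standard Node.js build):

```javascript
// Upper-casing can grow the string...
console.log('ß'.toUpperCase());                    // 'SS'
console.log('ß'.length, 'ß'.toUpperCase().length); // 1 2

// ...and it depends on the locale: Turkish has a dotted capital I.
console.log('i'.toLocaleUpperCase('en-US')); // 'I'
console.log('i'.toLocaleUpperCase('tr-TR')); // 'İ' (U+0130)
```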

Composable characters

Unicode also defines how we can compose characters together. It can be used to build known and already defined code points or build new ones.

Two ways to write ‘ä’: ‘a’ + combining diaeresis (‘\u0061\u0308’) or the precomposed character (‘\u00e4’)

The two notations above are totally equivalent for the Unicode standard, but they are considered different in most programming languages.

In UTF-16, the first notation ‘\u0061\u0308’ has a length of 2 code units while ‘\u00e4’ has a length of 1. In JavaScript, for instance, the two strings are considered different even though they represent the same grapheme.
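In JavaScript the difference can be observed, and bridged, with String.prototype.normalize:

```javascript
const composed = '\u00E4';         // precomposed 'ä'
const decomposed = '\u0061\u0308'; // 'a' + COMBINING DIAERESIS

console.log(composed === decomposed);            // false, different code points
console.log(composed.length, decomposed.length); // 1 2

// NFC normalization recomposes the sequence into the single code point.
console.log(composed === decomposed.normalize('NFC')); // true
```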

Graphemes

After code units, which can be seen as the minimal unit of storage, and code points, which define a character, its representation and the ways it can be composed, let’s say a few words about graphemes. A grapheme is the result of combining one or multiple code points into what a user perceives as a single character.

The composable characters feature makes it possible to define new characters based on known code points. For instance, nothing forbids defining an ‘a’ character with multiple ¨ on top of it, even though it has no single code point equivalent:

‘a’ character with three ¨ on top of it: ‘\u0061\u0308\u0308\u0308’

This feature is also heavily used by the standard to define emojis. Indeed, most of them result from the combination of multiple code points:

  • the man farmer emoji is the combination of the code points MAN + ZERO WIDTH JOINER + EAR OF RICE
  • the woman farmer emoji is the combination of the code points WOMAN + ZERO WIDTH JOINER + EAR OF RICE
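The three levels (code units, code points, graphemes) can be counted in JavaScript. A sketch, assuming a runtime that ships Intl.Segmenter (Node.js 16 and later, modern browsers):

```javascript
// MAN + ZERO WIDTH JOINER + EAR OF RICE: the man farmer emoji
const farmer = '\u{1F468}\u200D\u{1F33E}'; // '👨‍🌾'

console.log(farmer.length);      // 5 UTF-16 code units
console.log([...farmer].length); // 3 code points

// Counting graphemes requires Intl.Segmenter.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log([...segmenter.segment(farmer)].length); // 1 grapheme
```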

More on emojis at http://unicode.org/reports/tr51/


Conclusion

One of the most important distinctions to keep in mind when dealing with characters is the difference between code units, code points and graphemes.

It is also important to understand how the programming languages you use encode code points and deal with them. Depending on the language and the method, they can act either at the code unit level or at the code point level, or even at the grapheme level if you are lucky.

Please leave a clap or a comment if you liked this article ;)

Additional readings: