Introduction to Unicode equivalence and normalization

Wan Xiao
7 min read · Jul 2, 2022


This article introduces the concepts of Unicode equivalence and Unicode normalization.

Previously, I wrote Introduction to character encoding, which covered UTF-16 surrogate pairs and modified UTF-8. If you don't pay attention to those concepts, you can easily run into weird issues and crashes. The concepts of Unicode equivalence and Unicode normalization introduced in this article are just as important for software that processes strings.

It is recommended to read Introduction to character encoding before this article. The basic knowledge and terms covered in the previous article will not be described again here.

What is Unicode equivalence

Unicode equivalence means that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets.

Equivalent characters look similar or identical.

There are two types of Unicode equivalence: canonically equivalent and compatible.

Canonically equivalent

Canonically equivalent code point sequences have the same appearance and meaning, whether printed or displayed. For example, U+006E (n) followed by U+0303 ( ̃ ) is canonically equivalent to U+00F1 (ñ) in Unicode. These two sequences display exactly the same and can replace each other in character sorting and searching. Similarly, Korean syllable blocks can be replaced with sequences of 2 or 3 code points.
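As a quick illustration (a minimal sketch using Python's standard unicodedata module, not tied to any particular project), you can verify this equivalence programmatically:

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303
precomposed = "\u00F1"   # U+00F1 (ñ)

print(decomposed == precomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: canonically equivalent
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True: same decomposed form
```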

Compatible

Compatible code point sequences may look different when displayed, but have the same meaning in some contexts. For example, U+FB00 (ff) is compatible with the sequence U+0066 U+0066 (two f's), but not canonically equivalent to it. Compatible sequences can be treated interchangeably in some applications, but not in others.
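A minimal sketch of the difference (again using Python's unicodedata, purely for illustration): canonical normalization leaves the ligature alone, while compatibility normalization splits it.

```python
import unicodedata

ligature = "\uFB00"  # U+FB00 (ff ligature)

print(unicodedata.normalize("NFD", ligature) == ligature)  # True: canonical normalization keeps it
print(unicodedata.normalize("NFKD", ligature))             # 'ff': compatibility normalization splits it
print(unicodedata.normalize("NFKD", ligature) == "ff")     # True
```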

Canonical equivalence is stricter than compatibility and is a subset of it: canonically equivalent sequences must also be compatible, but compatible sequences are not necessarily canonically equivalent.

Why Unicode equivalence is needed

Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character “Å” can be encoded as U+00C5 (standard name: LATIN CAPITAL LETTER A WITH RING ABOVE) or as U+212B (standard name: ANGSTROM SIGN). Most symbols have only a single code point. The code points of truly identical characters, which are rendered in the same way in Unicode fonts, are defined to be canonically equivalent.
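For example, the two encodings of “Å” collapse to the same code point after canonical normalization (illustrative Python sketch):

```python
import unicodedata

angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_with_ring   = "\u00C5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom_sign == a_with_ring)  # False: distinct code points
print(unicodedata.normalize("NFC", angstrom_sign) ==
      unicodedata.normalize("NFC", a_with_ring))  # True: canonically equivalent
print("U+%04X" % ord(unicodedata.normalize("NFC", angstrom_sign)))  # U+00C5
```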

Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters, such as U+00F1 (ñ) or U+00C5 (Å), or as combinations of two or more characters, such as U+FB00 (ff) or U+0132 (IJ).

For consistency with other standards, and for greater flexibility, Unicode also provides code points for many elements that are not used on their own, but are meant to modify or combine with a preceding base character. An example is U+3099, the combining voiced sound mark (my Chrome can't render this character on its own, please see the picture).

U+3099

が (U+304C) can be seen as か (U+304B) followed by U+3099. There are many more such examples; the following are just some of them:

Characters combined with U+3099
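The が example can be reproduced with a small sketch (Python's unicodedata again, just for illustration): composing か + U+3099 yields が, and decomposing が gives the two-code-point sequence back.

```python
import unicodedata

ka = "\u304B"           # か
voiced_mark = "\u3099"  # combining voiced sound mark
ga = "\u304C"           # が

print(unicodedata.normalize("NFC", ka + voiced_mark) == ga)  # True: composed into が
print(unicodedata.normalize("NFD", ga) == ka + voiced_mark)  # True: decomposed back
```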

In the context of Unicode, character composition is the process of replacing the code point of a base letter followed by one or more combining characters with a single precomposed character; character decomposition is the opposite process.

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritical marks, in whatever order these may occur. There are exceptions, though; see the Canonical ordering section below.

Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons, such as ligatures like U+FB00 (ff) or U+0132 (IJ), the half-width katakana characters, or the full-width Latin letters used in Japanese text; or which add new semantics without losing the original ones, such as digits in subscript or superscript positions, or circled digits such as U+2460 (①) inherited from some Japanese fonts.

Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.

Normalization

Text processing software implementing Unicode string search and comparison must take the presence of equivalent code points into account. Without this, users searching for a particular code point sequence would be unable to find visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent. This unique sequence of code points is called the normalization form / normal form of the original text.

There are two kinds of normal form. One is fully composed: as many code points as possible are replaced with single code points. The other is fully decomposed: single code points are decomposed into multiple code points as far as possible.

Since the equivalence criteria can be either canonical or compatibility, Unicode provides four normalization algorithms:

NFD: Canonical Decomposition
NFC: Canonical Decomposition, followed by Canonical Composition
NFKD: Compatibility Decomposition
NFKC: Compatibility Decomposition, followed by Canonical Composition
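The following sketch (illustrative only) shows how the four forms treat a string containing both a precomposed character and a compatibility character:

```python
import unicodedata

s = "\u00E9\uFB00"  # é (precomposed) followed by the ff ligature

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, s)
    print(form, [f"U+{ord(c):04X}" for c in result])
# NFC:  U+00E9, U+FB00                  (é stays composed, ligature untouched)
# NFD:  U+0065, U+0301, U+FB00          (é decomposed, ligature untouched)
# NFKC: U+00E9, U+0066, U+0066          (é stays composed, ligature split)
# NFKD: U+0065, U+0301, U+0066, U+0066  (é decomposed, ligature split)
```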

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results.

For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ), and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NFC or NFD) does not affect any of these, but compatibility normalization (NFKC or NFKD) will decompose the ffi ligature into its constituent letters “ffi”. Because this is only a compatibility equivalence, not a canonical one, the letters are not recomposed into the ligature afterwards. So a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 (ffi) but not in an NFC normalization of U+FB03 (ffi). The same applies when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript U+2075 (⁵) is transformed to “5” (U+0035) by the compatibility mapping.
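A short sketch of these search differences (illustrative):

```python
import unicodedata

ffi = "\uFB03"  # ffi ligature

print("f" in unicodedata.normalize("NFC", ffi))   # False: the ligature stays intact
print("f" in unicodedata.normalize("NFKC", ffi))  # True: decomposed into 'f', 'f', 'i'
print(unicodedata.normalize("NFKC", "\u2168"))    # 'IX' (Roman numeral Ⅸ)
print(unicodedata.normalize("NFKC", "\u2075"))    # '5'  (superscript five)
```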

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.

They are not injective, however: different strings can be mapped to the same normal form, so the original sequence cannot be recovered after normalization, let alone the transformations being bijective. For example, both U+212B (Å) and U+00C5 (Å) are decomposed by NFD and NFKD to U+0041 (A) U+030A ( ◌̊ ), and both are recomposed by NFC or NFKC to U+00C5 (Å).
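Both properties can be seen in a small sketch (illustrative): normalizing twice changes nothing, but the distinction between U+212B and U+00C5 is lost and cannot be recovered.

```python
import unicodedata

angstrom_sign = "\u212B"  # Å (ANGSTROM SIGN)
a_with_ring   = "\u00C5"  # Å (LATIN CAPITAL LETTER A WITH RING ABOVE)

# Idempotent: normalizing an already-normalized string changes nothing.
once = unicodedata.normalize("NFD", angstrom_sign)
print(unicodedata.normalize("NFD", once) == once)  # True

# Not injective: two different inputs share one normal form,
# so the original code point cannot be recovered afterwards.
print(unicodedata.normalize("NFD", angstrom_sign) ==
      unicodedata.normalize("NFD", a_with_ring))                   # True
print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)  # True: recomposed as U+00C5
```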

The normal forms are not closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized. This even happens in NFD, because accents are canonically ordered, and may rearrange around the point where the strings are joined. Consider the string concatenation examples shown below:

How concatenation breaks a normalized string
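Here is one concrete case as a sketch (illustrative; unicodedata.is_normalized needs Python 3.8 or later): each piece is in NFD on its own, but the concatenation is not, because the combining dot below (class 220) has to move before the circumflex (class 230).

```python
import unicodedata

x = "a\u0302"  # a + combining circumflex (combining class 230), in NFD on its own
y = "\u0323"   # combining dot below (combining class 220), in NFD on its own

print(unicodedata.is_normalized("NFD", x))      # True
print(unicodedata.is_normalized("NFD", y))      # True
print(unicodedata.is_normalized("NFD", x + y))  # False: marks must be reordered
print(unicodedata.normalize("NFD", x + y) == "a\u0323\u0302")  # True

# The same happens with NFC: "e" and "\u0301" are each normalized,
# but their concatenation composes into é (U+00E9).
print(unicodedata.normalize("NFC", "e" + "\u0301") == "\u00E9")  # True
```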

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.

Take U+1EBF ( ế ) as an example. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 ( ̂ ) U+0301 ( ́ ). The combining classes of U+0302 ( ̂ ) and U+0301 ( ́ ) are both 230, so U+1EBF ( ế ) is not equivalent to U+0065 (e) U+0301 ( ́ ) U+0302 ( ̂ ).

So not every combining sequence has an equivalent precomposed character; for example, U+0065 (e) U+0301 ( ́ ) U+0302 ( ̂ ) can only be composed as U+00E9 ( é ) followed by U+0302 ( ̂ ).
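These ordering rules can be checked with a sketch (illustrative): both marks have combining class 230, so the two orders compose differently, while marks with different classes do get reordered.

```python
import unicodedata

circumflex = "\u0302"
acute = "\u0301"
print(unicodedata.combining(circumflex), unicodedata.combining(acute))  # 230 230

# Same class: order is preserved, so the two orders are not equivalent.
print(unicodedata.normalize("NFC", "e" + circumflex + acute) == "\u1EBF")  # True: ế
print(unicodedata.normalize("NFC", "e" + acute + circumflex))              # é followed by U+0302, not ế

# Different classes: U+0323 (class 220) is reordered before U+0302 (class 230).
print(unicodedata.normalize("NFD", "e\u0302\u0323") == "e\u0323\u0302")    # True
```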

The Unicode standard defines the canonical decomposition mapping in UnicodeData.txt, where the canonical decomposition of each character can be found.
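In Python, these per-character mappings are exposed by unicodedata.decomposition (a quick illustrative look-up; the raw data lives in UnicodeData.txt, linked at the end of this article):

```python
import unicodedata

# Canonical decomposition mappings:
print(unicodedata.decomposition("\u00F1"))  # '006E 0303'  (ñ -> n + combining tilde)
print(unicodedata.decomposition("\u1EBF"))  # '00EA 0301'  (ế -> ê + combining acute, one level only)
print(unicodedata.decomposition("\u212B"))  # '00C5'       (ANGSTROM SIGN -> Å)

# Compatibility mappings carry a tag:
print(unicodedata.decomposition("\uFB00"))  # '<compat> 0066 0066'
print(unicodedata.decomposition("\u2075"))  # '<super> 0035'
```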

What should we pay attention to

When comparing and searching strings, we need to pay attention to Unicode equivalence. Even in Chinese there are equivalent cases: for example, U+2F8A6 (慈) is canonically equivalent to U+6148 (慈).
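You can check this pair the same way (illustrative sketch):

```python
import unicodedata

compat_ideograph  = "\U0002F8A6"  # CJK COMPATIBILITY IDEOGRAPH-2F8A6 (慈)
unified_ideograph = "\u6148"      # 慈

print(compat_ideograph == unified_ideograph)                                # False: different code points
print(unicodedata.normalize("NFC", compat_ideograph) == unified_ideograph)  # True: canonically equivalent
```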

In addition, I know that Thai uses a lot of combining characters, which Thai input methods can enter directly; Korean and Japanese may have similar cases. When processing strings, pay attention to how you measure string length: the number of code points does not necessarily match the length of the text the user actually sees. When sending user input to the backend, be aware of possible problems caused by Unicode equivalence.
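A small sketch of the length issue (illustrative): the same visible character can be 1 or 3 code points depending on the form, so naive length checks and equality comparisons can disagree with what the user sees.

```python
import unicodedata

s = "\u1EBF"  # ế, a single code point
decomposed = unicodedata.normalize("NFD", s)

print(len(s), len(decomposed))  # 1 3  -- both display as one character
print(s == decomposed)          # False, although the user sees the same text
print(unicodedata.normalize("NFC", s) ==
      unicodedata.normalize("NFC", decomposed))  # True: compare after normalizing both sides
```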

When two applications share Unicode data, but normalize them differently, errors and data loss can result.

To be honest, I haven't personally run into a problem caused by Unicode equivalence yet, so I can't give readers many concrete warnings. But understanding the content of this article should help you identify and solve problems caused by Unicode equivalence.

Reference links

https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt
https://minaret.info/test/normalize.msp
https://r12a.github.io/app-conversion/
https://www.compart.com/en/unicode/combining
