Basics of Javascript · String · normalize() (method)

Published in Nerd For Tech · 5 min read · Jun 8, 2021

This article is a transcript of my free YouTube series about the basics of web development. If you prefer watching over reading, feel free to visit my channel “Dev Newbs”.

Hi to all of my fellow developers! We are going to mess with diacritic marks once again. Today’s daily special is the normalize() method, which tinkers with different ways of representing the same character. Let’s begin!

The normalize() method returns the Unicode Normalization Form of the string.

The only parameter we need to provide is a form. Form can have one of these 4 values:

NFC (Canonical Decomposition, followed by Canonical Composition)
NFD (Canonical Decomposition)
NFKC (Compatibility Decomposition, followed by Canonical Composition)
NFKD (Compatibility Decomposition)

If you omit the value or pass undefined, the default value “NFC” is used.
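Here is a minimal sketch of that default, plus what happens when the form is not one of the four values. The variable names are only illustrative. The RangeError is the behavior the ECMAScript specification describes for an unsupported form.

```javascript
// "ñ" written as the letter "n" followed by a combining tilde.
const decomposedN = '\u006e\u0303';

// Omitting the form, or passing undefined, behaves like 'NFC'.
const byDefault = decomposedN.normalize();
const byUndefined = decomposedN.normalize(undefined);
const byNFC = decomposedN.normalize('NFC');

console.log(byDefault === byNFC);    // true
console.log(byUndefined === byNFC);  // true
console.log(byDefault.length);       // 1

// Any other value throws a RangeError. Form names are case-sensitive.
let errorName = null;
try {
  decomposedN.normalize('nfc');
} catch (err) {
  errorName = err.name;
}
console.log(errorName);              // RangeError
```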

That’s a lot of abbreviations. Let’s check the first two in example 1.

let comped =   '\u0045\u006c\u0020\u004e\u0069\u00f1\u006f';
let decomped = '\u0045\u006c\u0020\u004e\u0069\u006e\u0303\u006f';

comped + " (" + comped.length + ")";             // El Niño (7)
decomped + " (" + decomped.length + ")";         // El Niño (8)
comped == decomped;                              // false

let compedNFC = comped.normalize('NFC');
let decompedNFC = decomped.normalize('NFC');

compedNFC + " (" + compedNFC.length + ")";       // El Niño (7)
decompedNFC + " (" + decompedNFC.length + ")";   // El Niño (7)
compedNFC == decompedNFC;                        // true

let compedNFD = comped.normalize('NFD');
let decompedNFD = decomped.normalize('NFD');

compedNFD + " (" + compedNFD.length + ")";       // El Niño (8)
decompedNFD + " (" + decompedNFD.length + ")";   // El Niño (8)
compedNFD == decompedNFD;                        // true

The whole point of normalization is that the same character can be written either with its dedicated code point or as a combination of several more basic characters. That is exactly the situation with the letter “ñ” in the Spanish phrase “El Niño”. We can write it either with the dedicated Unicode value “\u00f1”, or with two separate Unicode values: “\u006e” for the basic letter “n” and “\u0303” for the combining tilde placed above it.

You and I, as consumers of the result, cannot visually tell the difference, but the moment we start using the resulting string for searching or sorting, we might get something we did not bargain for. For that reason we have the normalize() method, which lets us as developers choose which representation the resulting string uses, so we can handle it properly in our code.
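To make the searching problem concrete, here is a small sketch. The stored text and the query are made-up data, assumed to come from two sources that encode “ñ” differently.

```javascript
// Hypothetical data: the same word typed by two different sources.
const stored = 'El Nin\u0303o';  // decomposed: "n" + combining tilde
const query = 'Ni\u00f1o';       // composed: single "ñ" code point

// A naive search compares code units, so it misses the match.
const naiveHit = stored.includes(query);

// Normalizing both sides to the same form first fixes the lookup.
const normalizedHit = stored.normalize('NFC').includes(query.normalize('NFC'));

console.log(naiveHit);       // false
console.log(normalizedHit);  // true
```

Which form you pick matters less than using the same one on both sides of the comparison.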

The first part shows that the two strings are not considered equal and even have different lengths.

The second part applies canonical composition to both variables. We get two new variables that are equal, and both use the shorter, composed version of the special character.

The third part applies the opposite principle, canonical decomposition, to both original variables, and we get yet another two new variables. This time, they both use the longer, decomposed version of the special character.

I used the term “canonical” without a more detailed explanation of what exactly it means. Let’s fix that now. In Unicode, two sequences of code points have “canonical equivalence” if they represent the same abstract characters and should always have the same visual appearance and behavior, for example being sorted in the same way.

With the “NFC” or “NFD” form argument, we can guarantee that two canonically equivalent strings end up as the same sequence of code points, so they compare as equal.
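That guarantee can be wrapped in a tiny comparison helper. This is only a sketch, and canonicallyEqual is a hypothetical name, not a built-in function.

```javascript
// Hypothetical helper: true when both strings are canonically equivalent.
function canonicallyEqual(a, b) {
  // 'NFC' would work just as well. What matters is using the same form on both sides.
  return a.normalize('NFD') === b.normalize('NFD');
}

console.log(canonicallyEqual('\u00f1', '\u006e\u0303'));  // true
console.log(canonicallyEqual('\u00f1', 'n'));             // false
```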

Then, there is a “compatibility normalization”. In Unicode, two sequences of code points are compatible if they represent the same abstract characters, and should be treated alike in some — but not necessarily all — applications.

let str1 = '\uFB00';
let str2 = '\u0066\u0066';

str1 + " (" + str1.length + ")";                // ﬀ (1)
str2 + " (" + str2.length + ")";                // ff (2)
str1 === str2;                                  // false

let str1_NFKC = str1.normalize('NFKC');
let str2_NFKC = str2.normalize('NFKC');

str1_NFKC + " (" + str1_NFKC.length + ")";      // ff (2)
str2_NFKC + " (" + str2_NFKC.length + ")";      // ff (2)
str1_NFKC === str2_NFKC;                        // true

let str1_NFKD = str1.normalize('NFKD');
let str2_NFKD = str2.normalize('NFKD');

str1_NFKD + " (" + str1_NFKD.length + ")";      // ff (2)
str2_NFKD + " (" + str2_NFKD.length + ")";      // ff (2)
str1_NFKD === str2_NFKD;                        // true

We can say that if two sequences are canonically equivalent, they are compatible as well. The reverse is not always true.

Compatible sequences should be treated as equivalent in some respects (such as sorting), but not in others (such as visual appearance), which is why they are not canonically equivalent.

We can call normalize() with the “NFKD” or “NFKC” argument to produce a form of the string that will be the same for all compatible strings.

In this example, we see that the ligature “ﬀ” is not the same as two letters “f” side by side. If we wanted to sort or search this sequence, we would want to treat the symbol as two letters “f”, while still keeping the visual difference when displaying it. Therefore these two sequences are compatible, but not canonically equivalent.

The last example shows the different results we get from applying each of the forms to the same string.
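One way to use this in practice is to keep the original string for display and build a separate normalized key for comparing or searching. The sketch below assumes exactly that. The helper name compatibilityEqual and the sample word are made up for illustration.

```javascript
// Hypothetical helper: true when both strings are compatibility-equivalent.
function compatibilityEqual(a, b) {
  return a.normalize('NFKC') === b.normalize('NFKC');
}

const ligature = '\uFB00';  // "ﬀ" as a single ligature code point
const plain = 'ff';

// Canonical normalization leaves the ligature alone, so the strings still differ.
console.log(ligature.normalize('NFC') === plain.normalize('NFC'));  // false
console.log(compatibilityEqual(ligature, plain));                   // true

// Keep the original for display, use a normalized key for searching.
const original = 'o\uFB00ice';
const searchKey = original.normalize('NFKC');
console.log(original.includes('ffi'));   // false
console.log(searchKey.includes('ffi'));  // true
```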

// ORIGINAL SEQUENCE
// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
let str = '\u1E9B\u0323';

// NFC
// U+1E9B: LATIN SMALL LETTER LONG S WITH DOT ABOVE
// U+0323: COMBINING DOT BELOW
let nfc = str.normalize('NFC');
nfc + " (" + nfc.length + ")";                 // ẛ̣ (2)
[...nfc];                                      // ["ẛ", "̣"]

// NFD
// U+017F: LATIN SMALL LETTER LONG S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
let nfd = str.normalize('NFD');
nfd + " (" + nfd.length + ")";                 // ẛ̣ (3)
[...nfd];                                      // ["ſ", "̣", "̇"]

// NFKC
// U+1E69: LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
let nfkc = str.normalize('NFKC');
nfkc + " (" + nfkc.length + ")";               // ṩ (1)
[...nfkc];                                     // ["ṩ"]

// NFKD
// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
let nfkd = str.normalize('NFKD');
nfkd + " (" + nfkd.length + ")";               // ṩ (3)
[...nfkd];                                     // ["s", "̣", "̇"]

Each form produces a slightly different result. The first two forms abide by the rules of canonical equivalence, so the symbol keeps both its visual appearance and its behavior. The last two forms only guarantee compatibility, so the visual appearance changes: the long “ſ” becomes a regular “s”. Their results are canonically equivalent to each other, but only compatible with the results of the first two forms.
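Because combining marks are hard to tell apart on screen, it can help to print the code points instead of the characters. The following is a small sketch of that idea. The codePoints helper is a hypothetical name, not part of the String API.

```javascript
// Hypothetical helper: list a string's code points as "U+XXXX" labels.
function codePoints(str) {
  return [...str].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const source = '\u1E9B\u0323';
const report = {};
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  report[form] = codePoints(source.normalize(form)).join(' ');
}

console.log(report.NFC);   // U+1E9B U+0323
console.log(report.NFD);   // U+017F U+0323 U+0307
console.log(report.NFKC);  // U+1E69
console.log(report.NFKD);  // U+0073 U+0323 U+0307
```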

Okay, that was yet another nightmarish method covered. It wasn’t fun, but you will thank me later when you encounter it in the wilderness of the web.

As always — thank you so much for your time and see you soon with the next method.

Written by Jakub Korch