An Implementation of Unicode Normalization

Streaming NFC, NFD, NFKC & NFKD, normalization quick check, and normalization-preserving concatenation.

Sven Van Caekenberghe, Henrik Sperre Johansen

Unicode is an industry-wide international standard for the consistent encoding, representation, and handling of text expressed in almost all of the world’s writing systems. Unicode’s unifying character set contains tens of thousands of entries covering 129 modern and historic scripts, as well as multiple symbol sets.

The standard also includes related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order. Taken together, this covers a complex set of functionality for internationalization and localization of computer software.

This article describes one implementation of normalization in a high level language. The sections Decomposition and Composition are very technical and can be skipped.

The appendix describes how to get and install the actual software.

What is normalization and why do we need it?

Unicode, being a unifying character set, contains characters that allow similar results to be expressed in different ways.

A simple example in common languages (the Latin letter é, e-acute) is

LATIN SMALL LETTER E WITH ACUTE [U+00E9]

which can equivalently be written as

LATIN SMALL LETTER E [U+0065] + COMBINING ACUTE ACCENT [U+0301]

The former is a composed normal form, the latter a decomposed one. In visible glyphs, we can write this relationship as

é = e + ´

Characters with diacritical marks are not the only ones with multiple representations. For instance, the ellipsis character of three dots is equivalent to three separate dots in a row, if we are interested in semantic meaning.

HORIZONTAL ELLIPSIS [U+2026]
FULL STOP [U+002E] + FULL STOP [U+002E] + FULL STOP [U+002E]
… = . + . + .

Some characters have more than 2 representations. The Ångström symbol Å can be written in three ways.

ANGSTROM SIGN [U+212B]
LATIN CAPITAL LETTER A WITH RING ABOVE [U+00C5]
LATIN CAPITAL LETTER A [U+0041] + COMBINING RING ABOVE [U+030A]

It is possible to give many more examples in almost any language, most of which we would not understand. It is for example possible and quite common to have multiple combining marks on a single base letter.

Given that similar text can be written in different ways, we have a problem. How can we determine if two strings are equal? How can we find a substring in a string?

The answer is to convert the string to a well-known form. This is normalization. Unicode normalization is a set of rules based on tables and algorithms. It defines two kinds of normalization equivalence: canonical and compatible.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. Our first and third examples above (é and Å) are examples of canonical mappings.

Code point sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Our second example above (…) is an example of a compatible mapping. Other examples are different representations of the decimal digit 1, like the Roman numeral I or a 1 in a circle. In one sense they are the same, in another sense they are quite different.

Compatible equivalence is a superset of canonical equivalence. In other words each canonical mapping is also a compatible one, but not the other way around.

Composition is the process of combining marks with base letters (multiple code points are replaced by a single code point whenever possible). Decomposition is the process of taking already composed characters apart (single code points are split into multiple ones). Both processes are recursive.

An additional difficulty is that the normalized ordering of multiple consecutive combining marks must be defined. This is done using a concept called the Canonical Combining Class or CCC, a Unicode character property.

All combined, four normalization forms are defined:

  1. NFD — canonical decomposition and ordering
  2. NFC — composition after canonical decomposition and ordering
  3. NFKD — compatible decomposition and ordering
  4. NFKC — composition after compatible decomposition and ordering

       Decomposition   Composition
NFD    canonical       no
NFKD   compatible      no
NFC    canonical       yes
NFKC   compatible      yes
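
As a quick illustration of the difference, here is a small sketch using the streaming interface introduced later in this article, applied to the HORIZONTAL ELLIPSIS example:

(String with: (Character value: 16r2026)) readStream
unicodeCodePoints unicodeNFD
unicodeCharacters upToEnd
=> '…' "NFD leaves it alone, its mapping is compatible only"

(String with: (Character value: 16r2026)) readStream
unicodeCodePoints unicodeNFKD
unicodeCharacters upToEnd
=> '...' "NFKD replaces it with three FULL STOPs"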

Unicode Character Data

The Unicode standard consists of a large number of documents as well as a set of tables. The Unicode Character Database is the most important one. The first step was to make this data, all 29,189 entries as of version 8 of the Unicode standard, available for usage and exploration.

You can send the message #unicodeCharacterData to Character or Integer objects. This will return an instance of UnicodeCharacterData when found. You can then query this object for more information.

65 unicodeCharacterData
$A unicodeCharacterData
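
The returned object can then be queried for individual properties. A small sketch (selectors such as #isStarter and #decompositionMapping are the ones used later in this article; the exact printed results may vary with the UCD version loaded):

$é unicodeCharacterData isStarter
=> true
$é unicodeCharacterData decompositionMapping
=> #(101 769) "LATIN SMALL LETTER E + COMBINING ACUTE ACCENT"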

An inspector extension helps to interpret and understand the different fields.

The screenshot above shows, in the left pane, what inspecting the Unicode character data of our example letter é looks like. The right pane shows a large glyph view of the character involved.

Most standard fonts only support a limited subset of Unicode. Many code points will be displayed using the “missing” glyph (a vertical rectangle with a diagonal cross, a bit like ⊠). For better coverage, try switching to a font with wide Unicode support, such as Arial Unicode MS.

Inspecting the decomposition of a character is as simple as clicking on that field on the left, with the decomposed elements showing on the right.

There is only a single inspector extension, yet it is quite useful in many situations. The previous screenshot shows the elements (i.e. characters) of the string Ångström being inspected on the left, with the Unicode character data details shown on the right.

We also added an extension to our environment’s search system to allow you to look up characters by name:

The user entered horizontal el in the search field and got 5 entries back, the first one of which is previewed.

Implementation Choices

After building some prototypes and trying a couple of approaches, we arrived at our current implementation choices. We will implement all fundamental algorithms to work on code points (Unicode scalar values). As much as possible, we will implement them as operations on streams.

The reason is that in practice this feels right while offering maximal flexibility towards the future (other string or character representations). With a few conversion streams, and some convenience methods to help create these (#unicodeCodePoints and #unicodeCharacters), it becomes trivial to switch representations while preserving stream semantics.

'abc' readStream unicodeCodePoints upToEnd
=> #(97 98 99)
#(97 98 99) readStream unicodeCharacters upToEnd
=> 'abc'

Decomposition

The decomposition procedure is basically to look up each code point in the Unicode Character Database, and replace it with its decomposed sequence, if any. This must be done recursively.

The next steps depend on a character code point property called the Canonical Combining Class (#canonicalCombiningClass or #ccc). When the CCC is zero, the character code point is called a starter (#isStarter), else it is a non-starter (#isStarter not).

After the first decomposition phase, all sequences of non-starters in the decomposition, including non-starters up to the next starter in the input, must be sorted according to their Canonical combining class.
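
For example, e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301, CCC 230) and COMBINING DOT BELOW (U+0323, CCC 220) gets reordered so that the lower CCC comes first. A small sketch, assuming a plain code point read stream understands the streaming selectors introduced below:

#(16r0065 16r0301 16r0323) readStream
unicodeNFD upToEnd
=> #(101 803 769)
"the dot below (CCC 220) now precedes the acute (CCC 230)"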

We define a new class, UnicodeDecomposingReadStream, that wraps a code point read stream named input. The idea is that each time you read from it, it will decompose what it reads from its input. Since decomposition expands, we need a small internal FIFO buffer.

Object subclass: #UnicodeDecomposingReadStream
instanceVariableNames: 'input buffer'
classVariableNames: ''
category: 'Unicode-Normalization'

The basic read stream operation is next, which is equivalent in this case to returning the next code point after decomposition. We return and remove the first element from the internal buffer, filling it if necessary. EOF is implemented by returning nil.

next
buffer ifEmpty: [ self nextChunk ].
buffer ifEmpty: [ ^ nil ].
^ buffer removeFirst

Internally, the core operation is to get and fully decompose the next chunk of input, which we do into a buffer.

nextChunk
input atEnd
ifFalse: [
self addMappingFor: input next.
self addNonStarters ]

Since we’ll be doing the additions to the buffer respecting the CCC order, we’re done. The first sub operation is abstract and has to be implemented by a concrete subclass.

addMappingFor: codePoint
self subclassResponsibility

The first concrete subclass is UnicodeNFDReadStream, which applies the mapping only when there is a canonical decomposition. When there is no applicable decomposition, the code point is added as is.

addMappingFor: codePoint
| ucd |
ucd := [ codePoint unicodeCharacterData ]
on: NotFound do: [ ].
(ucd notNil and: [ ucd hasCanonicalDecomposition ])
ifTrue: [
ucd decompositionMappingDo: [ :each |
self addMappingFor: each ] ]
ifFalse: [
self addToBufferCccOrdered: codePoint ]

The second concrete subclass is UnicodeNFKDReadStream which uses a slight variation. Here, both canonical and compatible mappings are applied.

addMappingFor: codePoint
| ucd |
ucd := [ codePoint unicodeCharacterData ]
on: NotFound do: [ ].
(ucd notNil and: [ ucd hasDecomposition ])
ifTrue: [
ucd decompositionMappingDo: [ :each |
self addMappingFor: each ] ]
ifFalse: [
self addToBufferCccOrdered: codePoint ]

Note the recursive invocation of #addMappingFor: in both cases.

Additions to the buffer are done by appending code points while maintaining CCC ordering. This is done by scanning from the end backwards until the correct insertion point is found.

addToBufferCccOrdered: codePoint
| ccc index stop otherCCC |
ccc := [ codePoint unicodeCharacterData ccc ]
on: NotFound do: [ 0 ].
index := buffer size.
ccc = 0
ifFalse: [
stop := false.
[ index > 0 & stop not ] whileTrue: [
otherCCC := [ (buffer at: index)
unicodeCharacterData ccc ]
on: NotFound do: [ 0 ].
ccc < otherCCC
ifTrue: [ index := index - 1 ]
ifFalse: [ stop := true ] ] ].
buffer add: codePoint afterIndex: index

The second sub operation in getting the next chunk is to add any additional non-starters from the input, because it is possible that they have to be CCC reordered with what we already have in the buffer.

addNonStarters
| stop |
stop := false.
[ input atEnd | stop ] whileFalse: [
([ input peek unicodeCharacterData isStarter not ]
on: NotFound do: [ false ])
ifTrue: [ self addMappingFor: input next ]
ifFalse: [ stop := true ] ]

This handful of methods fully implements the core of decomposition. Using it in its raw form consists of chaining a couple of read streams.

'élève' readStream unicodeCodePoints 
unicodeNFD
unicodeCharacters upToEnd
=> 'e´le`ve'

In the result above, the combining marks are shown combined with a space, else they would not display clearly.

Composition

The elementary composition procedure operates between two starters, combining pairs, unless blocked. Combination is blocked when a code point with an out of order Canonical Combining Class interferes.

Because UnicodeCharacterData stores a mapping from a precomposed character code point to its constituents only, it is more efficient to precompute the reverse mapping. We create UnicodeComposer as a helper class.

Object subclass: #UnicodeComposer
instanceVariableNames: 'combinations combiningCharacters'
classVariableNames: 'Default'
category: 'Unicode-Normalization'

When necessary, the following initialization is invoked. A default instance of this class is cached in the Default class variable to avoid doing this process more than once.

initializeForComposition
combinations := IdentityDictionary new.
combiningCharacters := IdentitySet new.
UnicodeCharacterData database valuesDo: [ :each |
each isPrimaryComposite ifTrue: [ | combination |
combination := each decompositionMapping.
"combination first + combination second
= each codePoint"

combinations
at: combination first
ifPresent: [ :value |
value at: combination second put: each codePoint ]
ifAbsent: [ | value |
value := IdentityDictionary new
at: combination second
put: each codePoint;
yourself.
combinations at: combination first put: value ].
combiningCharacters add: combination second ] ]

Only primary composites are considered; these always map to two elements. The set of combining characters allows us to quickly determine whether a character can combine with another one. The combinations dictionary maps base characters to another dictionary, which in turn maps combining characters to the composed result.

Given the input e + ´, we first find that ´ is indeed a combining character. Then we find that e can be a base for composition. This gives us a map that takes ´ to é. Any of these conditions could be false, in which case no composition takes place.

primaryCombinationOf: first and: second
^ (self isCombiningCharacter: second)
ifTrue: [
self combinations
at: first
ifPresent: [ :compositions |
compositions at: second ifAbsent: [ ] ]
ifAbsent: [ ] ]

We won’t go into the details here, but there is a Unicode code block of 11,184 code points that is algorithmically defined instead of being table driven: the Hangul Syllables. Full test suite conformance can only be reached by fully implementing this part of the specification, which we did.
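
For reference, a minimal sketch of that algorithmic composition is shown below. The arithmetic follows the Unicode specification, the selector matches the one used in combinationOf:and: below, and the actual method in the package may differ in its details.

hangulCombinationOf: first and: second
| sBase lBase vBase tBase lCount vCount tCount lIndex vIndex |
"Algorithmic Hangul composition (a sketch following UAX #15).
Answer the combined syllable code point, or nil when first and second do not combine."
sBase := 16rAC00. lBase := 16r1100. vBase := 16r1161. tBase := 16r11A7.
lCount := 19. vCount := 21. tCount := 28.
"A leading consonant (L) followed by a vowel (V) forms an LV syllable"
((first between: lBase and: lBase + lCount - 1)
and: [ second between: vBase and: vBase + vCount - 1 ])
ifTrue: [
lIndex := first - lBase.
vIndex := second - vBase.
^ sBase + ((lIndex * vCount + vIndex) * tCount) ].
"An LV syllable followed by a trailing consonant (T) forms an LVT syllable"
((first between: sBase and: sBase + 11172 - 1)
and: [ (first - sBase) \\ tCount = 0
and: [ second between: tBase + 1 and: tBase + tCount - 1 ] ])
ifTrue: [ ^ first + (second - tBase) ].
^ nil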

combinationOf: first and: second
second ifNil: [ ^ nil ].
^ (self hangulCombinationOf: first and: second)
ifNil: [
self primaryCombinationOf: first and: second ]
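
For example, looking up the combination of e (U+0065) and COMBINING ACUTE ACCENT (U+0301) answers é (U+00E9). A small sketch, assuming a class-side accessor #default for the cached instance:

UnicodeComposer default
combinationOf: 16r0065 and: 16r0301
=> 233 "U+00E9, LATIN SMALL LETTER E WITH ACUTE"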

The elementary public operation is to compose a buffer, consisting of an initial starter followed by CCC ordered non-starters, modifying it in place.

composeBuffer: buffer 
| lastCCC index ccc combination |
lastCCC := 0.
index := 2.
[ index <= buffer size ] whileTrue: [
ccc := [ (buffer at: index) unicodeCharacterData ccc ]
on: NotFound do: [ 0 ].
combination := self combinationOf: buffer first
and: (buffer at: index).
(combination notNil
and: [ (lastCCC < ccc) | (lastCCC = 0) ])
ifTrue: [
buffer at: 1 put: combination.
buffer removeAt: index ]
ifFalse: [
index := index + 1.
lastCCC := ccc ] ]

We can now implement UnicodeComposingReadStream to process input chunk by chunk while composing each as a buffer. Each chunk will consist of an initial starter followed by CCC ordered non-starters and end with another starter. The last starter will become the initial starter for the next chunk.

Object subclass: #UnicodeComposingReadStream
instanceVariableNames: ‘input buffer first composer’
classVariableNames: ‘’
category: ‘Unicode-Normalization’

Again, we wrap an input code point read stream and use a FIFO buffer. This time, startup is a bit more complicated and less elegant, as an initial series of non-starters must be skipped. The instance variable first holds one of these non-starters until we are past that phase and have seen a starter, at which point it holds a special marker value. We hold the default instance of UnicodeComposer in composer.

next
self isFirstEmpty
ifFalse: [ ^ self consumeFirst ].
self shouldGetNextChunk
ifTrue: [ self nextChunk ].
self isFirstEmpty
ifFalse: [ ^ self consumeFirst ].
buffer ifEmpty: [ ^ nil ].
^ buffer removeFirst

The basic processing step is to get a next chunk and compose it.

nextChunk
"Initialize when needed,
try putting first starter in buffer"

first = #initialized
ifFalse: [
self scanFirstStarter
ifFalse: [ "Non-starter is in first" ^ self ] ].
"buffer = <starter1>"
[
self scanUntilStarter.
"buffer = <starter1> … <starter2>"
composer composeBuffer: buffer ]
doWhileFalse: [ buffer size > 1 or: [ input atEnd ] ]
"There has to be more than one element in the buffer
unless we're eof. Composition shrinks the buffer,
sometimes recursively, but can need additional starters"

The first phase is scanning for the first starter, immediately answering non-starters.

scanFirstStarter
| current |
"Find the first starter in input to use,
put it in buffer and return true.
Else put the non-starter in first and return false.
Switch to #initialized once we've seen
the first starter or when empty."

input atEnd ifFalse: [
current := input next.
([ current unicodeCharacterData isStarter ]
on: NotFound do: [ false ])
ifTrue: [
buffer addLast: current.
first := #initialized.
^ true ]
ifFalse: [
first := current.
^ false ] ].
first := #initialized.
^ false

Once in the second phase, we scan until a starter.

scanUntilStarter
| current |
"Put non-starters and the next starter from input
in the buffer, if any"

[ input atEnd ] whileFalse: [
current := input next.
buffer addLast: current.
([ current unicodeCharacterData isStarter ]
on: NotFound do: [ false ])
ifTrue: [ ^ self ] ]

Some smaller helper methods round out our implementation.

isFirstEmpty
^ first isNil or: [ first = #initialized ]
consumeFirst
| current |
current := first.
first := nil.
^ current
shouldGetNextChunk
"One element should remain in the buffer
for the next iteration unless we're eof"


^ buffer isEmpty
or: [ buffer size = 1 and: [ input atEnd not ] ]

Again, using composition in its raw form consists of chaining a couple of read streams.

'e´le`ve' readStream unicodeCodePoints 
unicodeCompose
unicodeCharacters upToEnd
=> 'élève'

In the input above, the combining marks are shown combined with a space, else they would not render clearly. To generate the correct input string you can evaluate the following expression.

#(101 769 108 101 768 118 101) readStream 
unicodeCharacters
upToEnd

Normal Forms

The decomposed normal forms, NFD and NFKD, are reached by decomposition alone. The example given at the end of the Decomposition section is enough.

The composed normal forms, NFC and NFKC, are the result of composition after decomposition. We can express this elegantly by chaining.

unicodeNFC
"Return a UnicodeComposingReadStream over the receiver
that streams over Integer code points composing them
after decomposing them canonically"


^ self unicodeNFD unicodeCompose
unicodeNFKC
"Return a UnicodeComposingReadStream over the receiver
that streams over Integer code points composing them
after decomposing them compatibly"


^ self unicodeNFKD unicodeCompose
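
For example, the decomposed code point sequence for élève from the composition example composes back in one chained expression. A small sketch, again assuming a plain code point read stream understands these selectors:

#(101 769 108 101 768 118 101) readStream
unicodeNFC
unicodeCharacters upToEnd
=> 'élève'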

Normalization Quick Check

The code described previously is relatively complex and slow. By first checking if normalization is really needed, things can be sped up quite a bit.

The Normalization Quick Check procedure consists of a simple algorithm that consults a number of tables. For each normalization form, character code points are classified with one of three values: yes (it can occur as is in this normalization form), no (it never can), or maybe (a full check is needed to decide).

normalizationQuickCheck: property forCodePoint: codePoint
"Return #Y (yes), #N (no) or #M (maybe) for property,
#NFC_QC, #NFD_QC, #NFKC_QC or #NFKD_QC"


^ (self normalizationQuickCheck at: property)
at: codePoint ifAbsent: [ #Y ]
normalizationQuickCheck: property forCodePointStream: stream
| result lastCCC codePoint ccc check |
result := #Y.
lastCCC := 0.
[ stream atEnd ] whileFalse: [
codePoint := stream next.
ccc := [ codePoint unicodeCharacterData ccc ]
on: NotFound do: [ 0 ].
(lastCCC > ccc and: [ ccc ~= 0 ]) ifTrue: [ ^ #N ].
check := self normalizationQuickCheck: property
forCodePoint: codePoint.
check = #N ifTrue: [ ^ #N ].
check = #M ifTrue: [ result := #M ].
lastCCC := ccc ].
^ result
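
For example, a small sketch (the quick check values come from the UCD, so exact results depend on the data loaded):

UnicodeCharacterData
normalizationQuickCheck: #NFD_QC
forCodePointStream: #(16rE9) readStream
=> #N "é has a canonical decomposition, so it can never appear in NFD"

UnicodeCharacterData
normalizationQuickCheck: #NFC_QC
forCodePointStream: #(16r65 16r301) readStream
=> #M "e + COMBINING ACUTE might compose, a full check is needed"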

Normalization High Level Interface

To operate on Strings, a small tool, UnicodeNormalizer, offers a high-level interface. This tool makes use of the Quick Check and some other properties to optimize the operations a bit, preventing unnecessary work. Next is the code for the canonical variants; the compatible variants are similar.

toNFC: string
“Return the NFC of string,
the canonical composition normal form”


(self isAlreadyNFC: string) ifTrue: [ ^ string ].
^ string readStream unicodeCodePoints
unicodeNFC unicodeCharacters upToEnd
isAlreadyNFC: string
"Latin1 strings are always in NFC"
^ string isByteString or: [
(UnicodeCharacterData normalizationQuickCheck: #NFC_QC
forString: string) = #Y ]
toNFD: string
"Return the NFD of string,
the canonical decomposition normal form"


(self isAlreadyNFD: string) ifTrue: [ ^ string ].
^ string readStream unicodeCodePoints
unicodeNFD unicodeCharacters upToEnd
isAlreadyNFD: string
"ASCII strings are always in NFD"
^ (string isByteString and: [ string isAsciiString ])
or: [ (UnicodeCharacterData
normalizationQuickCheck: #NFD_QC
forString: string) = #Y ]

Normalization Preserving Concatenation

Extracting substrings out of a normalized string maintains normalization. The reverse, string concatenation, does not. Characters at the join point might interact and compose. Consider the following:

éle + `ve = élève

The last e of the first string and the first ` of the second string combine under NF(K)C. Note also that the two strings of size 3 combine to a string of size 5.

The algorithm to efficiently concatenate while maintaining a specified normalization form is implemented in UnicodeConcatenator. The idea is to search backward from the end of the first string and forward from the beginning of the second string until we find so-called stable characters which cannot interact. Then the middle section is normalized. Care is taken to minimize copying and allocation.

concatenateString: first with: second
| lastStable firstStable middle result |
first ifEmpty: [ ^ second ].
second ifEmpty: [ ^ first ].
lastStable := first findLast:
self isStableCharacterBlock.
lastStable = 0 ifTrue: [ lastStable := 1 ].
firstStable := second findFirst:
self isStableCodePointBlock.
firstStable = 0 ifTrue: [ firstStable := second size ].
(lastStable = first size and: [ firstStable = 1 ])
ifTrue: [ ^ first , second ].
middle := (first copyFrom: lastStable to: first size)
, (second copyFrom: 1 to: firstStable).
middle := self normalizeString: middle.
result := first class new: (lastStable - 1 +
middle size +
second size — firstStable).
result
replaceFrom: 1
to: lastStable - 1
with: first
startingAt: 1;
replaceFrom: lastStable
to: lastStable + middle size - 1
with: middle
startingAt: 1;
replaceFrom: lastStable + middle size
to: result size
with: second
startingAt: firstStable + 1.
^ result
isStableCharacterBlock
| quickCheckProperty |
quickCheckProperty := self quickCheckProperty.
^ [ :each |
([ each unicodeCharacterData isStarter ]
on: NotFound do: [ true ]) and: [
(UnicodeCharacterData
normalizationQuickCheck: quickCheckProperty
forCodePoint: each codePoint) = #Y ] ]

The tool has to be initialized to a specific normal form before use.

UnicodeConcatenator forNFC 
concatenateString: 'abce'
with: '´def'
=> 'abcédef'

Conformance Tests

There is a test file in the standard specification for testing conformance to normalization. No less than 18,593 individual input sequences and their resulting 4 normal forms are part of this test suite.

The Pharo Unicode project passes all these tests 100%!

Usage Guidelines

In practice, you will choose one of the 4 normal forms for all of your strings. When text enters your application, you will have to make sure it is converted to your chosen normalization form.

Studies seem to indicate that more than 99% of all text on the internet is in NFC, making it the most popular normal form.
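
Converting incoming text once, at the boundary of your application, is typically enough. A small sketch using the UnicodeNormalizer shown earlier, assuming a plain new instance suffices:

UnicodeNormalizer new toNFC: 'Ångström'
=> 'Ångström' "already in NFC, the argument is returned as is"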

Should you use the safe form of concatenation? To be absolutely safe, yes, but in practice probably not all the time.

Appendix: The Pharo Unicode project

Pharo, a unique immersive live development environment, has pretty good support for Unicode. You can read more about the current situation in the first part of the Character Encoding and Resource Meta Description chapter of the Enterprise Pharo book.

The goal of the Pharo Unicode project is to gradually improve and expand Unicode support in Pharo. Much work remains to be done and contributions are more than welcome.

The code discussed in this article is available from SmalltalkHub in a project called Unicode. It was written for Pharo 4 and 5.

The easiest way to load the code is by using the Configuration Browser or Pharo Project Catalog: search for Unicode and click Install stable version. You can also load the code manually using its Metacello configuration.

Gofer it
smalltalkhubUser: 'Pharo' project: 'Unicode';
configuration;
loadStable.

Database files will be loaded automatically, on-demand, either from a URL over the internet or from a local file cache (extract UCD.zip next to your image). Processed datasets are then kept in your Pharo image.

Alternatively, you can download a prebuilt image containing everything as the latest successful build artifact from the Pharo Contribution CI job called Unicode.
