Unicode text length - Python vs Elixir
Summary: Some characters (as perceived by users) are represented by multiple Unicode characters. Programmers should be aware of this when measuring text length or splitting text. This article describes such cases, and introduces a Python library for handling such character clusters.
Learning Elixir, or what is a grapheme?
I come from a Python and JS background. Like most programmers, I write code that handles text every once in a while. I’m used to writing len(string) or string.length to get the number of characters in a string, and have been doing so without much thought.
When I started working with Elixir, I read about the standard String module. Its documentation introduced the concept of graphemes, which apparently meant that some of my understanding of Unicode text was wrong.
A grapheme is, according to the Unicode glossary:
(1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter a and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.
For traditional ASCII text (without CRLF-line endings), all graphemes are represented by a single Unicode code point, but that is not always the case. Some graphemes are represented by a sequence of two or more Unicode characters (code points).
The default rules for graphemes are defined in Unicode Standard Annex #29.
These are some of the types of multi code point graphemes that may exist. Note that all Python code examples are in Python 3, so strings are Unicode strings by default (byte strings are a whole other beast).
Some characters have diacritics added to them, such as å, ä and ö (used in my native language, Swedish). Some common characters with diacritic marks have a single Unicode character (including å, ä and ö), but they may also be represented by the base character followed by the Unicode character for the diacritic. Although "ä" and "a\u0308" represent the same character/grapheme, they are not equal in Python or most other languages. Nor do they have the same length. This can be handled by using unicodedata.normalize, but that only works for the diacritic combinations that have a designated Unicode character.
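As a minimal sketch of the comparison above, using only the standard library (the variable names are just for illustration):

```python
import unicodedata

precomposed = "\u00e4"  # ä as a single code point
decomposed = "a\u0308"  # a followed by U+0308 (Combining Diaeresis)

print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# NFC composes base + combining mark into the precomposed form where one exists.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)            # True
print(len(composed))                      # 1
```

Normalizing both sides to the same form before comparing makes the two spellings equal, but it does not help for combinations that have no precomposed code point.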
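Some characters with diacritic marks do not have a designated Unicode character, so they can only be written as a base character followed by one or more combining marks. Others, such as the special characters used in Esperanto (ĉ, ĝ, ĥ, ĵ, ŝ, and ŭ), do have a designated character but may still appear in decomposed form: ĉ is either U+0109 (Latin Small Letter C With Circumflex) or U+0063 (Latin Small Letter C) + U+0302 (Combining Circumflex Accent).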
The Korean alphabet, Hangul, has 40 letters. They are, however, not written in sequence as in Latin script, but grouped into blocks which each form a single syllable. I don’t know Korean, but will include an example from the Wikipedia article:
That is, although the syllable 한 han may look like a single character, it is actually composed of three letters: ㅎ h, ㅏ a, and ㄴ n.
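A small sketch of how this shows up in Python, assuming canonical decomposition (NFD) is used to split the syllable. Note that NFD yields conjoining jamo code points, which are different from the standalone letters ㅎ, ㅏ and ㄴ shown above:

```python
import unicodedata

syllable = "\ud55c"  # 한 as a single precomposed code point
jamo = unicodedata.normalize("NFD", syllable)

print(len(syllable))  # 1
print(len(jamo))      # 3
print([f"U+{ord(ch):04X}" for ch in jamo])  # ['U+1112', 'U+1161', 'U+11AB']

# NFC puts the three jamo back together into one code point.
print(unicodedata.normalize("NFC", jamo) == syllable)  # True
```

Both forms are a single grapheme, even though len() reports 1 for one and 3 for the other.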
Some emojis support modifiers, which alter their appearance. The most common are probably the skin tone modifiers. For example, U+270A (Raised Fist) is rendered as ✊. But if the following character is 🏾 (U+1F3FE - Emoji Modifier Fitzpatrick Type-5), it’s rendered as ✊🏾.
This is probably the most common issue that modern web developers might encounter if not accounting for graphemes. How would you get the first character in the string "✊🏾 some text" in Python?
>>> "✊🏾 some text"[0]
'✊'
>>> "✊🏾 some text"[1:]
'🏾 some text'
National emoji flags are represented as a sequence of code points matching the country’s two-letter country code. So 🇸🇪 (Sweden/SE) is represented as two Unicode characters: U+1F1F8 (Regional Indicator Symbol Letter S) + U+1F1EA (Regional Indicator Symbol Letter E):
>>> "Welcome to Sweden: 🇸🇪"[-1]
'🇪'
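The mapping from country code to flag can be sketched as follows. The function name flag_from_country_code is made up for this illustration, and it assumes the input consists only of the ASCII letters A-Z (no validation is done):

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # Regional Indicator Symbol Letter A

def flag_from_country_code(code: str) -> str:
    """Map each ASCII letter A-Z to its regional indicator symbol."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in code.upper()
    )

sweden = flag_from_country_code("SE")
print(sweden)
print(len(sweden))  # 2
print([f"U+{ord(ch):X}" for ch in sweden])  # ['U+1F1F8', 'U+1F1EA']
```

Whether the pair is actually drawn as a flag depends on the font and platform; to Python it is simply two code points.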
Other weird shit
Today is pride day in Stockholm, so there is no better day to examine the text representation of 🏳️‍🌈. That simple and beautiful flag is actually represented by four Unicode characters:
- 🏳 U+1F3F3 (White Flag). A white flag, shown waving on a post. Traditionally used as a sign of surrender, the meaning of this flag as an emoji is less defined.
- ️ U+FE0F (Variation Selector-16). An invisible codepoint which specifies that the preceding character should be displayed with emoji presentation. Only required if the preceding character defaults to text presentation.
- U+200D (Zero Width Joiner). Zero Width Joiner (ZWJ) is a Unicode character that joins two or more other characters together in sequence to create a new emoji.
- 🌈 U+1F308 (Rainbow)
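The sequence listed above can be inspected from Python. This is only a sketch: the string is written with escapes so that the invisible code points are not lost, and the names printed come from the Unicode database bundled with the Python version in use, so they may be worded slightly differently from the list above:

```python
import unicodedata

# White flag + Variation Selector-16 + Zero Width Joiner + rainbow
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

print(len(rainbow_flag))  # 4
for ch in rainbow_flag:
    # Fall back to a placeholder if this Python's database lacks a name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```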
Python vs Elixir — what is a character?
(Modern) Python represents strings as a sequence of Unicode code points. All string operations related to character count are based on the code points, without any consideration of graphemes. So len(string) will return the number of Unicode code points in a string, and slicing and indexing will use the code point sequence and not the sequence of graphemes.
Elixir’s String module defaults to considering graphemes. String.length(string) will count the number of graphemes and String.slice(string, 0, 10) will return the first 10 graphemes (even if more than 10 Unicode code points are used to represent those graphemes).
For most use cases, the Elixir way of using graphemes for character boundaries is probably more correct. It does, however, come at a performance cost. Counting graphemes requires traversing the string, checking the type of each character against the previous state and the boundary rules. As such, it takes linear time in relation to the string length. Getting the number of code points in a Python string can be done in constant time.
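To illustrate why counting graphemes means walking the whole string, here is a deliberately simplified sketch. It is not the UAX #29 algorithm: the made-up function count_simple_clusters only attaches combining marks to the preceding character and knows nothing about emoji modifiers, ZWJ sequences, flags or Hangul jamo.

```python
import unicodedata

def count_simple_clusters(text: str) -> int:
    """Count code points that are not combining marks.

    Simplification: NOT a full grapheme cluster implementation, only
    combining marks are merged with the character before them.
    """
    count = 0
    for ch in text:  # one pass over every code point: O(n)
        if unicodedata.combining(ch) == 0:
            count += 1
    return count

text = "a\u0308 and o\u0308"
print(len(text))                    # 9, stored with the string: O(1)
print(count_simple_clusters(text))  # 7
```

A real implementation has to track more state per character than this, but the shape is the same: every code point must be examined once.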
Introducing the Python library grapheme
When reading up on this, I couldn’t find a simple-to-use Python library for grapheme-aware string operations. So I wrote grapheme, which contains a number of functions for counting and slicing based on grapheme character boundaries.
>>> import grapheme
>>> grapheme.slice("✊🏾 some text", 0, 1)
'✊🏾'
>>> grapheme.slice("✊🏾 some text", 1)
' some text'
>>> list(grapheme.graphemes("Welcome to Sweden: 🇸🇪"))[-1]
'🇸🇪'
There is also uniseg, with a very similar purpose. At the time of writing it is, however, still at Unicode version 6.2.0, so it lacks many recent rule changes and newly added characters. The current Unicode version is 10.0.0.
One should also consider PyICU, which is a Python wrapper for the ICU C library.