Or “How I Learned Just Enough About Unicode Implementations To Solve a Bug”

Sam Havens
Jun 17 · 6 min read
Photo by Fausto García on Unsplash

Last week, I was investigating a bug, and in the process learned quite a bit about Unicode. After identifying the source of the bug, I found more instances of it in the wild. I’m writing this piece to pass on what I learned. The bug shows up when you:

  • Have a Python 3 backend…
  • Are processing strings which mix normal text and emoji…
  • Are identifying spans of text based on string indexing (for example, using spaCy)…
  • And passing these indexes to a JavaScript front end

To make this tangible, here is an example of this bug I found in the wild, on the spaCy Rule-Based Matcher Demo. If you attempt to find the substring “firemen” in the larger string “👨🏻‍🚒 firemen drive firetrucks”, you see:

Uhm… 🤔

What is Unicode?

Unicode is [a] universal character set that defines the list of characters from [a] majority of writing systems and associates for every character [a] unique number (code point).

Unicode deals with characters as abstract terms. Every abstract character has an associated name, e.g. LATIN SMALL LETTER A. The rendered form (glyph) of this character is a.

A code point is a number assigned to a single character.

The code point is presented in the format U+<hex>, where U+ is a prefix that means Unicode and <hex> is a number in hexadecimal. For example U+0041 and U+2603 are code points.

Code points are numbers in a range from U+0000 to U+10FFFF.

- Dmitri Pavlutin, “What every JavaScript developer should know about Unicode”

To make this tangible: “👨🏻‍🚒” is known as “Man Firefighter: Light Skin Tone” and is made up of the code points “U+1F468, U+1F3FB, U+200D, U+1F692”, which are in turn MAN, EMOJI MODIFIER FITZPATRICK TYPE-1-2, ZERO WIDTH JOINER, and FIRE ENGINE.
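One way to check those names yourself is Python's standard unicodedata module. This is only a minimal sketch: the variable names are mine, and the string is written with escapes so the example does not depend on how your editor renders the emoji.

```python
import unicodedata

# The four code points behind the firefighter emoji, written as escapes.
firefighter = "\U0001F468\U0001F3FB\u200D\U0001F692"

# Pair each code point (in U+XXXX form) with its official Unicode name.
names = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in firefighter]

for code_point, name in names:
    print(code_point, name)
```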

Once you know that

The Fitzpatrick skin typing test is a numerical classification schema for human skin color. — [Wikipedia]

Then this starts to make a lot of sense. In fact, at least in iTerm2, if you try typing:

s = "👨🏻‍🚒 firemen drive firetrucks"

you don’t see “Man Firefighter: Light Skin Tone”, but instead:

Close?

Unicode in Python

No matter what your terminal displays, if we process this string with spaCy, we see:

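>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")  # assumption: the original does not say which model was loaded
>>> doc = nlp(s)
>>> for token in doc:
...     print(token.text, token.idx)
...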
👨 0
🏻‍ 1
🚒 3
firemen 5
drive 13
firetrucks 19

We can verify this with string slicing:

>>> s[0]
'👨'
>>> s[1]
'🏻'
>>> s[2]
'\u200d'
>>> s[3]
'🚒'
>>> s[5:12]
'firemen'

Unicode in JavaScript

At this point I opened up the Chrome Developer Console to compare:

s = "👨🏻‍🚒 firemen drive firetrucks"
"👨🏻‍🚒 firemen drive firetrucks"
s[0]
"�"
s[1]
"�"
s[2]
"�"
s[3]
"�"

🤔 — that is… not the same. Let’s check something in Python:

>>> s
'👨🏻\u200d🚒 firemen drive firetrucks'
>>> len(s)
29

Let’s see if JavaScript agrees with this:

s
"👨🏻‍🚒 firemen drive firetrucks"
s.length
32

Oh no, the JavaScript string is longer by 3…


More Unicode: Encodings

So, either Python or JavaScript is just straight up wrong, or len(s) and s.length mean different things.

What the Python string len(s) means

I’m not sure about this, so maybe someone will correct me in the comments or on Twitter, but AFAICT, Python string indexes refer to abstract Unicode code points, and len(s) counts them. Since most characters¹ are made up of one code point, s[i] will usually give you the letter at the ith position of a string s. However, Unicode allows code points that modify other code points. We’ve seen two examples: U+1F468 “MAN” can be followed by U+1F3FB “EMOJI MODIFIER FITZPATRICK TYPE-1-2” (or any of the other skin tone code points) to produce a light-skinned man; and a man or woman emoji (with an optional skin tone modifier) can be followed by U+200D U+1F692, “ZERO WIDTH JOINER” then “FIRE ENGINE” (🚒), to make a firefighter.

¹ I think the most accurate term is “grapheme.” Read the blog post I link to at the end of this article for more details.

A non-emoji example of a modifying character is U+0301 “COMBINING ACUTE ACCENT”:

>>> s = "cafe" + "\u0301"
>>> s
'café'
>>> len(s)
5

A fun Python trick is that you can also refer to Unicode characters by their names, so this is also valid Python 3:

>>> s = "cafe" + "\N{COMBINING ACUTE ACCENT}"
>>> s
'café'
>>> len(s)
5
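To see which code points in a string are modifiers of this kind, the standard unicodedata.combining function returns a nonzero class for combining marks. The sketch below counts only the base characters. It is a rough illustration, not a grapheme counter, and it does not treat emoji modifiers or zero width joiners specially.

```python
import unicodedata

s = "cafe" + "\N{COMBINING ACUTE ACCENT}"

# combining() returns 0 for ordinary characters and a nonzero class for combining marks.
base_chars = [ch for ch in s if unicodedata.combining(ch) == 0]

print(len(s))           # number of code points
print(len(base_chars))  # number of code points that are not combining marks
```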

What JavaScript String.length means

JavaScript strings exist at a lower level of abstraction. Rather than referring to an abstract code point, the index of a string in JavaScript refers to a number determined by the “encoding” of the string.

IMO, this is a failure of the language. Unicode is a leaky abstraction in JS; that is, you have to understand how the abstraction is implemented in order to use it properly. In JavaScript, strings are implemented using UTF-16, an encoding which maps an abstract Unicode code point to one or two 16-bit numbers (called code units), usually written in hexadecimal.

Here are some characters that JavaScript represents with one code unit:

'\u0041\u0042\u0043'
'ABC'

'I \u2661 JavaScript!'
'I ♡ JavaScript!'

However, “or two” is where our problems are coming from:

'\uD83D\uDCA9'
'💩' // U+1F4A9 PILE OF POO
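The pair above follows a fixed rule: code points below U+10000 are stored as themselves, and anything higher is split into a “surrogate pair.” Here is a sketch of that rule in Python. The function name utf16_code_units is made up for illustration, and it assumes it is given a valid code point rather than checking for one.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point.

    Assumes a valid code point; no range or surrogate checks are done.
    """
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit with the same value.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x2661)])   # WHITE HEART SUIT
print([hex(u) for u in utf16_code_units(0x1F4A9)])  # PILE OF POO
```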

Going back to our Light Skinned Firefighter Man: The abstract Unicode sequence representation is four Unicode code points:

U+1F468, U+1F3FB, U+200D, U+1F692

However, when encoded as UTF-16, these four code points are represented as seven 16-bit code units, shown here in hexadecimal:

0xD83D 0xDC68 0xD83C 0xDFFB 0x200D 0xD83D 0xDE92

Hmmm… so it takes 3 more slots to represent this symbol in JavaScript than it does in Python…
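Python can reproduce this count by encoding the string as UTF-16 and reading the result back two bytes at a time. This is a small sketch with variable names of my own choosing; the big-endian codec is used so that no byte order mark is added.

```python
import struct

firefighter = "\U0001F468\U0001F3FB\u200D\U0001F692"

# Encode as big-endian UTF-16 (no byte order mark), then read 16-bit units.
encoded = firefighter.encode("utf-16-be")
code_units = struct.unpack(f">{len(encoded) // 2}H", encoded)

print(len(firefighter))  # code points, as Python counts them
print(len(code_units))   # UTF-16 code units, as JavaScript counts them
print(" ".join(f"0x{unit:04X}" for unit in code_units))
```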


Resolution

Now we see where the problem is coming from. The specific example in Python:

>>> s
'👨🏻\u200d🚒 firemen drive firetrucks'
>>> len(s)
29

Versus in JavaScript:

s
"👨🏻‍🚒 firemen drive firetrucks"
s.length
32

We have two options for fixing this:

  1. Force Python to encode the string in UTF-16 before computing string indexes
  2. Force JavaScript to work at a higher level of abstraction
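For completeness, option 1 can be sketched on the Python side as an offset conversion: count how many UTF-16 code units come before a given code point index. The helper to_utf16_offset is hypothetical, not part of spaCy or the standard library, and it re-encodes the prefix on every call, so treat it as an illustration rather than something tuned for long documents.

```python
def to_utf16_offset(text, index):
    """Convert a Python (code point) index into a UTF-16 code unit offset."""
    # Each UTF-16 code unit is 2 bytes, so count the bytes before `index` and halve.
    return len(text[:index].encode("utf-16-le")) // 2


s = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks"
start, end = 5, 12  # where Python finds "firemen"

js_start, js_end = to_utf16_offset(s, start), to_utf16_offset(s, end)
print(js_start, js_end)
```

With this string the converted offsets come out as 8 and 15, which are the values a plain s.slice(8, 15) in JavaScript would need in order to select “firemen.”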

Some aesthetic sense in me strongly favors the latter, and it seems that the designers of ECMAScript agree. In newer versions of JavaScript, you can get it to handle Unicode like Python with the following trick:

s = "👨🏻‍🚒 firemen drive firetrucks"
"👨🏻‍🚒 firemen drive firetrucks"
s.length
32
[...s].length
29

That is, spreading a string into an array splits it by code point rather than by UTF-16 code unit.

[...s]
(29) ["👨", "🏻", "‍", "🚒", " ", "f", "i", "r", "e", "m", "e", "n", " ", "d", "r", "i", "v", "e", " ", "f", "i", "r", "e", "t", "r", "u", "c", "k", "s"]

As a reminder, we’re trying to identify the indexes of the substring “firemen.” Python says:

>>> s[0]
'👨'
>>> s[1]
'🏻'
>>> s[2]
'\u200d'
>>> s[3]
'🚒'
>>> s[5:12]
'firemen'

But the naive JavaScript doesn’t work:

s.slice(5,12)
"🚒 fire"

However, we can use the spread syntax to clean this up:

[...s].slice(5,12).join('')
"firemen"

We can even encapsulate this in a function:

const unicodeSlice = (s, start, end) => [...s].slice(start, end).join('')

unicodeSlice("👨🏻‍🚒 firemen drive firetrucks", 5, 12)
"firemen"

Now our problem is solved! And as an added bonus, we’ve plugged the leak in JavaScript’s abstraction for how it handles Unicode.
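As a sanity check from the Python side, the sketch below reproduces both behaviors: slicing by UTF-16 code units, which is in effect what the naive JavaScript slice does, and slicing by code points. The helper slice_by_code_units is a made-up name for illustration only.

```python
s = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks"


def slice_by_code_units(text, start, end):
    """Slice by UTF-16 code units, roughly what JavaScript's s.slice does."""
    encoded = text.encode("utf-16-le")
    # Two bytes per code unit; errors="replace" covers cuts through a surrogate pair.
    return encoded[2 * start:2 * end].decode("utf-16-le", errors="replace")


naive = slice_by_code_units(s, 5, 12)  # indexes counted in code units
fixed = s[5:12]                        # indexes counted in code points
print(repr(naive), repr(fixed))
```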

Better Programming

Advice for programmers.

Sam Havens

Written by

I used to teach and study math and physics, now I do Natural Language Processing.
