Finally moving past “💩”.length === 2

Jackson Tenclay
4 min read · Jul 1, 2017


As a sad reminder: JavaScript has a problem with Unicode. It was originally designed around Unicode’s Basic Multilingual Plane, which contains the letters and symbols for the majority of the world’s writing systems. The thing is, the world’s a big place, and it turns out we needed a little more. Anything beyond those 65,536 code points was relegated to one of the supplementary planes, which can house enough code points to deserve their nickname: the “astral planes.”

That’s all great and good, except that our emoji live up in the astral planes, too. A JavaScript string is a sequence of 16-bit code units, so anything above code point U+FFFF doesn’t fit in a single unit, and the language has to use a little workaround called surrogate pairs (see the original article for a detailed explanation about this whole mess). The short version is that to represent astral characters like 💩, JavaScript stores two reserved code units and processes them together as one astral character. When the console says that "💩".length === 2, it’s not counting the single code point U+1F4A9. It’s counting "\uD83D\uDCA9", the two-code-unit surrogate pair that stands in for it. Life’s tough 😞.
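To make the surrogate pair business a bit more concrete, here is a small sketch of the arithmetic. The helper name toSurrogatePair is made up for this illustration and isn't from either article; the strings use escapes so nothing depends on how the file is encoded.

```javascript
// Hypothetical helper (not from the article): split an astral code point
// (anything above U+FFFF) into its two UTF-16 surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;     // 20 bits left to encode
  var high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

var poo = "\u{1F4A9}"; // 💩

var hexPair = toSurrogatePair(0x1F4A9).map(function (unit) {
  return unit.toString(16);
});

console.log(hexPair.join(", "));              // d83d, dca9
console.log(poo === "\uD83D\uDCA9");          // true: same string, two spellings
console.log(poo.length);                      // 2 (code units)
console.log(poo.codePointAt(0).toString(16)); // 1f4a9 (the actual code point)
```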

The beginning of a workaround

Like Jonathan wrote in his article, there are ways around this. I’m not going to cover these points in depth because he’s a better storyteller than I am 😅. Here’s a summary:

  • Jonathan used Array.from("string").length, which converts the string to an Array (keeping surrogate pairs together) and then takes the length of that array:
    Array.from("💩").length === 1. One down!
  • A little code point called variation selector-16 makes sure that old dingbats that have since been updated get displayed in their more presentable emoji version. That way you can see a nice ❤️️ instead of the sad old ❤. JavaScript doesn’t get that, though, so we have to make sure it ignores that code point when computing length.
  • Another code point called the zero width joiner chains emoji together to form a new one. For example, 👩‍🔬 (woman scientist) uses the ZWJ to combine 👩 (woman) and 🔬 (microscope). Nice, eh? Same treatment as before: we tell JavaScript to ignore the ZWJ along with whatever emoji comes after it.
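Here's a quick way to see why Array.from alone doesn't finish the job. This is only a sketch for illustration; the variable names are mine, and the emoji are written as escapes so the invisible code points are actually visible.

```javascript
var poo = "\uD83D\uDCA9";                         // 💩
var heart = "\u2764\uFE0F";                       // ❤ + variation selector-16
var scientist = "\uD83D\uDC69\u200D\uD83D\uDD2C"; // 👩 + ZWJ + 🔬

// Array.from walks the string by code point, so surrogate pairs stay together...
console.log(Array.from(poo).length);       // 1

// ...but VS16 and the ZWJ are code points of their own, so they still get counted.
console.log(Array.from(heart).length);     // 2
console.log(Array.from(scientist).length); // 3

// Plain .length is worse still, since it counts every code unit.
console.log(scientist.length);             // 5
```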

Unfortunately, that’s where the article ends — with a solid 🤷. As far as I knew, the matter was left unresolved. And I went merrily on my way until the day this exact problem happened to me as well.

Emoji social network???

I was working on a little project where the general premise was that people could send each other emoji — no more, no less (remember Yo?). All your received emoji would huddle around each other in a little circle (fun) so you wouldn’t have to see them in a list (not fun). But this all rode on the basis that each emoji message was a single character.

I wrote the logic to make sure that Emoji.length === 1…except that then everything broke.

I read through Jonathan’s article (again), tried to find my way through an ES6 example from 2013, a PHP solution, one in C#…but didn’t see any for JavaScript. So I tried my best to make my own.

The solution

(Did you skip straight here? Shame on you.)

I ended up starting from scratch, taking in an emoji string and running a series of “replacement” operations on it to see if I could take out all the modifiers. Something like this:

var emojiLength = function(string) {
  // each <something> stands in for a real pattern and its replacement
  string = string.replace(/<something>/g, "<something>");
  return string.length;
};

First I tried the original three steps.

  • Make the surrogate pairs behave as a single unit by replacing them with a dummy character:
    string = string.replace(/[\ud800-\udbff][\udc00-\udfff]/g, "_");
  • Take out variation selector-16:
    string = string.replace(/\ufe0f/g, "");
  • Remove the zero width joiner along with the character that immediately follows:
    string = string.replace(/\u200d./g, "");
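Dropped into the skeleton from earlier, those three replacements might fit together like this. It's a sketch of this first attempt only, assembled from the lines above, with test strings written as escapes.

```javascript
var emojiLength = function(string) {
  // 1. Collapse each surrogate pair into a single dummy character.
  string = string.replace(/[\ud800-\udbff][\udc00-\udfff]/g, "_");
  // 2. Take out variation selector-16.
  string = string.replace(/\ufe0f/g, "");
  // 3. Remove the zero width joiner and the character right after it.
  string = string.replace(/\u200d./g, "");
  return string.length;
};

console.log(emojiLength("\uD83D\uDCA9"));                   // 1 (💩)
console.log(emojiLength("\u2764\uFE0F"));                   // 1 (❤ + VS16)
console.log(emojiLength("\uD83D\uDC69\u200D\uD83D\uDD2C")); // 1 (👩 + ZWJ + 🔬)
```

The order matters here: step 3 removes exactly one character after the ZWJ, which only works because step 1 has already squashed the following emoji down to a single "_".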

But even though that handled a lot of emoji, it left me with a couple randoms that didn’t work.

emojiLength("🙎🏾") === 2    // whyyy

Turns out I forgot about the skin tone modifiers. Also turns out I forgot characters like 3️⃣ 1️⃣ 2️⃣ which have their own random modifier. Also turns out that I forgot about every country flag. Also turns out…
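To give a rough idea of what handling those extra cases can look like, here is a minimal sketch. It is not the function from this article: the name emojiLengthSketch and the exact patterns are my own assumptions, based on the skin tone modifiers living at U+1F3FB through U+1F3FF, the keycap modifier being U+20E3 (combining enclosing keycap), and country flags being pairs of regional indicators from U+1F1E6 through U+1F1FF.

```javascript
// Illustrative sketch only; name and patterns are assumptions, not the author's code.
var emojiLengthSketch = function(string) {
  // Skin tone modifiers (U+1F3FB..U+1F3FF) attach to the emoji before them: drop them.
  string = string.replace(/\ud83c[\udffb-\udfff]/g, "");
  // A country flag is two regional indicators (U+1F1E6..U+1F1FF): count the pair once.
  string = string.replace(/(?:\ud83c[\udde6-\uddff]){2}/g, "_");
  // Every remaining surrogate pair counts as one character.
  string = string.replace(/[\ud800-\udbff][\udc00-\udfff]/g, "_");
  // Drop variation selector-16 and the combining enclosing keycap (U+20E3).
  string = string.replace(/[\ufe0f\u20e3]/g, "");
  // Drop the zero width joiner along with the character that follows it.
  string = string.replace(/\u200d./g, "");
  return string.length;
};

console.log(emojiLengthSketch("\uD83D\uDE4E\uD83C\uDFFE")); // 1 (🙎 + skin tone)
console.log(emojiLengthSketch("3\uFE0F\u20E3"));            // 1 (keycap 3)
console.log(emojiLengthSketch("\uD83C\uDDFA\uD83C\uDDF8")); // 1 (flag: U + S)
```

This still has blind spots: a skin tone modifier sent on its own would count as zero, and flags built from tag sequences rather than regional indicators aren't handled at all.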

Anyway

I ended up with a function that gives me a length of one for any emoji I was able to find on the internet, as well as some other flags that aren’t well-supported yet. Check it out:

If you have any trouble with it, or if (heaven forbid) it just doesn’t work for you, let me know. Emoji are too good to pass up 🌚

P.S. I’ll be looking for a web dev job in the next couple months. Shoot me a line if you think we might work out.
