The Curious Case of String.slice/3

Dima Ermilov
Sep 13, 2020


Elixir works wonderfully with strings. Some of its edge cases can be a little confusing, though. I’d like to walk through a few of them.

Say we have some text data, like a company name, a web page to show it, and a database to store it. Say we didn’t design either the web page or the DB, both have an 80-character limit, and there’s nothing we can do about it.

To keep things smooth, we would like to add some trimming to our changeset.
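A minimal sketch of such a trimmer, assuming the limit is applied with String.slice/3. The module name NameTrimmer is made up for illustration, and the Ecto wiring is only hinted at in a comment so the snippet runs without dependencies:

```elixir
defmodule NameTrimmer do
  @max_length 80

  # Keeps at most @max_length graphemes, counting from the start of the string.
  def trim(name) when is_binary(name) do
    String.slice(name, 0, @max_length)
  end
end

# In a changeset this could be plugged in with something like
# Ecto.Changeset.update_change(changeset, :name, &NameTrimmer.trim/1)

IO.inspect(NameTrimmer.trim("Acme Corporation"))
```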

Looks nice and clean and works fine until it doesn’t:
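The mismatch can be reproduced without a database. The name below is a made-up stand-in, not the real data: 31 decomposed "é" letters (an "e" followed by a combining acute accent, two codepoints each) and then plain letters, chosen so that the counts line up with the ones in this post.

```elixir
# Hypothetical input: 31 two-codepoint graphemes followed by 69 ASCII letters.
long_name = String.duplicate("e\u0301", 31) <> String.duplicate("a", 69)

# String.slice/3 cuts by graphemes, so we get exactly 80 of them...
name = String.slice(long_name, 0, 80)

IO.inspect(String.length(name))             # graphemes: 80
IO.inspect(length(String.codepoints(name))) # codepoints: 111
```

A column limited to 80 characters that counts codepoints would see 111 of them and reject the value.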

Wait, what? Oh yeah. Elixir is very smart and Unicode-aware. It knows about graphemes, and String.slice/3 counts each of them as a single character, just like a native reader would. Isn’t it great? It lets us show complex foreign texts without breaking them literally in the middle of a letter. The only downside of this approach is that databases are usually not that smart. They count characters by codepoints, not by graphemes.

iex> name |> String.codepoints |> Enum.count
111

So some of these very beautiful letters are two codepoints, not one. OK, we can deal with it.
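One way to deal with it is to cut by codepoints instead. This is only a sketch of that idea; the module name CodepointTrimmer is hypothetical:

```elixir
defmodule CodepointTrimmer do
  # Keeps at most `max` codepoints, ignoring grapheme boundaries entirely.
  def trim(string, max \\ 80) do
    string
    |> String.codepoints()
    |> Enum.take(max)
    |> Enum.join()
  end
end

# "a" followed by a decomposed "é" (e + combining accent): 3 codepoints.
IO.inspect(CodepointTrimmer.trim("ae\u0301", 2))
```

The result always fits the codepoint limit, but in this call the cut lands between the "e" and its accent.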

The downside of cutting by codepoints is that graphemes stay intact only if we’re lucky. So probably we should count by codepoints but cut by graphemes. Enum.reduce_while/3 is at your service.
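A straightforward version of that idea might look like the sketch below. The module name NaiveTrimmer is an assumption made for illustration:

```elixir
defmodule NaiveTrimmer do
  # Appends graphemes one by one and stops before the codepoint limit is exceeded.
  def trim(string, max \\ 80) do
    string
    |> String.graphemes()
    |> Enum.reduce_while("", fn grapheme, acc ->
      candidate = acc <> grapheme

      # The codepoints of the whole candidate string are counted on each step.
      if length(String.codepoints(candidate)) <= max do
        {:cont, candidate}
      else
        {:halt, acc}
      end
    end)
  end
end

IO.inspect(NaiveTrimmer.trim("ae\u0301", 2))
```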

This will keep adding graphemes to the acc as long as the codepoint count of the resulting string is less than or equal to 80. The only downside is that we have O(n²) here, eww. Let’s fix it:
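One possible fix, sketched under the assumption that the quadratic cost comes from recounting the accumulated string: carry a running codepoint count in the accumulator and only count each new grapheme. LinearTrimmer is a hypothetical name.

```elixir
defmodule LinearTrimmer do
  # Cuts by graphemes while counting codepoints, visiting each grapheme once.
  def trim(string, max \\ 80) do
    {kept, _count} =
      string
      |> String.graphemes()
      |> Enum.reduce_while({[], 0}, fn grapheme, {kept, count} ->
        # Only the new grapheme is counted; the running total does the rest.
        new_count = count + length(String.codepoints(grapheme))

        if new_count <= max do
          {:cont, {[grapheme | kept], new_count}}
        else
          {:halt, {kept, count}}
        end
      end)

    # Graphemes were prepended, so restore the original order before joining.
    kept |> Enum.reverse() |> Enum.join()
  end
end

IO.inspect(LinearTrimmer.trim("ae\u0301", 2))
```

Collecting the graphemes in a list and joining once at the end also avoids building a new binary on every step.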

Are we done here? Well, kind of yes. And kind of no.

Level Up: HTML maxlength and Other Animals

So far so good. Ruby and Python count string length the same way, by codepoints. Our new fancy trimmer works perfectly with any regular database, even with MySQL (if we have utf8mb4 charset for the table).

But sometimes, on very rare occasions, it fails spectacularly.

Those occasions are:

  • JavaScript string length checkers in general and Node.js backends in particular,
  • HTML input areas with maxlength attribute, and last but not least
  • Salesforce API text fields. Same thing, only nastier, will happen if we try to insert this into the Salesforce Name field: STRING_TOO_LONG:Name: data value too large.

How in the world is it possible? A bit of googling gives us the answer:

JavaScript’s internal string representation is UTF-16, not UTF-8. The String length property returns the number of UTF-16 code units in that representation. In most cases, one Unicode code point takes one UTF-16 code unit. But code points outside the Basic Multilingual Plane take two, and such a pair of units is called a “surrogate pair”. For example, most single-codepoint emoji are two UTF-16 code units. This family emoji 👨‍👩‍👧‍👧 is one grapheme, 7 code points, and 11 UTF-16 code units.

So what shall we do? Luckily, Elixir has access to Erlang’s :unicode module, whose characters_to_binary/3 can convert a string to UTF-16. The only thing left is to count the 2-byte code units in the result, so the counter would look like this:
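A minimal sketch of such a counter. The module and function names are made up for illustration, and the input is assumed to be valid UTF-8 (for invalid input :unicode.characters_to_binary/3 returns an error tuple instead of a binary):

```elixir
defmodule Utf16Counter do
  # Number of UTF-16 code units, i.e. what JavaScript's String length reports.
  def count(string) when is_binary(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end
end

# The family emoji: man, woman, girl, girl, glued together by zero-width joiners.
family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}"

IO.inspect(length(String.codepoints(family))) # 7
IO.inspect(Utf16Counter.count(family))        # 11
```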

Let’s make our trimmer a bit more universal by passing both the counter function and the trim length to it.
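Here is a sketch of what the generalized trimmer could look like. All names (FlexibleTrimmer, codepoint_count, utf16_count) are hypothetical, and the counter is assumed to return the length of a single grapheme in whatever unit the target system uses:

```elixir
defmodule FlexibleTrimmer do
  # Counts Unicode code points, the way a typical database does.
  def codepoint_count(string), do: length(String.codepoints(string))

  # Counts UTF-16 code units, the way JavaScript's String length does.
  def utf16_count(string) do
    string
    |> :unicode.characters_to_binary(:utf8, :utf16)
    |> byte_size()
    |> div(2)
  end

  # Cuts by graphemes; `counter` decides how much each grapheme costs.
  def trim(string, max, counter) when is_function(counter, 1) do
    {kept, _count} =
      string
      |> String.graphemes()
      |> Enum.reduce_while({[], 0}, fn grapheme, {kept, count} ->
        new_count = count + counter.(grapheme)

        if new_count <= max do
          {:cont, {[grapheme | kept], new_count}}
        else
          {:halt, {kept, count}}
        end
      end)

    kept |> Enum.reverse() |> Enum.join()
  end
end

smileys = String.duplicate("\u{1F600}", 10)

# One codepoint per smiley: five of them fit into a limit of 5.
IO.inspect(FlexibleTrimmer.trim(smileys, 5, &FlexibleTrimmer.codepoint_count/1))
# Two UTF-16 code units per smiley: only two of them fit.
IO.inspect(FlexibleTrimmer.trim(smileys, 5, &FlexibleTrimmer.utf16_count/1))
```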

Et voilà. Now we can trim like Postgres and trim like Salesforce while keeping our graphemes intact.

P.S. Like many of us, I never had a chance to get a proper CS education, and I’m new to Elixir too, so please feel free to add your thoughts or to correct my mistakes.

Cheers,
Dima

