Dart string manipulation done right 👉

Published in Dart · 6 min read · Jun 30, 2020

Like many programming languages designed before emojis came to dominate our daily communications and before multilingual support became common in commercial apps, Dart represents a string as a sequence of UTF-16 code units. This encoding worked fine in most cases, until growing internationalization and the arrival of emojis, which are used with any language, turned the encoding’s inherent problems into everybody’s problems.

Consider this example:

The image shows the string “Hello” with a waving-hand emoji at the end and its UTF-16 code units. The emoji takes two code units.

In the string “Hello👋”, each user-perceivable character is mapped to a single code unit except the waving hand emoji 👋. An immediate consequence of this mapping is confusion over the length of this string. Will the output of the following line of code be 6 or 7?

print('Hello👋'.length);

To the user, there are clearly 6 characters in this string unless you get philosophical. But the Dart String API will tell you that the length is 7, or to be precise, 7 UTF-16 code units. This difference has all kinds of ramifications, because so many text manipulation tasks involve using character indexes with the String API. For example, "Hello👋"[5] won’t return the 👋 emoji. Instead, it returns a one-code-unit string holding only the first half of the emoji’s surrogate pair, which is not a valid character on its own.

The good news is that Dart has a new package called characters that operates on user-perceivable characters instead of UTF-16 code units. However, you, as a Dart programmer, need to know when to use the characters package. Our research indicates that even experienced Dart programmers can easily miss such problems when reading text manipulation code. In this article, I go over some common scenarios where you need to pay extra attention and consider using the characters package instead of Dart String.
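To make the difference concrete, here is a small sketch that compares the String API with the characters package on the “Hello👋” example. It assumes the characters package has been added as a dependency; the variable name is just for illustration.

```dart
import 'package:characters/characters.dart';

void main() {
  const greeting = 'Hello👋';

  // String.length counts UTF-16 code units; 👋 (U+1F44B) needs two of them.
  print(greeting.length); // 7

  // Indexing yields a one-code-unit string: here the lone high surrogate.
  print(greeting[5].codeUnitAt(0).toRadixString(16)); // d83d

  // Characters works on user-perceived characters instead.
  print(greeting.characters.length); // 6
  print(greeting.characters.last); // 👋
}
```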

Scenarios to watch out for

In this section, I’ll go over a few common text manipulation scenarios, explain why using Dart’s String API could cause problems in these scenarios, and show how to use the characters package for more reliable results. The use cases below generally assume that we’re dealing with strings entered by human users, which could include emojis or characters in a language the app developer doesn’t expect.

Scenario 1: Counting characters in a string

Suppose you’re writing a function that checks if the text entered by the user has exceeded a specific number of characters. The function returns a positive number of remaining characters if the limit hasn’t been reached, or a negative number of extra characters if the limit has been exceeded.

This is pretty straightforward to do using the String API:
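A minimal sketch of such a function might look like the following. The name remainingCharacters and its parameters are assumptions made for illustration, not the original snippet.

```dart
/// Returns how many characters are left before [limit] is reached,
/// or a negative number when [text] is already over the limit.
/// Hypothetical helper: it counts UTF-16 code units via String.length.
int remainingCharacters(String text, int limit) => limit - text.length;

void main() {
  print(remainingCharacters('Hello', 140)); // 135
  print(remainingCharacters('x' * 150, 140)); // -10
}
```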

However, the following test reveals the problem with this code:
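One hypothetical input that produces this kind of mismatch is a short message ending in an emoji. The limit of 50 and the sample text are assumptions, and the length-based calculation is inlined so the sketch stands alone.

```dart
void main() {
  const limit = 50; // hypothetical limit
  const input = 'Hi👋'; // three user-perceived characters
  const expected = 47;

  // The same calculation a String.length-based check would perform.
  final actual = limit - input.length;

  print('expected: $expected, actual: $actual'); // expected: 47, actual: 46
}
```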

Here are the testing results:

Expected: <47>
  Actual: <46>

We can rewrite this function using the characters package, which provides a convenient extension method on String, to produce the correct number of characters as follows:
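A sketch of that rewrite, again using the hypothetical name remainingCharacters, could be as small as swapping text.length for text.characters.length:

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: counts user-perceived characters through the
/// String.characters extension instead of UTF-16 code units.
int remainingCharacters(String text, int limit) =>
    limit - text.characters.length;

void main() {
  print(remainingCharacters('Hi👋', 50)); // 47
}
```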

Scenario 2: Extracting a substring

In this scenario, we want to implement a function that deletes the last character from a string and returns the result as a new string. Let’s assume this string comes from user input.

This function is easy to implement using the substring method on String as follows:
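One way such a function could be written is shown below. The name deleteLastCharacter is an assumption, and the empty-string guard is added here only to keep the sketch safe to run.

```dart
/// Hypothetical helper: drops the last UTF-16 code unit of [text].
String deleteLastCharacter(String text) =>
    text.isEmpty ? text : text.substring(0, text.length - 1);

void main() {
  print(deleteLastCharacter('Hello')); // Hell
}
```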

However, a good emoji test can quickly break the code:
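As an illustration, consider a string ending in a flag emoji, which is made of two regional-indicator code points and four code units. The specific flag is an assumption, and the substring call is inlined so the sketch stands alone.

```dart
void main() {
  // 'Hi ' followed by the flag 🇩🇰 (U+1F1E9 U+1F1F0), written with escapes.
  const input = 'Hi \u{1F1E9}\u{1F1F0}';
  const expected = 'Hi ';

  // Removing one code unit cuts the second regional indicator in half.
  final actual = input.substring(0, input.length - 1);

  print(actual == expected); // false
  print(actual.length); // 6
}
```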

Here are the testing results:

Expected: ‘Hi ’
  Actual: ‘Hi 🇩???’
    Which: is different. Both strings start the same, but the actual value also has the following trailing characters: 🇩???

The characters package can handle this case with ease, as it provides high-level methods such as skipLast(int count). We can rewrite this snippet into the following code:
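A sketch of that rewrite, keeping the hypothetical name deleteLastCharacter, might look like this:

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: drops the last user-perceived character of [text].
String deleteLastCharacter(String text) =>
    text.characters.skipLast(1).toString();

void main() {
  // 'Hi ' followed by the flag 🇩🇰 (U+1F1E9 U+1F1F0).
  const input = 'Hi \u{1F1E9}\u{1F1F0}';
  print(deleteLastCharacter(input) == 'Hi '); // true
  print(deleteLastCharacter('Hello')); // Hell
}
```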

Scenario 3: Splitting a string on an emoji

In the third scenario, we want to split a string on a given emoji. Here is a function doing that using the split method on String:
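Such a function can be a thin wrapper around String.split. The name splitOnEmoji below is an assumption used for illustration.

```dart
/// Hypothetical helper: splits [text] wherever the code units of [emoji] occur.
List<String> splitOnEmoji(String text, String emoji) => text.split(emoji);

void main() {
  print(splitOnEmoji('a🙂b🙂c', '🙂')); // [a, b, c]
}
```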

Would it work? It probably will work just fine 99% of the time, but the test below illustrates an example where the above code produces rather surprising results.
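A test along these lines can be sketched with an input in which a family emoji sits right before the 👧 separator. The input is reconstructed from the results reported below and is therefore an assumption; escapes are used so the zero-width joiners stay visible.

```dart
void main() {
  // 👨‍👩‍👧‍👦 spelled out: man, woman, girl, boy joined by U+200D (ZWJ).
  const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
  const girl = '\u{1F467}'; // 👧

  final input = 'abc$family${girl}abc${girl}abc${girl}abc';

  // String.split also matches the 👧 hidden inside the family emoji.
  final parts = input.split(girl);

  print(parts.length); // 5
  print(parts.first == 'abc$family'); // false
}
```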

Here are the testing results:

Expected: ['abc👨‍👩‍👧‍👦', 'abc', 'abc', 'abc']
  Actual: ['abc👨‍👩‍','‍👦', 'abc', 'abc', 'abc']
    Which: was 'abc👨‍👩‍' instead of 'abc👨‍👩‍👧‍👦' at location [0]

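So, why was 👨‍👩‍👧‍👦 broken into two pieces, 👨‍👩 and 👦, when the string was split? It’s because 👨‍👩‍👧‍👦 is actually a sequence of four different emojis, 👨, 👩, 👧, and 👦, glued together by invisible zero-width joiner (U+200D) characters. String.split matched the 👧 inside that sequence, so “abc👨‍👩‍👧‍👦” got separated into two parts: “abc👨‍👩‍” and “‍👦”.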

You can avoid this issue by using the split method on the Characters class, as the following code shows:
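A sketch of that approach, keeping the hypothetical name splitOnEmoji, converts both the text and the separator to Characters and maps the parts back to strings:

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: splits [text] on [emoji], matching only whole
/// user-perceived characters.
List<String> splitOnEmoji(String text, String emoji) => text.characters
    .split(emoji.characters)
    .map((part) => part.toString())
    .toList();

void main() {
  // 👨‍👩‍👧‍👦 spelled out: man, woman, girl, boy joined by U+200D (ZWJ).
  const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
  const girl = '\u{1F467}'; // 👧

  final parts = splitOnEmoji('abc$family${girl}abc${girl}abc${girl}abc', girl);

  print(parts.length); // 4
  print(parts.first == 'abc$family'); // true
}
```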

Scenario 4: Accessing a specific character by its index

In text manipulation, it’s common to access a specific character by its index (i.e., position) in the string. For example, the following snippet shows a function that returns initials from the first name and the last name entered by the user in two separate text fields:
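A minimal version of such a function, with the assumed name initials, could look like this. It expects both names to be non-empty, since indexing an empty string throws.

```dart
/// Hypothetical helper: joins the first UTF-16 code unit of each name.
/// Throws a RangeError if either name is empty.
String initials(String firstName, String lastName) =>
    '${firstName[0]}${lastName[0]}';

void main() {
  print(initials('Ada', 'Lovelace')); // AL
}
```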

But as we demonstrated at the beginning of the article, using an index into a UTF-16-based string can be risky. Let’s verify the correctness of the above code with the test case below:
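One input that can trigger this is a first name whose leading “É” is stored in decomposed form, as “E” plus a combining accent. The names below are made up for illustration, and the indexing is inlined so the sketch stands alone.

```dart
void main() {
  // "Élise" with É written as E + U+0301 (combining acute accent).
  const firstName = 'E\u0301lise';
  const lastName = 'Bernard';

  // Index-based initials pick up only the bare "E".
  final actual = '${firstName[0]}${lastName[0]}';

  print(actual); // EB
  print(actual == 'E\u0301B'); // false
}
```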

Here are the test results:

Expected: ‘ÉB’
  Actual: ‘EB’
    Which: is different.

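Why did the test fail? It’s because the letter “É” can be encoded as two code points: the letter “E” followed by a combining acute accent (U+0301). Indexing with [0] returns only the “E” and leaves the accent behind. You can use the characters package to easily avoid this problem: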
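A sketch of the fix, keeping the assumed name initials, reads the first user-perceived character of each name through characters.first:

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: joins the first user-perceived character of each name.
/// Throws a StateError if either name is empty.
String initials(String firstName, String lastName) =>
    '${firstName.characters.first}${lastName.characters.first}';

void main() {
  // "Élise" with É written as E + U+0301 (combining acute accent).
  print(initials('E\u0301lise', 'Bernard') == 'E\u0301B'); // true
  print(initials('Ada', 'Lovelace')); // AL
}
```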

Exercise: Omitting text overflow

Now, here’s a challenge for you. In this scenario, the app needs to display a list of messages, one per line. You’re asked to review code that implements a function that displays text overflow as an ellipsis when the message’s length exceeds the given character limit.
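The code under review might look something like the following sketch. The function name textOverflowEllipsis and the three-dot ellipsis are assumptions made for illustration.

```dart
/// Hypothetical function under review: truncates [text] to [limit] and
/// appends an ellipsis when the text is longer than the limit.
String textOverflowEllipsis(String text, int limit) {
  if (text.length <= limit) return text;
  return '${text.substring(0, limit)}...';
}

void main() {
  print(textOverflowEllipsis('Hi', 12)); // Hi
  print(textOverflowEllipsis('Good morning, everyone', 12)); // Good morning...
}
```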

Can you come up with a test to reveal a potential issue with this code snippet? How would you rewrite it using the characters package? The answer is at the end of this article.

Mitigations and possible long-term solution

It’s unreasonable to expect Dart users to stay on high alert for the kinds of pitfalls described above. For example, in an experiment we conducted, 53.7% of Dart users were unable to detect the problem illustrated in the first scenario (counting characters), even though they received two pages of information about the characters package and the problem the package was designed to address just a few minutes before. Therefore, we are taking a two-stage approach to helping developers choose the most appropriate API for their text manipulation needs.

In the short term, we are introducing a set of mitigations in the Flutter framework and the Dart analyzer to make the characters package easier to discover and invoke in Dart UI programming. This involves a few steps:

1. Use the characters package in the internal implementation of the TextField widget. See this PR and this design doc for more details.
2. Expose the API of the characters package through the Flutter framework. Once this is done, Flutter users will have a higher chance of discovering the API through the extension method String.characters, which will show up when doing an autocomplete on String. The status of this work is tracked in this issue: https://github.com/flutter/flutter/issues/55593.
3. Update the Flutter framework’s API documentation and sample code to suggest using the Characters class when applicable, such as in the callback for TextField.onChanged. This work is tracked in https://github.com/flutter/flutter/issues/55598 with relevant details in this doc.
4. Have the Dart analyzer suggest converting a String object to a Characters object when autocompleting a callback template for handling user-entered text. For example, the IDE could fill out everything in the snippet below after the user autocompletes on onChanged. This work is tracked in https://github.com/dart-lang/sdk/issues/41677.
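As a rough illustration of the kind of callback body such a template could produce, here is a plain-Dart sketch. The onChanged function is a hypothetical stand-in for the callback passed to TextField.onChanged, and the exact template the analyzer would generate is an assumption.

```dart
import 'package:characters/characters.dart';

// Hypothetical stand-in for a callback passed to TextField.onChanged.
void onChanged(String text) {
  // Convert to Characters first, then work with user-perceived characters.
  final Characters characters = text.characters;
  print('${characters.length} characters entered');
}

void main() {
  onChanged('Hello👋'); // 6 characters entered
}
```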

Those mitigations can help, but they are limited to string manipulations performed in the context of a Flutter project. We need to carefully measure their effectiveness after they become available. A more complete solution at the Dart language level will likely require migration of at least some existing code, although a few options (for example, static extension types) might make breaking changes manageable. More technical investigation is needed to fully understand the trade-offs.

How you can help

Please help us raise awareness of how to fix string issues using the characters package:

- Look for instances of using String.length or String.substring in your own code. If the string might have originated from user input, try to rewrite the code using the characters package.
- Share this post with others in the Dart community.
- Try to update existing answers about Dart text manipulation on Stack Overflow. If the accepted answers missed this limitation of the String API, remind people of the risk.
- Comment on the GitHub issues listed above to let us know your thoughts and opinions.

Now, happy coding 😉!

Acknowledgments

Thanks to Kathy Walrath, Lasse Nielsen, and Michael Thomson for reviewing this article. I would also like to thank developers who participated in our user research. Their participation helped the Dart and Flutter teams better understand the challenge of dealing with this limitation of the Dart String API.

— — —

PS: Here is the solution for the exercise:
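One possible solution, sketched under the assumption that the reviewed function is called textOverflowEllipsis and appends three dots, is to compare and truncate by user-perceived characters. A message with an emoji sitting right at the cut-off point is the kind of test input that exposes the String-based version.

```dart
import 'package:characters/characters.dart';

/// Hypothetical rewrite: measures and truncates [text] in user-perceived
/// characters rather than UTF-16 code units.
String textOverflowEllipsis(String text, int limit) {
  final characters = text.characters;
  if (characters.length <= limit) return text;
  return '${characters.take(limit)}...';
}

void main() {
  const message = 'Hi👋 there';

  // A substring-based version would cut 👋 in half at index 3.
  print('${message.substring(0, 3)}...' == 'Hi👋...'); // false

  // The Characters-based version keeps the emoji intact.
  print(textOverflowEllipsis(message, 3) == 'Hi👋...'); // true
}
```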

Written by Tao Dong