Dart string manipulation done right 👉
Like many other programming languages designed before emojis started to dominate our daily communications and before multilingual support became common in commercial apps, Dart represents a string as a sequence of UTF-16 code units. The encoding worked fine in most cases until increased internationalization, along with emojis that are used in every language, turned the encoding’s inherent problems into everybody’s problems.
Consider the string “Hello👋”. Each user-perceivable character in it is mapped to a single code unit, except the waving hand emoji 👋, which takes two. An immediate consequence of this mapping is confusion over the length of the string. Is the length of “Hello👋” 6 or 7?
To the user, there are clearly 6 characters in this string, unless you get philosophical. But the Dart String API will tell you that the length is 7 or, to be precise, 7 UTF-16 code units. This difference has all kinds of ramifications, because so many text manipulation tasks involve using character indexes with the String API. For example, "Hello👋"[5] won’t return the 👋 emoji. Instead, it returns a malformed character: the first of the two code units that encode the emoji.
The good news is that Dart has a new package called characters that operates on user-perceivable characters instead of UTF-16 code units. However, you, as a Dart programmer, need to know when to use the characters package. Our research indicates that even experienced Dart programmers can easily miss such problems when reading text manipulation code. In this article, I go over some common scenarios where you need to pay extra attention and consider using the characters package instead of Dart’s String API.
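To make the difference concrete, here is a minimal sketch that contrasts the two views of “Hello👋”. It assumes the characters package is listed as a dependency in pubspec.yaml; the variable name is only illustrative.

```dart
import 'package:characters/characters.dart';

void main() {
  const greeting = 'Hello👋';

  // String.length counts UTF-16 code units; 👋 (U+1F44B) needs two of them.
  print(greeting.length); // 7

  // Characters.length counts user-perceived characters (grapheme clusters).
  print(greeting.characters.length); // 6

  // Indexing the String yields a lone surrogate, not the emoji.
  print(greeting[5] == '👋'); // false

  // The last grapheme cluster is the whole emoji.
  print(greeting.characters.last == '👋'); // true
}
```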
Scenarios to watch out for
In this section, I’ll go over a few common text manipulation scenarios, explain why using Dart’s String API could cause problems in these scenarios, and show how to use the characters package for more reliable results. The use cases below generally assume that we’re dealing with strings entered by human users, which could include emojis or characters in a language the app developer doesn’t expect.
Scenario 1: Counting characters in a string
Suppose you’re writing a function that checks if the text entered by the user has exceeded a specific number of characters. The function returns a positive number of remaining characters if the limit hasn’t been reached, or a negative number of extra characters if the limit has been exceeded.
This is pretty straightforward to do using the String API:
However, the following test reveals the problem with this code:
Here are the testing results:
We can rewrite this function using the characters package, which provides a convenient extension method on String, to produce the correct number of characters as follows:
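A minimal sketch of both versions of the check, side by side. The function names remainingCodeUnits and remainingCharacters are hypothetical and chosen only for illustration; the second one relies on the characters getter that the package adds to String.

```dart
import 'package:characters/characters.dart';

/// Hypothetical naive helper: counts UTF-16 code units, so emojis and
/// other multi-unit characters use up more of the limit than they should.
int remainingCodeUnits(String text, int limit) => limit - text.length;

/// Hypothetical grapheme-aware helper: counts user-perceived characters.
int remainingCharacters(String text, int limit) =>
    limit - text.characters.length;

void main() {
  const input = 'Hello👋';
  print(remainingCodeUnits(input, 6)); // -1 (the limit looks exceeded)
  print(remainingCharacters(input, 6)); // 0 (exactly at the limit)
}
```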
Scenario 2: Extracting a substring
In this scenario, we want to implement a function that deletes the last character from a string and returns the result as a new string. Let’s assume this string comes from user input.
This function is easy to implement using the substring method on String as follows:
However, a good emoji test can quickly break the code:
Here are the testing results:
Expected: ‘Hi ’
Actual: ‘Hi 🇩???’
Which: is different. Both strings start the same, but the actual value also has the following trailing characters: 🇩???
The characters package can handle this case with ease, as it provides high-level methods such as skipLast(int count). We can rewrite this snippet into the following code:
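As a rough sketch, the two approaches might look like this. The function names are hypothetical, and the flag emoji is written with Unicode escapes (two regional indicator code points, four UTF-16 code units in total) so that its structure is visible.

```dart
import 'package:characters/characters.dart';

/// Hypothetical naive helper: drops the last UTF-16 code unit, which can
/// cut a multi-unit character in half.
String deleteLastCodeUnit(String text) =>
    text.isEmpty ? text : text.substring(0, text.length - 1);

/// Hypothetical grapheme-aware helper: drops the last user-perceived character.
String deleteLastCharacter(String text) =>
    text.characters.skipLast(1).toString();

void main() {
  // 'Hi ' followed by the flag of Denmark (U+1F1E9 U+1F1F0).
  const input = 'Hi \u{1F1E9}\u{1F1F0}';
  print(deleteLastCharacter(input) == 'Hi '); // true
  print(deleteLastCodeUnit(input) == 'Hi '); // false
}
```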
Scenario 3: Splitting a string on an emoji
In the third scenario, we want to split a string on a given emoji. Here is a function doing that using the split method on String:
Would it work? It probably will work just fine 99% of the time, but the test below illustrates an example where the above code produces rather surprising results.
Here are the testing results:
Expected: ['abc👨👩👧👦', 'abc', 'abc', 'abc']
Actual: ['abc👨👩','👦', 'abc', 'abc', 'abc']
Which: was 'abc👨👩' instead of 'abc👨👩👧👦' at location 
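So, why did the family emoji become two emojis (👨👩 and 👦) when the string was split? It’s because the family emoji is actually made of four different emojis, 👨, 👩, 👧, and 👦, glued together by invisible zero-width joiners (U+200D) and rendered as a single character. When the string was split on 👧, the part of the string containing the family emoji got separated into two parts: “abc👨👩” and “👦”.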
You can avoid this issue by using the split method on the Characters class, as the following code shows:
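Here is a minimal sketch of a grapheme-aware split next to the String-based one. The function name splitOnCharacter is hypothetical, and the family emoji is spelled out with escapes to show the zero-width joiners between its four parts.

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: splits [text] on [separator], matching the separator
/// only where it stands as a complete user-perceived character.
List<String> splitOnCharacter(String text, String separator) => text.characters
    .split(separator.characters)
    .map((part) => part.toString())
    .toList();

void main() {
  const girl = '\u{1F467}';
  // Man, woman, girl, boy joined by zero-width joiners (U+200D).
  const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
  const input = 'abc${family}abc${girl}abc';

  // String.split also matches the girl hidden inside the family emoji.
  print(input.split(girl).length); // 3

  // Characters.split leaves the family emoji intact.
  print(splitOnCharacter(input, girl).length); // 2
}
```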
Scenario 4: Accessing a specific character by its index
In text manipulation, it’s common to access a specific character by its index (i.e., position) in the string. For example, the following snippet shows a function that returns initials from the first name and the last name entered by the user in two separate text fields:
But as we demonstrated at the beginning of the article, using the index in a UTF-16-based string could be risky. Let’s verify the correctness of the above code with the test case below:
Here are the test results:
Which: is different.
Why did the test fail? It’s because the letter “É” can be made up of two code points: the letter “E” followed by a combining accent mark. Taking only the first code unit drops the accent. You can use the characters package to easily avoid this problem:
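A minimal sketch of the two variants, with hypothetical function names. Both assume that neither name is empty; the sample first name starts with “E” followed by a combining acute accent (U+0301), written as an escape.

```dart
import 'package:characters/characters.dart';

/// Hypothetical naive helper: takes the first UTF-16 code unit of each name.
String initialsByIndex(String firstName, String lastName) =>
    '${firstName[0]}${lastName[0]}';

/// Hypothetical grapheme-aware helper: takes the first user-perceived
/// character of each name. Assumes both names are non-empty.
String initials(String firstName, String lastName) =>
    '${firstName.characters.first}${lastName.characters.first}';

void main() {
  const firstName = 'E\u0301lise'; // "É" as "E" + combining acute accent
  const lastName = 'Dupont';

  print(initialsByIndex(firstName, lastName) == 'ED'); // true (accent lost)
  print(initials(firstName, lastName) == 'E\u0301D'); // true (accent kept)
}
```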
Exercise: Omitting text overflow
Now, here’s a challenge for you. In this scenario, the app needs to display a list of messages, one per line. You’re asked to review code that implements a function that displays text overflow as an ellipsis when the message’s length exceeds the given character limit.
Can you come up with a test to reveal a potential issue with this code snippet? How would you rewrite it using the characters package? The answer is at the end of this article.
Mitigations and possible long-term solution
It’s unreasonable to expect Dart users to stay on high alert for the kinds of pitfalls described above. For example, in an experiment we conducted, 53.7% of Dart users were unable to detect the problem illustrated in the first scenario (counting characters), even though they received two pages of information about the characters package and the problem the package was designed to address just a few minutes before. Therefore, we are taking a two-stage approach to helping developers choose the most appropriate API for their text manipulation needs.
In the short term, we are introducing a set of mitigations in the Flutter framework and the Dart analyzer to make the characters package easier to discover and invoke in Dart UI programming. This involves a few steps:
- Use the characters package in the internal implementation of the TextField widget. See this PR and this design doc for more details.
- Expose the API of the characters package through the Flutter framework. Once this is done, Flutter users will have a higher chance of discovering the API through the extension method String.characters, which will show up when doing an autocomplete on String. The status of this work is tracked in this issue: https://github.com/flutter/flutter/issues/55593.
- Update the Flutter framework’s API documentation and sample code to suggest using the Characters class when applicable, such as in the callback for TextField.onChanged. This work is tracked in https://github.com/flutter/flutter/issues/55598 with relevant details in this doc.
- Have the Dart analyzer suggest converting a String object to a Characters object when autocompleting a callback template for handling user-entered text. For example, the IDE could fill out everything in the snippet below after the user autocompletes on onChanged. This work is tracked in https://github.com/dart-lang/sdk/issues/41677.
Those mitigations can help, but they are limited to string manipulations performed in the context of a Flutter project. We need to carefully measure their effectiveness after they become available. A more complete solution at the Dart language level will likely require migration of at least some existing code, although a few options (for example, static extension types) might make breaking changes manageable. More technical investigation is needed to fully understand the trade-offs.
How you can help
Please help us raise awareness of how to fix string issues using the characters package:
- Look for instances of using String.substring in your own code. If the string might have originated from user input, try to rewrite the code using the characters package.
- Share this post with others in the Dart community.
- Try to update existing answers about Dart text manipulation on Stack Overflow. If the accepted answers missed this limitation of the String API, remind people of the risk.
- Comment on the GitHub issues listed above to let us know your thoughts and opinions.
Now, happy coding 😉!
Thanks to Kathy Walrath, Lasse Nielsen, and Michael Thomson for reviewing this article. I would also like to thank developers who participated in our user research. Their participation helped the Dart and Flutter teams better understand the challenge of dealing with this limitation of the Dart String API.
— — —
PS: Here is the solution for the exercise:
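One way to approach it, as a minimal sketch: the function name ellipsize and the three-dot suffix are assumptions made for illustration, and the limit is assumed to be non-negative. The key idea is to compare and truncate by user-perceived characters rather than by UTF-16 code units, so that an emoji sitting at the cut-off point is never split in half.

```dart
import 'package:characters/characters.dart';

/// Hypothetical helper: returns [message] unchanged if it fits within
/// [limit] user-perceived characters, otherwise keeps the first [limit]
/// characters and appends an ellipsis. Assumes [limit] is non-negative.
String ellipsize(String message, int limit) {
  final chars = message.characters;
  if (chars.length <= limit) return message;
  return '${chars.take(limit)}...';
}

void main() {
  // Seven user-perceived characters, nine UTF-16 code units.
  const message = 'Hello👋👋';
  print(ellipsize(message, 6)); // Hello👋...
  print(ellipsize('Hi', 6)); // Hi
}
```

A test that reveals the issue in a String-based version places an emoji right at the limit: cutting with substring at index 6 would keep only the first half of the first 👋.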