The One Thing You Forgot While Internationalizing Your Application

Ken Richards
Published in The Startup
4 min read, Sep 8, 2020

The hidden bugs waiting to be found by your international users

While internationalizing our applications, we focus on the things we can see: text, tooltips, error messages, and the like. But hidden in our code are places that also require internationalization, and they tend to be missed until our international users find them and report them as bugs.

Here’s a big one: regular expressions. You likely use these handy, flexible programming features to parse text entered by users. If your regular expressions are not internationalized, or more specifically, if they are not written to handle Unicode characters, they will fail in subtle ways.

Here’s an example: imagine a commenting system in your application that allows users to type at-mentions of other users or user groups. People at-mentioned are notified that the comment needs their attention. Your system may have the requirement that the at-mention format is something like:

Writing a regular expression to find and parse the usernames out of these strings is the most direct way to handle this. In Java, JavaScript, and other languages, the regular expression might look like this:

This expression specifies that we’re looking for an ‘@’ followed by a letter or number, followed by one or more letters, numbers, dashes, underscores, or dots, and ending with a letter or number. The parentheses tell the expression to capture this string and return it to us.
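As a rough sketch, an ASCII-only expression matching that description might look like the following in JavaScript. The pattern is reconstructed from the prose above rather than copied from a real system, and the names `ASCII_MENTION` and `findAsciiMentions`, as well as the sample usernames, are hypothetical.

```javascript
// Hypothetical ASCII-only pattern: '@', then a letter or digit, then one or
// more letters, digits, dots, underscores, or dashes, then a letter or digit.
// The parentheses capture the username without the leading '@'.
const ASCII_MENTION = /@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])/g;

// Returns every captured username found in the text.
function findAsciiMentions(text) {
  return Array.from(text.matchAll(ASCII_MENTION), (m) => m[1]);
}

console.log(findAsciiMentions("Thanks @ken.richards and @qa-team_2 for the review"));
// [ 'ken.richards', 'qa-team_2' ]
```

Note that a pattern shaped like this needs at least three characters after the ‘@’ to match.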

We can test it using the regex101 tester:

https://regex101.com/r/gVNS9f/1/

So that regex works great! But now let’s test it against some comment text containing Unicode characters:

“This comment mentions @Adriàn, @François, @Noël, @David, and @ひなた”

https://regex101.com/r/b4ZGY2/2/

Unicode characters are not matched, so we either get incomplete usernames or no username at all.
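To see the failure concretely, here is a small sketch that runs an ASCII-only pattern of the kind described earlier against that comment. The pattern and the function name `asciiMentions` are assumptions for illustration. The sketch also assumes that the à in Adriàn was typed as a plain a followed by a separate accent character, while ç and ë are single characters.

```javascript
// Hypothetical ASCII-only pattern: "letter" here means A-Z and a-z only.
const ASCII_ONLY = /@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])/g;

function asciiMentions(text) {
  return Array.from(text.matchAll(ASCII_ONLY), (m) => m[1]);
}

// \u0300 is a combining grave accent; \u00E7 is ç and \u00EB is ë.
const comment =
  "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, @David, and @ひなた";

console.log(asciiMentions(comment));
// [ 'Adria', 'Fran', 'David' ]
```

Two usernames are truncated at the first non-ASCII character, and @Noël and @ひなた are not found at all.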

The solution:

“Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead.”

http://www.regular-expressions.info/unicode.html

It would seem incredibly difficult to write a regular expression encompassing the Unicode mission statement quoted above, but it’s fairly straightforward. To match a single letter from any language, we use the \p{L} notation, which matches any one Unicode character in the Letter category.

Updating our regex to use this Unicode friendly notation for letters, we get:

Let’s try it out in the regex101 tester:

https://regex101.com/r/b4ZGY2/1

Close! But @Adriàn is not getting fully parsed. In fact, the string returned from the capture group is ‘Adria’, so we’ve got an incomplete username and lost the grave accent over the a. What’s going on?
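The behavior can be reproduced with a sketch like the one below. It assumes a pattern in which the ASCII letter ranges are simply swapped for \p{L}; the names `LETTER_MENTION` and `letterMentions` are hypothetical. In JavaScript, \p{…} is only recognized when the regex carries the `u` flag, while Java accepts it without a flag.

```javascript
// Hypothetical pattern: \p{L} replaces A-Za-z. The 'u' flag is required in
// JavaScript for \p{...} property escapes to be recognized.
const LETTER_MENTION = /@([\p{L}0-9][\p{L}0-9._-]+[\p{L}0-9])/gu;

function letterMentions(text) {
  return Array.from(text.matchAll(LETTER_MENTION), (m) => m[1]);
}

// The à in Adriàn is written as 'a' plus U+0300 (combining grave accent).
const comment =
  "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, @David, and @ひなた";

console.log(letterMentions(comment));
// [ 'Adria', 'François', 'Noël', 'David', 'ひなた' ]
```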

To understand this, let’s take a look at how single characters rendered on a screen or page are represented in Unicode. The à in this comment is actually two Unicode characters (code points): U+0061 representing the a and U+0300 representing the grave accent above the a. The grave accent is a combining mark. A character can be followed by any number of combining marks, which are assembled together when rendered.
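One way to check this for yourself is to list the code points of the string. The helper `codePoints` below is a hypothetical name used only for this sketch.

```javascript
// Lists the code points of a string as U+XXXX labels.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const decomposed = "a\u0300"; // renders as à

console.log(codePoints(decomposed));    // [ 'U+0061', 'U+0300' ]
console.log(/^\p{L}$/u.test("a"));      // true: the base a is a letter
console.log(/^\p{L}$/u.test("\u0300")); // false: the accent is not a letter
```

Because the accent is not in the Letter category, a pattern built only from \p{L} stops matching right before it.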

Fortunately, our regex can look for combining marks as well with the \p{M} specifier. This matches on a Unicode character that is a combining mark. Our usernames as defined will never start with a combining mark, but we do need to check for them in the middle and at the end of the strings. The new regex looks like this:
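A sketch of that change, again with hypothetical names (`MARK_MENTION`, `markMentions`) and a pattern assumed from the description: \p{M} is allowed in the middle and at the end of the username, but not in the first position.

```javascript
// Hypothetical pattern: combining marks (\p{M}) are accepted in the middle
// and final positions, but a username still cannot start with one.
const MARK_MENTION = /@([\p{L}0-9][\p{L}\p{M}0-9._-]+[\p{L}\p{M}0-9])/gu;

function markMentions(text) {
  return Array.from(text.matchAll(MARK_MENTION), (m) => m[1]);
}

const comment =
  "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, @David, and @ひなた";

console.log(markMentions(comment));
// [ 'Adriàn', 'François', 'Noël', 'David', 'ひなた' ]
```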

Testing it:

https://regex101.com/r/uV38Y6/1

Success!

One detail worth knowing is that some combined characters like the à can also be specified in Unicode with a single character (U+00E0 in this case). But with our regex, it doesn’t matter. We’ll match the character if it has a single representation, with the \p{L} specifier, or if it is a combination of two characters, with the \p{M} specifier picking up the combining mark.
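The sketch below checks both representations against a pattern of the kind described above. The names `MENTION` and `firstMention` are hypothetical. It also shows something the regex does not solve for you: the two captured strings are not equal as raw strings, so if usernames are looked up after parsing, normalizing both sides (here with the standard `String.prototype.normalize`) is one possible way to compare them.

```javascript
// Hypothetical pattern accepting letters, combining marks, and ASCII digits.
const MENTION = /@([\p{L}0-9][\p{L}\p{M}0-9._-]+[\p{L}\p{M}0-9])/u;

// Returns the first captured username, or null if there is none.
function firstMention(text) {
  const m = MENTION.exec(text);
  return m ? m[1] : null;
}

const single = firstMention("ping @Adri\u00E0n");   // à as one character
const combined = firstMention("ping @Adria\u0300n"); // a + combining accent

console.log(single.length, combined.length); // 6 7
console.log(single === combined);            // false
console.log(single.normalize("NFC") === combined.normalize("NFC")); // true
```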

As long as we’re internationalizing, let’s deal with the digits as well. Unicode regex handling gives us a safe way to match any representation of the digits 0 through 9 using the \p{Nd} specifier. Using it, we get our final internationalized regular expression for matching and returning usernames in the body of a comment’s text:
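Putting the pieces together, a fully internationalized pattern might look like the sketch below. As before, this is an assumption built from the description in this article, and `INTL_MENTION` and `findMentions` are hypothetical names. The second sample username ends in Arabic-Indic digits (U+0664 and U+0662) to show \p{Nd} at work.

```javascript
// Hypothetical final pattern: \p{L} for letters, \p{M} for combining marks,
// and \p{Nd} for decimal digits in any script.
const INTL_MENTION = /@([\p{L}\p{Nd}][\p{L}\p{M}\p{Nd}._-]+[\p{L}\p{M}\p{Nd}])/gu;

function findMentions(text) {
  return Array.from(text.matchAll(INTL_MENTION), (m) => m[1]);
}

// \u0664\u0662 are the Arabic-Indic digits four and two.
console.log(findMentions("cc @dev_42 and @team\u0664\u0662"));
// [ 'dev_42', 'team٤٢' ]
```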

The exact details for handling Unicode in regular expressions can vary from language to language, so be sure to check out the differences for your code. The site regular-expressions.info is an excellent source for regular expression information in all programming languages and is what led me to the solution I described in this article.

Ken Richards

Software Engineer and Senior Team Lead @ JamaSoftware, Jazz Player, Poker Fanatic