The One Thing You Forgot While Internationalizing Your Application

Ken Richards
Sep 8, 2020 · 4 min read

The hidden bugs waiting to be found by your international users

While internationalizing our applications, we focus on the things we can see: text, tool-tips, error messages, and the like. But hidden in our code are places that also need internationalization, and they tend to be missed until our international users find them and report them as bugs.

Here’s a big one: regular expressions. You likely use these handy, flexible programming features to parse text entered by users. If your regular expressions are not internationalized, or more specifically, if they are not written to handle Unicode characters, they will fail in subtle ways.

Here’s an example: imagine a commenting system in your application that allows users to type at-mentions of other users or user groups. People at-mentioned are notified that the comment needs their attention. Your system may have the requirement that the at-mention format is something like:

[Image: example of the at-mention format]

Writing a regular expression to find and parse the usernames out of these strings is the most direct way to handle this. In Java, JavaScript, and other languages, the regular expression might look like this:

[Image: the regular expression]

This expression specifies that we’re looking for an ‘@’ followed by a letter or number, followed by one or more letters, numbers, dashes, underscores, or dots, and ending with a letter or number. The parentheses tell the expression to capture this string and return it to us.
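
The pattern itself was shown as an image, so here is a rough Java sketch of what an expression matching that description might look like. The exact character classes, and the names AsciiMentions and find, are assumptions for illustration rather than the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsciiMentions {
    // '@', then one ASCII letter or digit, then one or more letters, digits,
    // dots, underscores, or dashes, then a final letter or digit.
    // The parentheses capture the username without the leading '@'.
    static final Pattern MENTION =
            Pattern.compile("@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])");

    // Returns every captured username in the order it appears in the text.
    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [john_doe, ops-team.west]
        System.out.println(find("Thanks @john_doe and @ops-team.west, please review."));
    }
}
```

Because the last position must be a letter or digit, the comma after the second mention is left out of the captured name.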

We can test it using the regex101 tester:

[Image: regex101 test results]

So that regex works great! But now let’s test it against some comment text containing Unicode characters:

“This comment mentions @Adriàn, @François, @Noël, @David, and @ひなた”

[Image: regex101 test results for the Unicode comment text]

Unicode characters are not matched, so we either get incomplete usernames or no username at all.
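
To see the same result outside the tester, here is a minimal Java sketch. It assumes an ASCII-only pattern built from the description above (the original was an image), and it writes the non-ASCII letters as escapes so the outcome does not depend on source-file encoding. The à is written as an a followed by the combining grave accent U+0300, which is an assumption about how the sample text was typed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsciiOnUnicode {
    // Assumed ASCII-only pattern, reconstructed from the prose description.
    static final Pattern MENTION =
            Pattern.compile("@([A-Za-z0-9][A-Za-z0-9._-]+[A-Za-z0-9])");

    // The sample comment, with its non-ASCII letters written as escapes.
    static final String COMMENT =
            "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, "
            + "@David, and @\u3072\u306A\u305F";

    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [Adria, Fran, David]: two names are cut short and
        // two are not found at all.
        System.out.println(find(COMMENT));
    }
}
```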

The solution:

“Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead.”

http://www.regular-expressions.info/unicode.html

It would seem incredibly difficult to write a regular expression encompassing the Unicode mission statement quoted above, but it’s fairly straightforward. To match a single letter from any language, we use the \p{L} notation, which matches one Unicode character in the Letter category.

Updating our regex to use this Unicode friendly notation for letters, we get:

[Image: the regular expression using \p{L}]
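
Since the updated pattern was also an image, the following is only a sketch of what the change could look like in Java, with the ASCII letter ranges swapped for \p{L}. The class name LetterMentions and the escaped sample string are assumptions, including the choice to write the à as an a plus the combining mark U+0300.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LetterMentions {
    // Same shape as the ASCII pattern, but \p{L} accepts a letter
    // from any script instead of only A-Z and a-z.
    static final Pattern MENTION =
            Pattern.compile("@([\\p{L}0-9][\\p{L}0-9._-]+[\\p{L}0-9])");

    // The sample comment, with its non-ASCII letters written as escapes.
    static final String COMMENT =
            "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, "
            + "@David, and @\u3072\u306A\u305F";

    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the captured usernames, one per line.
        find(COMMENT).forEach(System.out::println);
    }
}
```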

Let’s try it out in the regex101 tester:

[Image: regex101 test results]

Close! But @Adriàn is not getting fully parsed. In fact, the string returned from the capture group is ‘Adria’, so we’ve got an incomplete username and lost the grave accent over the a. What’s going on?

To understand this, let’s take a look at how single characters rendered on a screen or page are represented in Unicode. The à in this comment is actually two Unicode characters: U+0061 representing the a, and U+0300 representing the grave accent above the a. The grave accent is a combining mark. A character can be followed by any number of combining marks, which are assembled together when rendered.
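
One way to see the two characters for yourself is to decompose the letter and print what comes out. This is a small Java sketch using the standard java.text.Normalizer class. The helper names decompose and isMark are made up for this example.

```java
import java.text.Normalizer;

final class CombiningMarkDemo {
    // Splits text into base characters followed by their combining marks
    // (Unicode normalization form NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // True when the code point is in one of the Unicode mark categories
    // (Mn, Mc, or Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String decomposed = decompose("\u00E0"); // single-character a-grave
        // Prints "U+0061 mark=false" and then "U+0300 mark=true"
        decomposed.codePoints().forEach(cp ->
                System.out.printf("U+%04X mark=%b%n", cp, isMark(cp)));
    }
}
```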

Fortunately, our regex can look for combining marks as well with the \p{M} specifier. This matches on a Unicode character that is a combining mark. Our usernames as defined will never start with a combining mark, but we do need to check for them in the middle and at the end of the strings. The new regex looks like this:

[Image: the regular expression using \p{L} and \p{M}]
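
As a hedged Java sketch of that idea (the original pattern was an image), \p{M} is added to the middle and final character classes but not to the first one. The class name MarkMentions and the escaped sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkMentions {
    // The first character must be a letter or digit. The middle and last
    // positions also accept combining marks via \p{M}.
    static final Pattern MENTION =
            Pattern.compile("@([\\p{L}0-9][\\p{L}\\p{M}0-9._-]+[\\p{L}\\p{M}0-9])");

    // The sample comment; the a-grave is 'a' plus combining mark U+0300.
    static final String COMMENT =
            "This comment mentions @Adria\u0300n, @Fran\u00E7ois, @No\u00EBl, "
            + "@David, and @\u3072\u306A\u305F";

    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the captured usernames, one per line.
        find(COMMENT).forEach(System.out::println);
    }
}
```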

Testing it:

[Image: regex101 test results]

Success!

One detail worth knowing is that some combined characters like the à can also be specified in Unicode with a single character (U+00E0 in this case). But with our regex, it doesn’t matter. If the character has a single representation, the \p{L} specifier matches it. If it is a combination of two characters, \p{L} matches the base letter and \p{M} matches the combining mark.
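
Here is a short Java sketch showing both spellings being matched by an assumed pattern of the shape described above. Note that the two captured strings are still different sequences of characters. The normalized helper is an optional extra step that is not part of the approach described in this article; it is shown only because you may want both spellings to compare equal when looking up a user.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BothForms {
    // Assumed pattern: letters via \p{L}, combining marks via \p{M}.
    static final Pattern MENTION =
            Pattern.compile("@([\\p{L}0-9][\\p{L}\\p{M}0-9._-]+[\\p{L}\\p{M}0-9])");

    // Returns the first captured username, or null when nothing matches.
    static String first(String text) {
        Matcher m = MENTION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Optional step: compose base letters and marks (NFC) so that both
    // spellings of the same name compare equal.
    static String normalized(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String single = first("@Adri\u00E0n");    // one character for a-grave
        String combined = first("@Adria\u0300n"); // 'a' plus combining grave
        System.out.println(single.equals(combined));                         // false
        System.out.println(normalized(single).equals(normalized(combined))); // true
    }
}
```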

As long as we’re internationalizing, let’s deal with the digits as well. Unicode regex handling gives us a safe way to match any representation of the digits 0 through 9 using the \p{Nd} specifier. Using it, we get our final internationalized regular expression for matching and returning usernames in the body of a comment’s text:

[Image: the final internationalized regular expression]
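
Putting the pieces together, a final pattern along these lines might look like the Java sketch below. Since the original was an image, treat the exact character classes as an assumption. The name FinalMentions and the test strings, including the Arabic-Indic digits, are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FinalMentions {
    // \p{L} = any letter, \p{M} = any combining mark,
    // \p{Nd} = any decimal digit (including ASCII 0-9).
    static final Pattern MENTION = Pattern.compile(
            "@([\\p{L}\\p{Nd}][\\p{L}\\p{M}\\p{Nd}._-]+[\\p{L}\\p{M}\\p{Nd}])");

    // Returns every captured username in the order it appears in the text.
    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // The second username ends in the Arabic-Indic digits one, two, three.
        String comment = "Ping @Adria\u0300n and @user\u0661\u0662\u0663 today";
        // Prints both usernames in full, one per line.
        find(comment).forEach(System.out::println);
    }
}
```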

The exact details for handling Unicode in regular expressions can vary from language to language, so be sure to check out the differences for your code. The site regular-expressions.info is an excellent source for regular expression information in all programming languages and is what led me to the solution I described in this article.
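
As one example of such a difference, Java's \w shorthand only covers ASCII by default, and becomes Unicode-aware when the pattern is compiled with the Pattern.UNICODE_CHARACTER_CLASS flag. The sketch below is just an illustration of that Java-specific behavior; the helper name firstWord is hypothetical, and other languages have their own switches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordClassDemo {
    // Returns the first run of \w characters, compiled with or without
    // Java's UNICODE_CHARACTER_CLASS flag.
    static String firstWord(String text, boolean unicodeAware) {
        int flags = unicodeAware ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        Matcher m = Pattern.compile("\\w+", flags).matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String name = "No\u00EBl";
        System.out.println(firstWord(name, false)); // prints No
        System.out.println(firstWord(name, true));  // prints the full name
    }
}
```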

The Startup

Medium's largest active publication, followed by +771K people. Follow to join our community.

Ken Richards

Written by

Software Engineer and Senior Team Lead @ JamaSoftware, Jazz Player, Poker Fanatic
