Multilingual support when your Java code is written in English.

Mihaly Nagy
Groupon Product and Engineering
Jul 2, 2019

I recently came across a weird issue in our codebase, one that developers don’t necessarily think about every day. You may not even know you have it until you get a one-star review on the Play Store from a Turkish customer visiting Italy.

This article hopefully raises awareness about cultural differences and how they affect the way we should think about our code when targeting audiences that cross this boundary. We will take a look at the issue itself, then at a brief explanation of why it’s not really a bug but something we should be aware of, and finally at how to make sure you don’t repeat it.

The issue:

I will start with a concrete example. When you support multiple countries, you will most likely have a hard-coded list of countries in your codebase, possibly as two-letter ISO “country codes” in lowercase (the same issue is present for uppercase too). For some logic in your codebase you will check the country by comparing that string (e.g. “it” if you’re in Italy) with something that comes from the server or is input by the user. Let’s say the server gives you the uppercase variant “IT”. You usually get around the mismatch by calling .toLowerCase() or .toUpperCase() before comparing, or by using .equalsIgnoreCase(). This logic looks perfectly fine, until you realise that there’s more going on under the hood. The no-argument versions of toLowerCase() and toUpperCase() use the “default locale” (the one set by the user), and in languages like Turkish the lowercase version of I is not i but ı (dotless), and the uppercase version of i is not I but İ (dotted). This means that on a device set to Turkish, “it”.toUpperCase().equals(“IT”) will not be true, and neither will “IT”.toLowerCase().equals(“it”). The one exception is .equalsIgnoreCase(): it compares character by character without consulting the locale, so “it”.equalsIgnoreCase(“IT”) stays true. The reason why the case conversions break is also in the javadoc of the String class:

Note: This method is locale sensitive, and may produce unexpected results if used for strings that are intended to be interpreted locale independently. Examples are programming language identifiers, protocol keys, and HTML tags. For instance, “TITLE”.toLowerCase() in a Turkish locale returns “t\u0131tle”, where ‘\u0131’ is the LATIN SMALL LETTER DOTLESS I character. To obtain correct results for locale insensitive strings, use toLowerCase(Locale.ENGLISH).

(source: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toLowerCase())
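To see the effect without changing any device settings, here is a minimal sketch that passes the Turkish locale explicitly. The class and method names are made up for illustration; passing the locale by hand is an assumption that stands in for what the no-argument methods do when the user's default locale is Turkish.

```java
import java.util.Locale;

final class TurkishCaseDemo {
    // Explicit Turkish locale, so the demo does not depend on the machine's default.
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Stands in for "s.toUpperCase()" on a device whose default locale is Turkish.
    static String upperAsTurkishDevice(String s) {
        return s.toUpperCase(TURKISH);
    }

    // Stands in for "s.toLowerCase()" on a device whose default locale is Turkish.
    static String lowerAsTurkishDevice(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // false: the result is "\u0130T" (capital I with a dot above)
        System.out.println(upperAsTurkishDevice("it").equals("IT"));
        // false: the result is "\u0131t" (dotless lowercase i)
        System.out.println(lowerAsTurkishDevice("IT").equals("it"));
        // true: an explicit English locale gives the ASCII mapping
        System.out.println("it".toUpperCase(Locale.ENGLISH).equals("IT"));
    }
}
```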

How to solve it:

The solution seems pretty straightforward, and we also have a lint rule to help us out, conveniently called DefaultLocale, that checks our codebase for it:

DefaultLocale
-------------
Summary: Implied default locale in case conversion

Priority: 6 / 10
Severity: Warning
Category: Correctness

Calling String#toLowerCase() or #toUpperCase() without specifying an explicit
locale is a common source of bugs. The reason for that is that those methods
will use the current locale on the user’s device, and even though the code
appears to work correctly when you are developing the app, it will fail in
some locales. For example, in the Turkish locale, the uppercase replacement
for i is not I.

If you want the methods to just perform ASCII replacement, for example to
convert an enum name, call String#toUpperCase(Locale.US) instead. If you
really want to use the current locale, call
String#toUpperCase(Locale.getDefault()) instead.

More information:
http://developer.android.com/reference/java/util/Locale.html#default_locale

So you just put Locale.US or Locale.getDefault(), right?

That would be the short version, but the source of the issue is much deeper.

The root of all evil:

I would argue that the statement in the javadoc is incomplete, or oversimplified for convenience, in particular the phrase “intended to be interpreted locale independently”. My understanding is that the strings that represent the internal state of your app (like an enum’s name), or are part of a standard, or are constants, are in fact encoded as strings in the locale you’re writing your code in (in my case English). This means that when I check for equality or transform from lowercase to uppercase, I have to use that same locale. So the string is not locale independent. It is very much dependent on the locale it was encoded with in the first place.
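As a small sketch of that idea, assume a hypothetical Country enum whose constant names were written in English, like the rest of the code. Looking up a constant from a server value then uses the English locale for the case change, whatever the device is set to. The enum, the class, and the parse method are all made up for this example.

```java
import java.util.Locale;

final class CountryLookup {
    // Hypothetical enum; the constant names are "encoded" in English.
    enum Country { IT, TR, US }

    // Upper-cases with the locale the names were written in, then looks up the constant.
    // Throws IllegalArgumentException for codes that have no matching constant.
    static Country parse(String code) {
        return Country.valueOf(code.trim().toUpperCase(Locale.ENGLISH));
    }
}
```

With the no-argument toUpperCase() on a Turkish device, the input “it” would turn into “İT” and Country.valueOf would throw, because no constant has that name.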

While making the fix for this issue, we encountered the terms “user facing” and “non user facing” strings. That’s also an oversimplification and not entirely true. An example would be the IBAN standard for bank account numbers, where the number starts with the country code. It’s user facing, yet it has to be treated with the English locale, otherwise “it”.toUpperCase() would become İT in the Turkish locale (notice the dot over the I).
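A rough sketch of the IBAN case, assuming the user typed the number in lowercase and with spaces. The helper name is hypothetical and the code does no real IBAN validation; it only shows that the prefix is normalized with the English locale even though the user typed it.

```java
import java.util.Locale;

final class IbanCountry {
    // Returns the two-letter country prefix of a user-typed IBAN, upper-cased.
    // No checksum or length validation is attempted here.
    static String countryCodeOf(String typedIban) {
        String compact = typedIban.replace(" ", "");
        if (compact.length() < 2) {
            throw new IllegalArgumentException("IBAN too short: " + typedIban);
        }
        // The prefix belongs to a standard, so it gets the English locale, not the user's.
        return compact.substring(0, 2).toUpperCase(Locale.ENGLISH);
    }
}
```

For an illustrative input such as “it60 x054 2811 1010 0000 0123 456”, this returns “IT” on any device.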

At first I thought to myself that this is pretty obvious, so why didn’t Java solve it more elegantly? But the more I thought about it, the more I realized that it’s quite normal behavior. You need it to support a language and its case transformations seamlessly.

Making sure you solve the problem for good:

Now that the whole codebase is using toUpperCase(Locale) and toLowerCase(Locale), there’s absolutely nothing that prevents a new team member from introducing another default locale call without analyzing or knowing the issue presented above.

You could add the DefaultLocale lint rule to your project, and it will probably work for most cases, but people will either ignore the violation or skim through the description and just go for Locale.getDefault() without trying to understand the issue first.

Another idea we had was to force people to use a utility class and not these methods, failing the lint task whenever code like this was detected.
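Such a utility class could look roughly like the sketch below. This is not our actual class; the name CaseUtils and its method names are invented for illustration. The point is that each method name forces the caller to decide whether the string is internal or meant for display.

```java
import java.util.Locale;

final class CaseUtils {
    private CaseUtils() {
        // Static helpers only, no instances.
    }

    // For identifiers, enum names, protocol keys, country codes and similar internal strings.
    static String lowerInternal(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    static String upperInternal(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    // For text shown to the user; the caller passes the user's locale explicitly.
    static String lowerForDisplay(String s, Locale userLocale) {
        return s.toLowerCase(userLocale);
    }

    static String upperForDisplay(String s, Locale userLocale) {
        return s.toUpperCase(userLocale);
    }
}
```

A call like CaseUtils.upperInternal("it") gives “IT” everywhere, while CaseUtils.upperForDisplay("it", Locale.forLanguageTag("tr-TR")) gives “İT”, which is what a Turkish reader expects to see.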

I’m still not 100% sure the approaches mentioned are going to be bulletproof.

To Conclude:

Make sure you consider case changes in your app very carefully.

The locale you should use when changing case or comparing strings is dependent on the locale the original string was created in (e.g. English if you write code in English).

Use the default locale if the String is created by the user or it’s intended for display purposes only.

Automate checking for these types of errors with Lint or other code quality tools.
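Besides lint, one more way to automate this is a regression check that runs the sensitive code with the default locale temporarily switched to Turkish. The sketch below is only an assumption of how that might look; withDefaultLocale and isItaly are hypothetical names, and in a real project the body would live inside your test framework of choice.

```java
import java.util.Locale;
import java.util.function.Supplier;

final class LocaleRegression {
    // Runs the body with a temporary default locale and always restores the previous one.
    static <T> T withDefaultLocale(Locale temporary, Supplier<T> body) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(temporary);
        try {
            return body.get();
        } finally {
            Locale.setDefault(previous);
        }
    }

    // Hypothetical code under test: compares a server value against an internal constant.
    static boolean isItaly(String serverCode) {
        return "it".equals(serverCode.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // The fixed code keeps working under a Turkish default locale: prints true.
        System.out.println(withDefaultLocale(turkish, () -> isItaly("IT")));
        // The original bug, reproduced on purpose: prints false.
        System.out.println(withDefaultLocale(turkish, () -> "IT".toLowerCase().equals("it")));
    }
}
```

Keep in mind that Locale.setDefault changes global state for the whole JVM, so a check like this should not run in parallel with tests that depend on the default locale.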

Hopefully this article helps people avoid making the same mistakes we made. I had no clue this could happen before I did some digging. Solving this issue definitely gave me a different perspective about cultural differences and how we should treat them in our codebase.
