Localization Gotchas for Asian Languages (CJK)

When you’re working on an app that you expect users from around the world to use, you can save a lot of time by thinking about localization first. It’s much more difficult to retrofit once the app has already been created.

As someone who is currently implementing full localization on Behance, I can say that the 12 Commandments of Software Localization are a great place to start.

But I have yet to find much in English about localizing for Asian languages that use Chinese characters: specifically Mandarin, Cantonese, Japanese, and Korean. (Yes, Korean fonts also contain Chinese characters, in addition to Hangul.)

Here are 3 fun facts that will make your life more difficult:

  • These 4 languages contain the same Chinese characters.
  • Many characters are drawn differently in each language.
  • Each language’s version of the character shares the same Unicode value.

[Figure: from left to right, the Simplified Chinese, Traditional Chinese, and shared Japanese/Korean forms of U+9AA8 in Source Han Sans.]

[Figure: from left to right, the Simplified Chinese, Traditional Chinese, Japanese, and Korean forms of U+66DC.]

This Unicode FAQ explains the reasons and the history behind this.
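To make the third fact concrete, here is a minimal TypeScript sketch that lists the code points in a string. The helper name `toCodePointLabels` is made up for illustration. The point is that the string carries only code points; nothing in it records which language's glyph shape is wanted.

```typescript
// Hypothetical helper: list the Unicode code points of a string as "U+XXXX" labels.
function toCodePointLabels(text: string): string[] {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) => {
    const hex = ch.codePointAt(0)!.toString(16).toUpperCase();
    return "U+" + hex.padStart(4, "0");
  });
}

// The two characters discussed above, written as escapes.
console.log(toCodePointLabels("\u9AA8\u66DC"));
```

The labels come out the same whether the text was typed by a Japanese, Korean, or Chinese user, which is why the renderer needs extra information to pick a font.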

This means that you need to explicitly tell whatever is rendering the characters which font to use. For example, if you want the characters to look correct in Japanese text, you need to specify a list of Japanese fonts, and likewise for each of the other languages.

So, assuming that your markup declares the user’s language (for example, with a lang attribute set from their locale), you might do something like the following, using the :lang() pseudo-class selector:

:lang(ja) {
  font-family: my-english-font-stack, my-japanese-font-stack;
}

A real Japanese font-stack example might be something like this:

'ヒラギノ角ゴ Pro W3', 'Hiragino Kaku Gothic Pro', 'メイリオ', Meiryo, 'MS Pゴシック', 'MS PGothic'
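The same idea extends to the other languages. The TypeScript sketch below generates one :lang() rule per language from a table of font stacks. Everything in it is illustrative: the function names are hypothetical, and the Korean and Chinese font names are assumptions chosen as examples of commonly installed system fonts, not recommendations from this article.

```typescript
// Illustrative font stacks keyed by language tag. The Chinese and Korean
// names are assumptions for this sketch, not part of the original article.
const cjkFontStacks: Record<string, string[]> = {
  ja: ["Hiragino Kaku Gothic Pro", "Meiryo", "MS PGothic"],
  ko: ["Apple SD Gothic Neo", "Malgun Gothic"],
  "zh-Hans": ["PingFang SC", "Microsoft YaHei"],
  "zh-Hant": ["PingFang TC", "Microsoft JhengHei"],
};

// Quote every family name except lowercase keywords such as sans-serif.
function quoteFamily(name: string): string {
  return /^[a-z-]+$/.test(name) ? name : `"${name}"`;
}

// Build one :lang() rule per language: shared Latin fonts first, then the
// language-specific fonts, then a generic fallback.
function buildLangRules(latinFonts: string[], stacks: Record<string, string[]>): string {
  return Object.entries(stacks)
    .map(([lang, fonts]) => {
      const families = [...latinFonts, ...fonts, "sans-serif"].map(quoteFamily).join(", ");
      return `:lang(${lang}) { font-family: ${families}; }`;
    })
    .join("\n");
}

console.log(buildLangRules(["Helvetica Neue"], cjkFontStacks));
```

Keeping the Latin fonts first mirrors the example above, where the English font stack precedes the Japanese one, so Latin text keeps its usual face and only the CJK glyphs fall through to the language-specific fonts.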

Where this gets tricky is when you are accepting user-created content and the user might enter text in a different language than their locale indicates. If the user types Japanese text, the browser has no way of knowing whether to render the Unicode characters with Japanese, Korean, Mandarin, or Cantonese fonts, unless the user selects the language of the input and that selection is somehow translated into a font choice.
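To see why the text alone is not enough, consider a rough script-sniffing heuristic, sketched here in TypeScript. This is not something the article proposes; the function name and return values are hypothetical, and the ranges cover only the basic Hiragana, Katakana, Hangul Syllables, and CJK Unified Ideographs blocks. Kana or Hangul can hint at Japanese or Korean, but text made up only of Chinese characters stays ambiguous.

```typescript
// Possible outcomes of the heuristic. "han-ambiguous" means the text contains
// Chinese characters but no kana or Hangul, so the language cannot be inferred.
type ScriptGuess = "ja" | "ko" | "han-ambiguous" | "none";

// Hypothetical heuristic: guess a language from the scripts present in the text.
function guessCjkLanguage(text: string): ScriptGuess {
  if (/[\u3040-\u30FF]/.test(text)) return "ja"; // Hiragana or Katakana block
  if (/[\uAC00-\uD7A3]/.test(text)) return "ko"; // Hangul syllables
  if (/[\u4E00-\u9FFF]/.test(text)) return "han-ambiguous"; // CJK Unified Ideographs
  return "none";
}

console.log(guessCjkLanguage("\u9AA8")); // a lone U+9AA8: no kana or Hangul to go on
```

A name, a title, or a short phrase written entirely in Chinese characters falls into the ambiguous case, which is exactly the kind of input described above.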

Solving the problem of a user inputting text in an undeclared language is tricky and probably beyond the scope of this article, but one way to deal with it might be to have user-selectable fonts for user-generated content. By selecting the font for their content, users would be responsible for choosing a Chinese font if the text is Chinese, and so on.

Another way to handle this would be to choose the font of the user-generated content based on the locale or language set by the user. Unfortunately, this approach isn’t perfect. What if the user-generated content is a résumé? Say the user is Japanese, but previously worked for a company in China. They might want those characters to appear in a Chinese font, even though the rest of their résumé would be in Japanese.

In short, with user-generated text, if you are storing only Unicode characters without a corresponding language-specific font, you will not know which language to display the text in.
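One way to act on this, sketched below in TypeScript under assumptions the article does not make, is to store a language tag (rather than a font name) next to each piece of user text and emit it as a lang attribute, so that a :lang() rule can choose the font stack. The `TaggedText` shape and the function names are hypothetical.

```typescript
// Hypothetical storage shape: the text plus the language the user chose for it.
interface TaggedText {
  text: string;
  lang: string; // a language tag such as "ja", "ko", "zh-Hans", or "zh-Hant"
}

// Escape the characters that are unsafe in HTML text and attribute values.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render each segment in its own span so :lang() rules can pick the font.
function renderTagged(segments: TaggedText[]): string {
  return segments
    .map((s) => `<span lang="${escapeHtml(s.lang)}">${escapeHtml(s.text)}</span>`)
    .join("");
}

// The résumé case: the same code point, once in Japanese and once in Chinese.
console.log(
  renderTagged([
    { text: "\u9AA8", lang: "ja" },
    { text: "\u9AA8", lang: "zh-Hans" },
  ])
);
```

Storing a tag per segment, instead of one per document, is what lets a mostly Japanese résumé contain a Chinese company name that renders with Chinese glyphs.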