Chinese Chars for Coders

This week, I started a Mandarin Chinese class. I’m still working on basic phrases, but from research I can clear up questions about Chinese writing, Simplified vs Traditional Chinese, Cantonese, Taiwanese, Unicode, user-agents, and other languages that have adopted the Chinese writing system.

Meaningful, reusable characters ♻️️

Two common signs that you will notice in China are entrance (入口) and exit (出口). You'll see that the first character changes (it represents the action of entering or exiting) while the second character stays the same (a box that also means mouth).

The Mandarin words for entrance and exit are rù-kǒu and chū-kǒu respectively, so the spoken words and written forms match. I'm writing the pronunciation in Pinyin here, where the accent marks indicate each syllable's tone (rising, falling, and so on).

Once you know some characters, you will see them in other words… for example, my apartment’s water heater has labels 入水 and 出水 for water input and output.

Simplified and Traditional Chinese 🍜

simplified (top) and traditional (bottom) — can you spot the differences?

Simplified versus Traditional Chinese is the most common divide in supporting written Chinese on the web.

Simplified characters were introduced by the People's Republic of China and have since been officially adopted by Singapore and Malaysia, too. This was part of a deliberate effort to standardize written and spoken language across China, so using Simplified is associated first with the PRC and its sphere of influence, rather than being seen as a natural linguistic development.

Due to historical divides, Traditional characters remain in use in Taiwan, Hong Kong, Macau, and US Chinatowns. Someone on Quora talks about them making a comeback in the PRC, too. Just don't mess it up.

Tech Tips

On the web, Chinese user-agents will often signal a preference in the Accept-Language header using a language tag, usually with a region code: zh, zh-CN, zh-SG, zh-HK, or zh-TW. Other sources suggest using the script subtags zh-Hans or zh-Hant to mean simplified or traditional, but I don't believe that this has caught on in user-agents yet.
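To make the tags above concrete, here is a minimal Python sketch that reads an Accept-Language header and guesses whether to serve Simplified or Traditional text. The function name `preferred_script`, the region sets, and the choice to treat a bare `zh` as Simplified are my own assumptions for illustration, not a standard; adjust them for your audience.

```python
# Hypothetical helper: guess "Hans" (Simplified) or "Hant" (Traditional)
# from an Accept-Language header. The region sets are assumptions based on
# where each script is commonly used.
TRADITIONAL_REGIONS = {"TW", "HK", "MO"}
SIMPLIFIED_REGIONS = {"CN", "SG", "MY"}


def preferred_script(accept_language, default="Hans"):
    """Return "Hans" or "Hant" for the highest-ranked Chinese tag, else None."""
    entries = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # a tag without q= counts as q=1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by quality (highest first), then by original order.
        entries.append((-quality, position, tag))

    for _, _, tag in sorted(entries):
        subtags = tag.replace("_", "-").split("-")
        if subtags[0].lower() != "zh":
            continue
        rest = {s.lower() for s in subtags[1:]}
        # An explicit script subtag wins over any region.
        if "hant" in rest:
            return "Hant"
        if "hans" in rest:
            return "Hans"
        regions = {s.upper() for s in subtags[1:]}
        if regions & TRADITIONAL_REGIONS:
            return "Hant"
        if regions & SIMPLIFIED_REGIONS:
            return "Hans"
        return default  # bare "zh": assumed default, see lead-in
    return None  # no Chinese tag in the header


print(preferred_script("zh-TW,zh;q=0.9,en;q=0.8"))
print(preferred_script("en-US,en;q=0.9"))
```

Note that this only answers "which script, if we show Chinese at all"; it deliberately ignores whether English or another language ranks higher in the header.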

When I copy 入口 from Google Translate's Traditional Chinese and paste it here on Medium, I see it change to simplified characters because of different font and lang settings. Unfortunately, simplified and traditional cannot be reduced to a font issue. Unicode decided that some characters should share one codepoint, and others should have separate codepoints (more on 丟 vs 丢 later). So keep your translation files separate!
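You can check the codepoint point yourself. This short Python sketch prints the codepoints behind each string; the helper name `codepoints` is just something I made up for the example. It shows that 丟 and 丢 are different characters to a computer, and that Unicode normalization does not turn one into the other, while 入口 is the same pair of codepoints no matter which font or lang setting draws it.

```python
import unicodedata


def codepoints(text):
    """List each character's codepoint in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


traditional = "丟"
simplified = "丢"

print(codepoints(traditional), codepoints(simplified))

# Two separate codepoints: a plain string comparison fails...
print(traditional == simplified)  # False

# ...and normalization does not bridge the gap either.
print(unicodedata.normalize("NFKC", traditional) == simplified)  # False

# 入口 is stored identically for every locale; only the rendering differs.
print(codepoints("入口"))
```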

Cantonese 🇭🇰 🇲🇴

In Hong Kong, Macau, and the surrounding Guangdong province, more people speak Cantonese. There were once many Chinese languages, but now there are only a few strong holdouts. Cantonese is also common in American cities’ Chinatowns, as explained in detail on Wikipedia (NYC, SF).

Differences in the languages carry over into long-form writing, but in the miniature examples which we have learned so far, the signs don't change. You still see 入口 and 出口, even though Cantonese speakers read them as jap6-hau2 and ceot1-hau2. (There is no system for teaching and writing Cantonese words for English speakers that is as standardized as Pinyin, so I copied the Jyutping romanization from Wiktionary.)

When I asked Mandarin-speaking friends from China and Taiwan about learning Cantonese, I was told that it is too difficult, that it has additional tones, that it has limited use and opportunities, and even “I think it is only a spoken language.” This way of thinking leads to pushback from Cantonese-speaking areas, and is a popular subject for Hong Kong in particular:

Macau depends on PRC tourists and investors, and has less interest in politicking to avoid Simplified Chinese. I found an editorial which talks about both sides of the issue, including McDonald's and casinos changing signage since 2012.

Tech Tips

Remember to use Traditional characters in Hong Kong and Macau, and (if possible) Simplified characters for Cantonese-speaking users in Guangdong.
Some Hong Kong users might prefer English and some Macau users might prefer Portuguese, due to the history there.

It appears that zh-HK is the most common suggestion to represent Cantonese on the web / in user-agents, even though it suggests a small regional variant like en-US vs. en-GB. In the article below, the options zh-yue, plain yue, and a tag for Shanghainese are also discussed but admitted to be rare.

There have been rumors since 2015 that Google Translate is working on a Cantonese translation system:

Taiwan 🇹🇼

If you’ve made it this far, you should be able to follow Taiwan’s reasons for using Mandarin Chinese and Traditional characters.

Another official language is Taiwanese Hokkien, which Wikipedia says is “spoken natively by about 70% of the population.” This language came to Taiwan (then Formosa) starting in the 1600s, before the Republic of China.

Tech Tips

Don’t require Taiwanese people to click on a flag of PRC or Hong Kong to see content in their language.

Singapore 🇸🇬

This is a great Wikipedia article about how Singapore decided to follow the PRC in promoting spoken Mandarin and Simplified characters. Originally, Singaporean Chinese mostly spoke Hokkien (similar to Taiwan). In 1980 Mandarin was spoken in 10% of homes; after government initiatives, by 2010 it was spoken in almost 50% of homes (with other households speaking English and other non-Chinese languages).

Despite the official party line, I’ve read that Singapore allows you to fill out a birth certificate with a Chinese name in either Simplified or Traditional characters.

Tech Tips

Be like Singapore. Let people enter their name in the format that they like.

Japanese 🇯🇵

Remember 入口 and 出口? The Japanese write these words with the same characters, but say them in another way (iriguchi and deguchi).

There are other key differences in writing:

  • Japanese uses a much more limited set of kanji (2,136 ‘commonly-used’ characters). They also mix in phonetic characters: hiragana and katakana.
    The words for entrance and exit would be いりぐち and でぐち in hiragana. Katakana is used for foreign words and names, e.g. Nicholas: ニコラス
  • Spoken Japanese and its grammar are closer to Korean and Mongolian than to Chinese (though the exact family relationship is debated). Writing came later, so Chinese characters are used for Japanese words, but a written Japanese sentence would not make sense as, or have the same word order as, a written Chinese sentence.
  • Many Japanese characters appear differently from Simplified or Traditional Chinese, so you need to have the right font or language settings, or in some cases use different Unicode codepoints.
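The mix of kanji, hiragana, and katakana described in the list above is easy to see in code. Here is a rough Python sketch that labels characters by looking at their Unicode character names; the function names are hypothetical, and checking name prefixes is a shortcut I chose for brevity rather than a complete script detector.

```python
import unicodedata


def script_of(ch):
    """Roughly classify one character by its Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    return "other"


def scripts_in(text):
    """Return the set of script labels that appear in the text."""
    return {script_of(ch) for ch in text}


def looks_japanese(text):
    """Kana is a strong hint that the text is Japanese rather than Chinese."""
    return bool(scripts_in(text) & {"hiragana", "katakana"})


print(scripts_in("いりぐち"))
print(scripts_in("ニコラス"))
print(looks_japanese("入口"))  # False: kanji alone cannot tell you the language
```

The last line is the important caveat: a string of only Chinese characters, like 入口, gives no hint about the language, so you still need a lang setting from somewhere else.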

Tech Tips

Use the right font and lang="ja" or lang="ja-JP" tags to make sure you're writing Japanese, not Chinese.
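If your pages mix languages, one option is to tag each snippet as you render it. This is a small Python sketch of that idea; `wrap_with_lang` is a made-up helper, and a real template engine would normally handle the escaping for you.

```python
from html import escape


def wrap_with_lang(text, lang):
    """Wrap text in a <span> with a lang attribute so the browser can pick
    glyphs for that language. Both values are escaped for safety."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


# Same codepoints, different language tags, so the browser may draw them differently.
print(wrap_with_lang("入口", "ja"))
print(wrap_with_lang("入口", "zh-TW"))
```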

I’ve heard that Japanese people overwhelmingly prefer to use online pseudonyms and not ‘real’ names, but I know very little about this.

Unicode 💩

When Chinese, Japanese, and Korean characters were added to Unicode, the standard was designed around 16-bit codepoints (65,536 in total), with no plan for additional bytes to support more languages (like emoji!). Including multiple copies of each Chinese character was unacceptable, so experts narrowed the first block down to about 21,000 characters. If you want to read more about this process and Japanese opposition to Unicode, read up on Han Unification.

one Unicode codepoint in Traditional, Simplified, Korean, Vietnamese, and Japanese contexts

My personal conspiracy theory is that emoji was promoted in Japan and allowed into Unicode to persuade Japan to drop other encodings and accept Unicode for interoperability.
Anyway, there are now over 80,000 catalogued characters, including some consolidated characters, some differentiated ones (丟 and 丢), and 5,762 rare and historical characters which were added in 2015 (74 page PDF warning). There's continuing research in this space, so expect more to be added. If you do need to support these in your system, choose a font carefully and make sure your database can store 4-byte UTF-8 characters (the utf8mb4 character set in MySQL)… same as emoji. 😃
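The reason rare characters and emoji share this requirement is that both live above U+FFFF, and UTF-8 needs four bytes for anything up there. MySQL's older utf8 character set stores at most three bytes per character, which is why utf8mb4 exists. Here is a small Python sketch for checking a string before you save it; the function names are my own.

```python
def needs_four_byte_utf8(text):
    """True if any character lies above U+FFFF and so takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


def four_byte_characters(text):
    """Return the characters in text that take exactly 4 bytes in UTF-8."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


rare_han = chr(0x20000)  # first character of CJK Unified Ideographs Extension B

print(needs_four_byte_utf8("入口"))      # False: common characters take 3 bytes each
print(needs_four_byte_utf8("😃"))        # True
print(needs_four_byte_utf8(rare_han))    # True
print(four_byte_characters("入口😃"))
```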

For additional Chinese language support advice, including vertical text and webfonts, I recommend Chen Hui Jing’s guides: written and video.