NLP — Tokenizing Chinese Phrases

Before running any NLP algorithm, you need to tokenize the words in a sentence or document so the text can be analyzed. With TF-IDF, for example, you first have to extract the words from each article before you can measure how important each word is to that article. Since many NLP ideas originated in the United States, it is natural that my NLP training is English-based. That means my tokenization skills work for English text, and perhaps for other Romance or Germanic languages written in the Latin alphabet. I come from Hong Kong and speak Cantonese, so I am interested in applying these techniques to Chinese articles.

The nature of the Chinese Language
One of the differences between Chinese and English is that Chinese is written in characters. A Chinese character represents one or more meanings of an idea, but it usually does not tell you how to pronounce it. In other words, Chinese characters store ideas while English words store pronunciation.

There are two types of Chinese characters in use today: traditional and simplified. Traditional Chinese characters took roughly their present written form around the 5th century in China and were used there until after the establishment of the People’s Republic of China. Mainland China adopted simplified characters in the 1950s and 60s, while Taiwan and Hong Kong did not and still use traditional characters. The problem with simplified characters is that some of them correspond to two or more traditional characters. For example, the name of a Hong Kong news commentator is 陶傑 in traditional characters and 陶杰 in simplified characters. However, the simplified character 杰 could stand for either 傑 or 杰 in traditional characters. Therefore, if you try to convert 陶杰 to traditional characters, you may convert it incorrectly, because you cannot tell whether the right form is 陶傑 or 陶杰.
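The ambiguity can be sketched in a few lines of Python. The table below is hand-written and covers only the two characters from this example (an assumption for illustration, not a real conversion dictionary). It lists every traditional candidate for each simplified character and enumerates the possible results.

```python
from itertools import product

# Hypothetical, hand-written table: simplified character -> possible
# traditional characters. Only the characters from the example are listed.
CANDIDATES = {
    "陶": ["陶"],
    "杰": ["傑", "杰"],
}

def traditional_candidates(simplified):
    """Return every traditional spelling the simplified string could map to."""
    # Characters missing from the table are assumed to map to themselves.
    options = [CANDIDATES.get(ch, [ch]) for ch in simplified]
    return ["".join(combo) for combo in product(*options)]

print(traditional_candidates("陶杰"))
```

The function returns two candidates for 陶杰, and nothing in the simplified text itself says which one is correct.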

The major difference between Chinese and English for NLP is the structure of the writing. Chinese has no whitespace between phrases, so simply splitting on whitespace, as you would in English, does not work. The simplest alternative is to treat every character as a token. That can work, but most likely it will not. I will explain when it can work and what to do when it does not. Note that I run the experiments in traditional Chinese characters to avoid the ambiguity of simplified Chinese.

Tokenizing individual Chinese characters of an ancient Chinese Poem
The reason I said tokenizing individual Chinese characters may work is that each Chinese character represents an idea. This holds for ancient Chinese. If you speak Chinese, you may realize that Chinese phrases are usually made up of two or more characters, but this was not the case in ancient literature. Therefore, tokenizing individual characters may work on ancient poems. Let’s look at “Yearning” from the Tang dynasty:

紅豆生南國,春來發幾枝? 願君多採擷,此物最相思。

The poem in English means:
The red beans are grown in the Southern part of the country,
how many red beans would sprout this year?
You should harvest more.
The harvested red beans have brought the memory of you.

Let’s tokenize this poem. First, remove all the punctuation with a regex. Then split the text into individual characters. In Python, .split() splits on whitespace by default, so it cannot separate Chinese characters that have no whitespace between them. If the poem text is stored in a variable named “texts”, the trick is to use list() to split the string.

tokens = list(texts)
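Put together, the two steps might look like the sketch below. The regex is my assumption about how to strip the punctuation: it keeps only characters in the CJK Unified Ideographs range, which is enough for this poem but would also drop any rarer characters outside that range.

```python
import re

texts = "紅豆生南國,春來發幾枝? 願君多採擷,此物最相思。"

# Keep only CJK Unified Ideographs (U+4E00 to U+9FFF); this drops
# punctuation and whitespace, whether ASCII or full-width.
cleaned = re.sub(r"[^\u4e00-\u9fff]", "", texts)

# With no whitespace left, str.split() returns the whole string as one token.
print(cleaned.split())

# list() yields one token per character instead.
tokens = list(cleaned)
print(len(tokens))
```

The poem has four lines of five characters, so the token list should contain 20 single characters.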

To see whether the tokens work, I shuffle the characters and check whether each one has a proper meaning on its own. The result is:
君 — means gentleman, could use as “you” in polite form
此 — means this/that
春 — means spring
國 — means country/nation
生 — means to live/to grow
If I map those 5 characters back to the original poem, each one matches part of its meaning. You can find the meanings of 君, 春, 國, 生 in my English translation. I want to point out that 此 acts as a demonstrative (“this”), so I did not translate it separately. It comes from 此物最相思, which literally means “this thing most think of (a person)”.

This tokenization works on Tang dynasty poems but does not necessarily work on other classical literature. When applying this method, you should double-check that each token actually carries meaning. You can expect it to work less well on later literature, which tends to use more phrases made up of two or more characters.

Tokenizing individual Chinese characters of a Hong Kong news article
If we look at news written in Chinese, you will find that individual character tokenization does not work at all. Here is an example from a news article in 852 Post (the link to the article is at the bottom of the post):


The paragraph means: Regarding the long-term effect of the trade war, Lam Hang-tsz thinks Chinese manufacturers would move to other countries, such as Vietnam, Laos, India, Taiwan, or even Malaysia, to avoid the tariffs. The price inflation caused by the increased tariffs will not last long once the manufacturers have moved out of China. Because manufacturing costs in those countries are lower than in China, by that time the retail price could even be cheaper than that of products made in China.

Individual character tokenization will not work in this passage because:
1. Names are involved
There is one name mentioned: 林行止. If you split it character by character, it literally means “forest, walk, halt”.

2. Phrases are made up of two or more characters
For example:
進口 means import, but character by character it means “enter the mouth”
馬來西亞 means Malaysia, but character by character it means “horse comes to Western Asia”

3. Confusion between phrases
In Chinese, the names of many modern countries follow the format “name-country”. For example, the Chinese name of the “United States of America” means “America country”. A lone character meaning “country” (國) does not tell you which country in the passage it refers to, and there are 2 candidates: China (中國) and Laos (寮國).
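The three problems are easy to reproduce on the phrases quoted above. This is a small sketch that uses only those phrases rather than the full article text.

```python
from collections import Counter

# Phrases taken from the discussion above.
phrases = ["中國", "寮國", "馬來西亞", "進口"]

# Per-character tokenization of each phrase.
for phrase in phrases:
    print(phrase, "->", list(phrase))

# Once split, the tokens from 中國 and 寮國 both contain 國, and nothing
# in the token itself says which country it came from.
char_counts = Counter(ch for phrase in phrases for ch in phrase)
print(char_counts["國"])
```

The count for 國 is 2, one from each country name, which is exactly the confusion described in point 3.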

Using the package jieba to tokenize
We can now agree that individual character tokenization does not work for modern Chinese, because phrases usually consist of two or more characters. I read Ceshine Lee’s Medium post, which suggests 4 ways to tokenize Chinese phrases. Individual character tokenization, which I have just tried, is the first. The second is to use the package “jieba” to tokenize Chinese phrases.

I used jieba to tokenize the same 852 Post article, and the result is not bad. You can see that some phrases are tokenized successfully, like “trade war”, “China”, “factory”, “tariff”, “India”, “Taiwan” (Chinese phrases: 貿易戰, 中國, 工廠, 關稅, 印度, 台灣, respectively). However, some phrases were not captured properly, such as “long term”, “Malaysia”, “Laos” (proper Chinese phrases: 長遠, 馬來西亞, 寮國, respectively). One possible reason for the Laos case is that 寮國 is not the name used in mainland China, so jieba’s model does not recognize it as a phrase. Ceshine mentioned that the available segmentation tools do not have enough training data to support their tokenization models. That may be why jieba’s performance here is disappointing.
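For readers who want to try this, a minimal jieba call could look like the sketch below. jieba.lcut() returns the tokens as a list, and jieba.add_word() registers an extra word in jieba’s in-memory dictionary, which is one possible workaround for phrases such as 寮國 (I did not use it in the experiment above). The sample sentence and the segment() helper are made up for illustration, and the helper falls back to per-character tokens when jieba is not installed.

```python
try:
    import jieba  # third-party package: pip install jieba
except ImportError:
    jieba = None

def segment(text, extra_words=()):
    """Segment text with jieba, registering extra_words first.

    Falls back to per-character tokens when jieba is not installed.
    """
    if jieba is None:
        return list(text)
    for word in extra_words:
        jieba.add_word(word)  # add the word to jieba's in-memory dictionary
    return jieba.lcut(text)   # lcut returns a list of tokens

# Hypothetical sample built from phrases mentioned above, not the article text.
sample = "工廠遷往寮國和馬來西亞"
print(segment(sample, extra_words=["寮國", "馬來西亞", "長遠"]))
```

Adding words by hand only helps for phrases you already know are missing, so it is a patch rather than a fix for the lack of training data.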

How about Cantonese?
Hong Kongers sometimes write in Cantonese, using characters and phrases that mainland Chinese or Taiwanese writers do not use. Let’s take a sentence from a blog on 852 Post as an example:

“Hong Kongers are familiar with Central (A district in Hong Kong). However, it is a very special place for me.”

The original text, in written Cantonese, is below. Let’s call it “Case 1”:


The same sentence translated into standard written Chinese is below. Let’s call it “Case 2”:


Jieba was written by developers from mainland China who, I assume, do not speak Cantonese. I would therefore expect that their training data does not include articles from Hong Kong, and I guess it does not perform well on Cantonese-exclusive phrases.

The Cantonese-exclusive characters in this sentence are 係, 嘅, 嚟, which are equivalent to 是, 的, 來 in standard written Chinese.
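As a rough illustration of that equivalence, the three characters can be swapped with str.translate(). This is only a sketch built from the mapping above. Written Cantonese differs from standard Chinese in vocabulary and grammar as well, so a character swap like this is not a general translation method, and to_standard() is a hypothetical helper name.

```python
# Mapping taken from the three characters listed above.
CANTONESE_TO_STANDARD = str.maketrans({"係": "是", "嘅": "的", "嚟": "來"})

def to_standard(text):
    """Replace the three Cantonese-specific characters with standard ones."""
    return text.translate(CANTONESE_TO_STANDARD)

print(to_standard("係嘅嚟"))
```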

As expected, the results for case 1 and case 2 are not identical. In case 1, there are more single-character tokens that should have been combined with another character into a meaningful phrase. The 3 differences are 來講, 特別, 意義, which only case 2 was able to capture. We can expect jieba to work better on standard Chinese than on written Cantonese. Besides those, jieba was able to capture some core phrases in both cases. I was surprised that jieba captured names like Central (中環) and Hong Kong (香港).
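One simple way to put a number on “more single-character tokens” is the share of tokens that are one character long. The helper below is a sketch with a hypothetical name, and the two token lists are illustrative only, not the actual jieba output for case 1 and case 2.

```python
def single_char_share(tokens):
    """Fraction of tokens that are exactly one character long."""
    if not tokens:
        return 0.0
    return sum(len(tok) == 1 for tok in tokens) / len(tokens)

# Illustrative token lists only, not actual jieba output.
merged = ["特別", "意義"]
split_up = ["特", "別", "意", "義"]
print(single_char_share(merged), single_char_share(split_up))
```

A higher share on the same sentence suggests the segmenter failed to join characters into phrases, though some single-character tokens, such as 的, are legitimate words on their own.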

Ceshine also mentioned another package, the THULAC analyzer, as an alternative to jieba, which is popular but slow. He also suggested two more ways to tokenize Chinese passages. You can find the link to his post at the bottom of this post.

Preprocessing Chinese poses more challenges than preprocessing English in NLP, but there are some useful packages available. Individual character tokenization may work on classical works but performs very poorly on modern Chinese. Jieba is a good package for preprocessing, but it is best not to use it on written Cantonese. The users of the Chinese language are very diverse, and the data used to train jieba’s model does not cover non-standard Chinese well. Knowing Chinese, of course, helps you judge jieba’s performance, and you may also have to make sure your data is written in standard Chinese.

My other thought from the written Cantonese experiment is about technology research in Hong Kong. The atmosphere of innovation in Hong Kong is not the same as in California, as there are not many tech pioneers in Hong Kong exploring the unknown. At least, there is no open source segmentation tool, library of stop words, or Stanford GloVe equivalent available in Hong Kong. If the Hong Kong government really wants to encourage innovation, it should provide resources for universities to offer open source NLP tools. The tool I want most for NLP in Chinese is an equivalent of Stanford GloVe for Chinese (The Chinese University of Hong Kong should work on this).


Article from 852 Post (4th Paragraph)

Blog from 852 Post (1st line)

Documentation on jieba

Ceshine Lee’s medium post, [NLP] Four Ways to Tokenize Chinese Documents



A data engineer in a BI company with Czech heritage, a whisky-lover, an aviation enthusiast, and a gamer. Concern how to use data science to answer questions.

Jacques Sham
