Traditional Chinese Word Embeddings from Hong Kong Data

出嚟食飯
5 min read · May 30, 2019


Words as Numbers

Word embeddings are distributed representations of words as sets of numbers. They are also called word vectors. The use of word embeddings in deep learning gained huge popularity after the seminal Word2vec paper by Mikolov et al. (2013). The word vectors in that paper were able to demonstrate this kind of relationship:

King - Man + Woman = Queen

Basically, the difference between the vectors for “King” and “Queen” is very close to the difference between “Man” and “Woman”, which intuitively makes sense. The amazing thing is that the vectors were computed purely by a mathematical model over a large text corpus, without any input about the meaning of these words.
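To make this concrete, here is a minimal sketch of the same query using gensim. The model name below is one of gensim's bundled downloads, not something from the paper itself; any word2vec-format model works the same way:

```python
import gensim.downloader as api

# "word2vec-google-news-300" is a pre-trained English model bundled with
# gensim's downloader (a large download on first use).
model = api.load("word2vec-google-news-300")

# king - man + woman: the top hit should be "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```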

After the Word2vec paper, a lot of people came up with other ways to calculate these vectors, like GloVe by Stanford or fastText by Facebook. Since 2018, though, transfer learning models like ELMo and BERT have replaced word embeddings as the state of the art in natural language processing. People may soon be able to simply download entire pre-trained networks and fine-tune them for their specific purposes. For now, however, these models are big and slow, so they are not quite usable for learning purposes. It is still much easier to download word vectors in a file and load them into models.
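Loading a word2vec-format text file takes only a couple of lines with gensim. The file name below matches Facebook's fastText Chinese release; substitute whichever file you downloaded:

```python
from gensim.models import KeyedVectors

# Load vectors from a word2vec-format .txt/.vec file.
vectors = KeyedVectors.load_word2vec_format("cc.zh.300.vec", binary=False)

print(vectors["香港"][:5])                 # first 5 of the 300 dimensions
print(vectors.similarity("香港", "九龍"))  # cosine similarity of two words
```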

Pre-trained Traditional Chinese Word Vectors

It is very easy to create vectors ourselves, as demonstrated in this tutorial, but it takes a lot of resources to create vectors with enough coverage for general usage. Some universities and companies have shared their vectors with the public. These pre-trained vectors are usually in English. There are also some in Simplified Chinese, but it is difficult to find any in Traditional Chinese.

Right now, I know of two sources of pre-trained word vectors in Traditional Chinese:

Facebook Word vectors for 157 languages:

  • fastText CBOW 300 dimensions
  • based on Common Crawl (May 2017) and Wikipedia (September 2017)

University of Oslo NLPL:

  • Word2Vec Continuous Skipgram 100 dimensions
  • based on ChineseT CoNLL17 corpus (March 2017) which includes Common Crawl and Wikipedia

I created my own vectors using data from my website, which I will just call ToastyNews vectors in this post:

Basic stats about the vectors.

Each of these differs in size by an order of magnitude. But is bigger always better? A common way to evaluate vectors is to test the distances between words against some well-known lists and see if they match human expectations. There are standard sets of similarity and analogy questions in English. For Traditional Chinese, Su et al. (2017) translated them and made them available on GitHub. Note that these are Taiwan translations, so they might not work perfectly with Hong Kong data sources.

Similarity Evaluation

Similarity evaluation uses lists of human-tagged word pairs, each with a similarity score, like this:

李白 詩 9.2 (Li Bai / poem: very similar)

蛋白質 文物 0.15 (protein / cultural relic: unrelated)
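The evaluation itself is straightforward with gensim, assuming the pairs are in a tab-separated file like the translated lists from Su et al.'s repository (the file names here are placeholders):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("toastynews.vec", binary=False)

# evaluate_word_pairs reports Pearson, Spearman and the out-of-vocabulary
# ratio against a file of "word1<TAB>word2<TAB>score" lines.
pearson, spearman, oov_ratio = vectors.evaluate_word_pairs(
    "wordsim.tsv", delimiter="\t"
)
print(f"Spearman: {spearman.correlation:.3f}  OOV: {oov_ratio:.1f}%")
```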

Spearman’s correlation against the lists. Higher is better.

Surprisingly, the smallest set, the ToastyNews vectors, did well against the other two. One possible reason is that the Facebook and UiO vectors are both based on Wikipedia and Common Crawl, which might contain lower-quality data compared to ToastyNews, which consists of longer articles.

Analogy Evaluation

Analogy evaluation uses two pairs of words with a defined relationship to see if the vectors also contain these relationships. The questions are quite factual, the kind of thing one could find in Wikipedia.

Capital: 雅典 希臘 (Athens, Greece), 巴格達 伊拉克 (Baghdad, Iraq)

City: 南京 江蘇 (Nanjing, Jiangsu), 福州 福建 (Fuzhou, Fujian)

Family: 爸爸 媽媽 (father, mother), 新郎 新娘 (groom, bride)
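gensim can score these too, assuming the questions are in the standard word2vec questions-words format, where each category starts with a ": section-name" line (file names again placeholders):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("toastynews.vec", binary=False)

# evaluate_word_analogies expects lines of "A B C D", meaning
# A is to B as C is to D, grouped into ": section" blocks.
score, sections = vectors.evaluate_word_analogies("analogy.txt")
print(f"overall accuracy: {score:.3f}")
for section in sections:
    total = len(section["correct"]) + len(section["incorrect"])
    print(section["section"], f"{len(section['correct'])}/{total}")
```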

Accuracy against the analogy list. Higher is better.

As expected, larger corpora do better because they are more likely to contain the information. Both Facebook and UiO use the Chinese (zh) Wikipedia (1.7 GB compressed), which is orders of magnitude larger than the Cantonese (yue) Wikipedia (52.7 MB compressed) used by ToastyNews.

Real Examples

It is still quite unclear what these numbers really mean. Let’s load the vectors up and do some queries to see how they do for my own use cases.

First, similarities:

The most similar word to the example word.

ToastyNews results are generally expansions of abbreviations or words of similar meaning. Facebook and UiO results tend to be closely associated words with different meanings. For example, for 毒男 (slang for a geeky, unpopular man), Facebook and UiO got 北姑 and 港女, which are generally derogatory terms, while ToastyNews got 宅男, which is the less colloquial version of the same word. This is probably because the ToastyNews internal vocabulary is more focused on Hong Kong.
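These neighbour lists come from simple nearest-neighbour queries, along these lines (file name a placeholder):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("toastynews.vec", binary=False)

# Nearest neighbours by cosine similarity.
for word in ["毒男", "六四", "佔中"]:
    print(word, vectors.most_similar(word, topn=3))
```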

Somewhat strange is how both Facebook and UiO have very strong opinions about the political terms 佔中 (Occupy Central), 港獨 (Hong Kong independence) and 左膠 (a pejorative for idealistic leftists). For the term 六四 (June Fourth, the 1989 Tiananmen crackdown), Facebook and UiO look at the word in its historical context while ToastyNews looks at it from the current Hong Kong perspective.

Analogies:

“King - Man = Queen - ?” analogy predictions.

I like how they all agree on the relationship between Standard Chinese and Hong Kongese in 係 - 是 + 嘅 = 的 (the Cantonese and Standard Chinese forms of “is” and of the possessive particle); it must be very obvious statistically. It should be noted that a common problem with these vectors is that they contain outdated information. For example, 梁振英 (CY Leung) and 歐巴馬 have both left their positions for some time already. Also, 歐巴馬 is actually the Taiwanese transliteration of Obama; 奧巴馬 is the Hong Kongese one.
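That analogy is again just vector arithmetic: we ask for the word closest to 嘅 - 係 + 是, along these lines (file name a placeholder):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("toastynews.vec", binary=False)

# 係 - 是 + 嘅 = ?  Rearranged: find the word nearest to (嘅 - 係 + 是).
print(vectors.most_similar(positive=["嘅", "是"], negative=["係"], topn=1))
# expected top hit: 的
```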

Next Steps

So far, we have used intrinsic evaluations, based only on the vectors themselves. The numbers seem to indicate that ToastyNews is good for processing Hong Kong data and Facebook for general data. The next step is to see how they perform in real problems; this is called extrinsic evaluation. Various people have attempted to create standard sets of these evaluations in English, but there are no standards for Chinese. In the next post, I am going to define a sentiment analysis task and try out these vectors to see if they help improve the results.
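As a preview of what that could look like, here is a minimal sketch of an extrinsic setup: averaged word vectors as features for a classifier. The texts, labels and file name are all placeholder assumptions, not results:

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

vectors = KeyedVectors.load_word2vec_format("toastynews.vec", binary=False)

def embed(tokens):
    """Average the vectors of the in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

# Placeholder tokenised reviews with sentiment labels (1 = positive).
texts = [["好", "正"], ["好", "差"], ["超", "好味"], ["麻麻", "哋"]]
labels = [1, 0, 1, 0]

X = np.array([embed(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([embed(["幾", "好"])]))
```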

Technical Resources

Pre-trained vectors:

  • fastText — Chinese (txt is the format used in the notebooks)
  • UiO — ID 35 Word2Vec Continuous Skipgram with ChineseT CoNLL17 corpus
  • ToastyNews — Vectors (txt is the format used in the notebooks)

Notebooks to play with the real examples:

  • Evaluation
