Hong Kong Transformer Models And Other NLP Resources

出嚟食飯
Jul 8, 2020


I got access to the TensorFlow Research Cloud for a month, so I spent most of that time training transformer models on Hongkongese data. This post summarizes the models and other NLP resources I have created so far.

ELECTRA and XLNet Models

ELECTRA is a transformer model trained with a generator/discriminator pair. It is very efficient because the discriminator learns from every input token at once, not just the masked positions. I trained all three sizes (small, base, and large) and have made them available.

XLNet is another transformer model, based on Transformer-XL and trained to understand bidirectional context. The special thing about it is its ability to generate text, and I thought it might be fun to play with a model like that.
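
For fun, here is a minimal generation sketch using the Transformers library. The checkpoint name is a placeholder for illustration, so substitute the actual name from the repo:

```python
# Minimal text generation sketch with an XLNet model via Transformers.
# The checkpoint name below is a placeholder -- check the repo for the real one.
from transformers import AutoTokenizer, XLNetLMHeadModel

model_name = "toastynews/xlnet-hongkongese-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = XLNetLMHeadModel.from_pretrained(model_name)

inputs = tokenizer("香港係", return_tensors="pt")  # "Hong Kong is..."
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```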

This research is supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC). I trained all models on TPUv3s using data exclusively from Hong Kong. The goal is to create models that understand Hongkongese / Cantonese / Yue. Details about training and evaluation results are available on GitHub.

The trained models are shared through the Huggingface Transformers framework. I use Transformers at work and think it’s the easiest way to get started on any NLP task. There are several Chinese models to fall back on if these don’t work well for you.

Consult the Transformers documentation on how to load them.
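
As a quick start, loading with the AutoClasses looks roughly like this; the checkpoint name is my assumption about the naming, so check the repo or model hub for the exact names:

```python
# Sketch of loading one of the ELECTRA models and encoding a sentence.
# The checkpoint name is an assumption -- verify it on the model hub.
from transformers import AutoTokenizer, AutoModel

model_name = "toastynews/electra-hongkongese-base-discriminator"  # assumed name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("出嚟食飯", return_tensors="pt")
last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
print(last_hidden.shape)
```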

Hongkongese Evaluation Datasets

Most Chinese evaluation datasets out there are in Simplified Chinese, with just a few in Traditional Chinese. To evaluate the models on Hongkongese, I previously created two datasets. A discussion of these datasets can be found in my previous post.

  • OpenRice Sentiment Analysis — classify a restaurant review as a cry face, okay face, or smile face (a minimal fine-tuning sketch follows this list)
  • LIHKG Categorization — categorize a thread based on its first post; this requires broad general knowledge, because common sense things are often left unexplained
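
For reference, here is a minimal fine-tuning sketch for the OpenRice task using the Transformers Trainer; the file names and column layout are assumptions, so adapt them to the released dataset files:

```python
# Sketch: fine-tune a pretrained model on 3-class restaurant review sentiment.
# File names and the "text"/"label" column layout are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "toastynews/electra-hongkongese-base-discriminator"  # assumed name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Assumed CSV layout: "text" holds the review, "label" is 0/1/2 (cry/okay/smile).
dataset = load_dataset("csv", data_files={"train": "openrice_train.csv",
                                          "test": "openrice_test.csv"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="openrice-sentiment", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```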

Words HK Semantic Similarity Dataset

This time, I created another dataset based on dictionary definitions from words.hk. The task is set up as binary classification: infer whether a word and a dictionary definition match. Each word and each definition is also paired with incorrect counterparts, making the task pseudo-multiple-choice.

Example entry in the words.hk dataset.
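
The pairing scheme can be pictured with a short sketch like this (illustrative only; the released dataset already contains the pairs):

```python
# Illustrative sketch of building binary word/definition pairs: each correct
# (word, definition) pair gets a negative partner sampled from another entry.
import random

def build_pairs(entries):
    """entries: list of (word, definition) tuples from the dictionary."""
    pairs = []
    for i, (word, definition) in enumerate(entries):
        pairs.append((word, definition, 1))  # label 1: correct pairing
        j = random.choice([k for k in range(len(entries)) if k != i])
        pairs.append((word, entries[j][1], 0))  # label 0: mismatched definition
    return pairs
```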

Words.hk is a better source than Wiktionary because its definitions are written by Hongkongers to reflect Hongkongese usage. Words in the dictionary are not limited to vernacular or slang, so it is well suited to testing the general language knowledge of models.

Special filtering ensures that no character of the word appears in the definition, because a shared character is a strong hint that the pair matches. There is no context beyond the word itself, which makes this essentially an intrinsic evaluation of a model’s knowledge of Hongkongese semantics. The dataset is now available on GitHub.
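
The filter itself is simple; here is a sketch of the check described above:

```python
# Keep a pair only if no character of the word appears in the definition.
def no_character_overlap(word, definition):
    return not any(ch in definition for ch in word)

assert no_character_overlap("食飯", "進餐")      # kept: no shared characters
assert not no_character_overlap("食飯", "食嘢")  # dropped: shares 食
```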

FastText Vectors

Lastly, I updated the FastText vectors using the same training corpus as the transformer models. I first shared FastText vectors more than a year ago, before the anti-extradition protests started, and a lot has changed since then. The GitHub repo has been updated with a link to the latest vectors.

Word vectors may be old-fashioned, but I still use them to cluster news articles for Toasty News. They can also be used to initialize the embeddings of a neural network to improve performance.
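
For example, here is a sketch of loading the vectors with gensim and using them to initialize a PyTorch embedding layer; the vector file name is a placeholder:

```python
# Sketch: initialize an nn.Embedding from pretrained FastText vectors.
# The .bin file name is a placeholder -- use the path from the GitHub repo.
import numpy as np
import torch
import torch.nn as nn
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("hongkongese_fasttext.bin")  # placeholder path

vocab = ["香港", "新聞", "飲茶"]  # your task's vocabulary
matrix = np.stack([vectors[w] for w in vocab])  # FastText covers OOV via subwords

embedding = nn.Embedding.from_pretrained(
    torch.tensor(matrix, dtype=torch.float),
    freeze=False)  # allow fine-tuning during training
```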

Summary

In summary, I now have the following, all based on Hong Kong data:

  • state-of-the-art transformer models
  • different kinds of evaluation tasks
  • pretrained word embeddings

I hope these resources help you kick-start your NLP projects related to Hong Kong.
