Building a Hongkongese Word Segmenter

出嚟食飯
Mar 12, 2023


In my previous story, I evaluated the performance of several NLP systems on Hong Kong data. One of the top performers was CKIP Transformers. It is an interesting system because it was built by fine-tuning a large language model (LLM) for the word segmentation task. The CKIP models are shared on Hugging Face, and the driver is capable of loading other models. Since I had already built a Hongkongese LLM and shared it on Hugging Face, I thought I would try making a word segmenter myself using the same approach.

One advantage of fine-tuning an LLM is that the LLM already knows a lot about the language, so we only need to show examples of how we want words to be segmented, and it will generalize to words it has never seen in those examples. This approach requires less training data than building a traditional system, which has to see a lot of data to build up a statistical map of word distributions.

Training Procedure

Hugging Face Transformers is a library for training and using LLMs. It pre-defines a list of common fine-tuning tasks. Word segmentation is not one of them, but CKIP adapted the token classification task for word segmentation. CKIP uses a ‘B’/‘I’ encoding to mark the beginning and inside of a word; the end of a word is implied when the next ‘B’ appears. Here is an example (spaces inserted between tokens to make it easier to read):

「 練 得 銅皮鐵骨 」 露宿 早 慣 蚊叮 => B B B BIII B BI B B BI

These examples are written in the Hugging Face file format for token classification, one sentence per line, like this:

{"words": ["點", "解", "啊", "?"], "ner": ["B", "I", "B", "B"]}

With the training file ready, it is just a matter of running Hugging Face’s token classification example script to fine-tune the model:

python run_ner.py --model_name_or_path toastynews/electra-hongkongese-base-discriminator --train_file finetune_hkcancor.json --output_dir toasty_base_hkcancor --do_train

This command takes care of downloading the base model, adding a token classification head, converting the training file to the appropriate internal format, and training the model. The output is a model that is ready to use and share. The model can then be loaded by CKIP Transformers like this:

from ckip_transformers.nlp import CkipWordSegmenter

ws_driver = CkipWordSegmenter(model_name="models/toasty_base_hkcancor")
tokens = ws_driver([text])  # one list of tokens per input string

Pretty simple, at least the technical part.

Training Data

To recap from the previous evaluation, there are two public Hong Kong datasets that are word segmented and POS tagged: Universal Dependencies (UD) and the Hong Kong Cantonese Corpus (HKCanCor). Since PyCantonese used HKCanCor for training and it is the larger of the two, I decided to do the same thing and keep UD as the test set.

In addition to those two, for word segmentation only, there is the SIGHAN Second International Chinese Word Segmentation Bakeoff dataset that is commonly used for evaluation in academic papers. It comprises four sub-datasets; the two in Traditional Standard Chinese are City University of Hong Kong (CityU) and CKIP, Academia Sinica (AS), and each is pre-split into training and test sets. I had some concern about them diluting the Hongkongese performance, but in experiments I found that adding CityU supplemented HKCanCor with more general knowledge and actually helped it perform better. Adding AS does degrade the Hongkongese performance a little, but the model gains significantly on Taiwan text.
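Folding these extra corpora in is mostly a file-format exercise. The SIGHAN training files are plain text with one sentence per line and words separated by spaces, so they can be converted and appended to the same JSON-lines training file. A rough sketch (file names here are placeholders for wherever the bakeoff data is unpacked):

import json

with open("cityu_training.utf8", encoding="utf-8") as fin, \
     open("finetune_hkcancor_cityu.json", "a", encoding="utf-8") as fout:
    for line in fin:
        words = line.split()
        if not words:
            continue
        chars = [ch for w in words for ch in w]
        labels = ["B" if i == 0 else "I" for w in words for i, _ in enumerate(w)]
        fout.write(json.dumps({"words": chars, "ner": labels}, ensure_ascii=False) + "\n")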

Benchmark Results

In the end, I got two models whose performance I feel is satisfactory, each available in a small and a base size:

  • Purely Hong Kong model (electra-base-hk) — Hongkongese ELECTRA model fine-tuned with HKCanCor, CityU
  • Hong Kong and Taiwan model (electra-base-hkt) — Hongkongese ELECTRA model fine-tuned with HKCanCor, CityU, AS

The following are benchmarks for these two models, with the CKIP Transformers (bert-base) results for comparison.
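For reference, segmentation benchmarks like these are usually scored with word-level precision, recall, and F1. The sketch below is a generic implementation of that metric, not necessarily the exact scoring script used for these numbers:

def to_spans(words):
    # Turn a word list into (start, end) character offsets.
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def word_f1(gold_words, pred_words):
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    p, r = correct / len(pred), correct / len(gold)
    return 2 * p * r / (p + r)

print(word_f1(["我", "哋", "食飯"], ["我哋", "食飯"]))  # 0.4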

UD yue_HK

hk is clearly better for the Hongkongese dataset.

UD zh_HK

hkt is a tiny bit better for Hong Kong Standard Chinese.

HKCanCor

These two models are trained on this dataset so it’s just here for completeness. Strangely, the small hkt model does better than the small hk model.

CityU

This dataset disagrees with UD zh_HK, with hk getting a slight edge over hkt for Hong Kong Standard Chinese.

AS

Since this is Taiwan text, the original CKIP Transformers dominates. hkt does worse and hk does much worse.

Text Examples

The text example is from a recent post on LIHKG. Segmented tokens are space-delimited.

平時喺ig就睇得多囡囡撚貓,成日見佢地話好可愛/好療癒,於是就搵間睇下#rolling#pig 啱啱上到去就有隻貓喺門口迎賓咁企喺度都幾得意 嚟到撚貓cafe,價錢貴啲,野食難食啲都預咗 重點係睇貓呀嘛

bert-tiny

平時 喺ig 就 睇 得 多 囡囡 撚貓 , 成日 見 佢 地 話 好 可愛 / 好療癒 , 於是 就 搵 間 睇下 # rolling # pig 啱啱 上到去 就 有 隻 貓 喺 門口 迎賓 咁 企喺 度 都 幾 得意 嚟 到 撚貓 cafe , 價錢 貴 啲 , 野食 難 食 啲 都 預 咗 重點 係 睇 貓 呀 嘛

electra-small-hkt

平時 喺 ig 就 睇 得 多 囡囡 撚 貓 , 成日 見 佢地 話 好 可愛 / 好 療癒 , 於是 就 搵 間 睇 下 # rolling # pig 啱啱 上 到 去 就 有 隻 貓 喺 門口 迎賓 咁 企 喺度 都 幾 得意 嚟到 撚貓 cafe , 價錢 貴 啲 , 野食 難食 啲 都 預 咗 重點 係 睇 貓 呀嘛

electra-small-hk

平時 喺 ig 就 睇 得 多 囡囡 撚 貓 , 成日 見 佢地 話 好 可愛 / 好 療癒 , 於是 就 搵 間 睇 下 # rolling #pig 啱啱 上 到 去 就 有 隻 貓 喺 門口 迎賓 咁 企 喺度 都 幾 得意 嚟到 撚 貓 cafe , 價錢 貴 啲 , 野食 難食 啲 都 預 咗 重點 係 睇 貓 呀 嘛

bert-base

平時 喺 ig 就 睇 得 多 囡囡 撚 貓 , 成日 見 佢 地 話 好 可愛 / 好 療癒 , 於是 就 搵 間 睇下 #rolling #pig 啱啱 上到 去 就 有 隻 貓 喺 門口 迎賓 咁 企 喺 度 都 幾 得意 嚟到 撚 貓 cafe , 價錢 貴 啲 , 野食 難 食 啲 都 預 咗 重點 係 睇 貓 呀 嘛

electra-base-hkt

平時 喺 ig 就 睇 得 多 囡囡 撚 貓 , 成日 見 佢地 話 好 可愛 / 好 療癒 , 於是 就 搵 間 睇 下 # rolling #pig 啱啱 上 到 去 就 有 隻 貓 喺 門口 迎賓 咁 企 喺度 都 幾 得意 嚟到 撚貓 cafe , 價錢 貴 啲 , 野食 難食 啲 都 預 咗 重點 係 睇 貓 呀 嘛

electra-base-hk

平時 喺 ig 就 睇 得 多 囡囡 撚 貓 , 成日 見 佢地 話 好 可愛 / 好 療癒 , 於是 就 搵 間 睇 下 # rolling #pig 啱啱 上 到 去 就 有 隻 貓 喺 門口 迎賓 咁 企 喺度 都 幾 得意 嚟到 撚貓 cafe , 價錢 貴 啲 , 野食 難食 啲 都 預 咗 重點 係 睇 貓 呀 嘛

Summary

With LLMs, it is possible to make a good word segmenter for Hongkongese without a lot of data. I created two models, both of which outperform existing models on Hong Kong text. The hk model seems better for Hongkongese, and hkt for Standard Chinese. The cool thing is that these were created using existing frameworks like Hugging Face and CKIP Transformers; I almost did not have to write any code to get a working system. If you have your own word-segmented Hongkongese data, you can add it to what I did and easily create an even better model. The following are links to reproduce and use:

Models on Hugging Face:

What’s next? CKIP Transformers also provides POS tagging and named entity recognition models. I’ll take a look at them too, but these are more difficult tasks and we have fewer datasets to work with.
