Evaluating Cantonese Performance in NLP Systems

出嚟食飯 · Jan 4, 2023


In natural language processing (NLP), the first step is often to tokenize a string, and the second is to annotate the tokens with part-of-speech (POS) tags. There are many Chinese word segmenters and POS taggers, but almost none for Cantonese. The usual strategy is to load a custom dictionary into a Chinese NLP system to make it work, and then do some sanity checks with example sentences. This is largely because there is no standard benchmark for Cantonese processing. In this post, I describe how I used existing data to compare several popular NLP systems.
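
As a rough illustration of that strategy, here is a minimal sketch of loading a custom dictionary into jieba; the dictionary file name and its entries are hypothetical, not taken from any specific project.

```python
import jieba

# Hypothetical Cantonese user dictionary: one entry per line in jieba's
# "word frequency [pos]" format, e.g. lines like "嘅 10000 u" or "唔係 5000 d".
jieba.load_userdict("cantonese_dict.txt")

# After loading, Cantonese-specific words are segmented as single tokens.
print(list(jieba.cut("我哋今日出嚟食飯")))
```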

Dataset

For the benchmark, I used the UD Cantonese HK and UD Chinese HK datasets. As far as I know, none of the NLP systems I tested were trained on this data. What makes this pair of datasets interesting is that they are parallel corpora of Cantonese and Standard Chinese built from the same content, so doing well on both would mean a system can handle both the written and the spoken text coming out of Hong Kong. Both datasets were manually segmented and annotated with Universal Dependencies (UD) style POS tags. UD POS tags are supported by PyCantonese and spaCy, which makes them a convenient common denominator. The datasets are also small enough that the evaluation doesn't take long.
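
As a concrete starting point, here is a minimal sketch of reading the gold tokens and UD POS tags from one of these treebanks with the conllu package; the file name is an assumption based on the UD repository layout.

```python
import conllu  # pip install conllu

# File name assumed from the UD_Cantonese-HK repository.
with open("yue_hk-ud-test.conllu", encoding="utf-8") as f:
    sentences = conllu.parse(f.read())

# Gold segmentation and UD POS tags, sentence by sentence.
gold = [[(tok["form"], tok["upos"]) for tok in sent] for sent in sentences]
print(gold[0])
```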

Another dataset I used was the Hong Kong Cantonese Corpus (HKCanCor). It is well known and can be loaded natively with PyCantonese; in fact, PyCantonese was trained on it. As such, I only used it as a sanity check to make sure the trends observed on the UD datasets are not outliers. Its POS tags are in ICTCLAS style, and PyCantonese provides a converter.
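
Loading HKCanCor through PyCantonese looks roughly like this; the exact reader method and converter names are from my reading of the PyCantonese docs, so treat them as assumptions.

```python
import pycantonese
from pycantonese.pos_tagging import hkcancor_to_ud

corpus = pycantonese.hkcancor()  # bundled HKCanCor data

# Tokens grouped by utterance; each token carries a word and an HKCanCor-style POS tag.
first_utterance = corpus.tokens(by_utterances=True)[0]
print([(tok.word, tok.pos) for tok in first_utterance])

# PyCantonese's converter maps the HKCanCor tags to UD tags.
print([(tok.word, hkcancor_to_ud(tok.pos)) for tok in first_utterance])
```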

NLP Systems

I ran the benchmark against the following systems, which are probably some of the more popular ones currently in use.

spaCy

I have used spaCy for English and found it very useful, so I wanted to know how well its Chinese pipelines do. There are three sizes of CPU pipelines and a transformer pipeline. The segmenter is the same across all of them: a pkuseg model trained on OntoNotes. The POS tagger outputs both OntoNotes and UD tags. I tested the small pipeline and the transformer pipeline.
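
A minimal sketch of running one of the Chinese pipelines (the model names are spaCy's standard ones; the example sentence is arbitrary):

```python
import spacy

# Download first, e.g.:
#   python -m spacy download zh_core_web_sm
#   python -m spacy download zh_core_web_trf
nlp = spacy.load("zh_core_web_sm")

doc = nlp("警察拘捕了三個人。")
# token.pos_ is the UD tag, token.tag_ the finer-grained OntoNotes-style tag.
print([(t.text, t.pos_, t.tag_) for t in doc])
```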

pkuseg

Since spaCy uses a custom pkuseg segmenter model, I also tested the default pkuseg model. pkuseg can do its own POS tagging as well, using ICTCLAS tags.
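
Here is a minimal sketch of the default pkuseg model with POS tagging turned on (the example sentence is arbitrary):

```python
import pkuseg

# Default model; postag=True makes cut() return (word, ICTCLAS-style tag) pairs.
seg = pkuseg.pkuseg(postag=True)
print(seg.cut("警察拘捕了三个人。"))
```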

cantoseg

cantoseg is one of those projects that inject a Cantonese dictionary into a Chinese NLP system to make it handle Cantonese better. It takes the dictionary from PyCantonese and normalizes its frequencies so that 嘅 ends up as frequent as 的 is in the jieba big dictionary. This lets it add Cantonese-specific knowledge without losing what jieba has learned in general.
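
A rough sketch of that normalization idea (this is my reading of the approach, not cantoseg's actual code; the function and argument names are hypothetical):

```python
def build_combined_dict(canto_freq: dict, jieba_freq: dict) -> dict:
    """Merge a Cantonese word-frequency list into jieba's big dictionary.

    canto_freq: word -> count from a Cantonese word list (e.g. derived from PyCantonese)
    jieba_freq: word -> count from jieba's big dictionary
    """
    # Scale Cantonese counts so that 嘅 ends up as frequent as 的 is in jieba's dictionary.
    scale = jieba_freq["的"] / canto_freq["嘅"]
    combined = dict(jieba_freq)
    for word, count in canto_freq.items():
        combined[word] = max(combined.get(word, 0), int(count * scale))
    return combined
```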

jieba

jieba is one of the better-known systems and a common base for loading custom dictionaries. As with pkuseg, I tested the defaults, except that I used the big dictionary, which has better support for traditional characters. jieba can do POS tagging with ICTCLAS tags.
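
A minimal sketch of jieba with the big dictionary and its POS tagger; the dictionary path is an assumption (the file ships separately from the jieba repo):

```python
import jieba
import jieba.posseg as pseg

# Big dictionary for better traditional-character coverage; the path is an assumption.
jieba.set_dictionary("dict.txt.big")

# posseg yields pairs with .word and .flag (an ICTCLAS-style tag).
print([(p.word, p.flag) for p in pseg.cut("警察拘捕咗三個人。")])
```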

PyCantonese

PyCantonese is a Cantonese-specific NLP system with rich support for general linguistic research, and it is very well known in the Cantonese world. It can do word segmentation and POS tagging, and its tagger can output either ICTCLAS-style or UD tags.
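
Segmentation and tagging with PyCantonese look roughly like this; the tagset argument values are from its docs as I remember them, so treat them as assumptions:

```python
import pycantonese

words = pycantonese.segment("佢哋今日唔使返工")
print(words)

# tagset="universal" requests UD tags; tagset="hkcancor" the ICTCLAS-style originals.
print(pycantonese.pos_tag(words, tagset="universal"))
```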

CKIP Transformers

CKIP Transformers (CKIP) is developed by the CKIP Lab at Academia Sinica. Being transformer-based, it is probably the best NLP system that is native to traditional characters, but it is also much slower than any of the other systems. I picked its best-performing model, bert-base. The POS tags are in CKIP style.
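
A minimal sketch of CKIP Transformers with the bert-base drivers (the models are downloaded on first run, and a GPU helps a lot):

```python
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")

sentences = ["警察拘捕咗三個人。"]
ws_out = ws_driver(sentences)    # one token list per input sentence
pos_out = pos_driver(ws_out)     # CKIP-style tags aligned with the tokens
print(list(zip(ws_out[0], pos_out[0])))
```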

Evaluation Method

I used spaCy's built-in Scorer to calculate the metrics. It handles token alignment automatically, which is convenient. I collected the following metrics (a minimal scoring sketch follows the list):

  • token_p — token precision
  • token_r — token recall
  • token_f — token F1
  • pos_acc — POS accuracy, which depends on token accuracy

(token_acc behaves strangely as of this writing, so I'm not using it)
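
To make this concrete, here is a minimal sketch of scoring one predicted segmentation against the gold; the token and tag lists are placeholders, and a real run would build one Example per sentence.

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.scorer import Scorer

nlp = spacy.blank("zh")

# Gold segmentation and UD tags from the treebank (placeholder values).
gold_words = ["警察", "拘捕", "咗", "三", "個", "人", "。"]
gold_pos = ["NOUN", "VERB", "AUX", "NUM", "NOUN", "NOUN", "PUNCT"]

# Output from the system under test, already mapped to UD tags (placeholder values).
pred_words = ["警察", "拘捕咗", "三", "個", "人", "。"]
pred_pos = ["NOUN", "VERB", "NUM", "NOUN", "NOUN", "PUNCT"]

reference = Doc(nlp.vocab, words=gold_words, pos=gold_pos)
predicted = Doc(nlp.vocab, words=pred_words, pos=pred_pos)

scores = Scorer().score([Example(predicted, reference)])
print({k: scores[k] for k in ("token_p", "token_r", "token_f", "pos_acc")})
```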

For POS tagging, different systems output different tag sets, while the UD datasets only contain UD tags. Luckily, the UD tag set is coarser-grained than the others, so I decided to map everything to UD tags for the evaluation. For ICTCLAS tags, I used PyCantonese's converter as the base and mapped the jieba tags it doesn't handle myself. For CKIP tags, I mapped all of the tags myself.
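
The mapping itself is just a lookup table. Here is an illustrative subset for ICTCLAS-style tags; it is not the exact table used for the benchmark, and the individual mappings are my assumptions.

```python
# Illustrative subset of an ICTCLAS-to-UD mapping; the benchmark's actual table
# is based on PyCantonese's converter plus manual additions.
ICTCLAS_TO_UD = {
    "n": "NOUN", "v": "VERB", "a": "ADJ", "d": "ADV",
    "r": "PRON", "m": "NUM", "q": "NOUN", "p": "ADP",
    "c": "CCONJ", "u": "PART", "w": "PUNCT", "eng": "X",
}

def to_ud(ictclas_tag: str) -> str:
    # Fall back to X for anything the (partial) table doesn't cover.
    return ICTCLAS_TO_UD.get(ictclas_tag.lower(), "X")
```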

Results

Here are the results. Check the GitHub repo for the data in table form and the code to reproduce them.

Chart showing performance of the different systems on the UD Cantonese HK dataset. The highest are CKIP, PyCantonese and cantoseg.
Performance on UD Cantonese HK dataset

As expected, PyCantonese and cantoseg did very well on Cantonese word segmentation; however, CKIP outperformed both of them, and spaCy was a step below all the other systems. CKIP and PyCantonese also did well on POS tagging, but cantoseg (and jieba) did horribly.

Chart showing performance of the different systems on the UD Chinese HK dataset. The highest are CKIP and pkuseg.
Performance on UD Chinese HK dataset

On Chinese text from Hong Kong, CKIP and pkuseg did the best on word segmentation. Somewhat surprisingly, jieba could not match these two, and its POS tagger continued to be quite bad. spaCy was able to match most systems on this dataset, and the transformer version came close to CKIP on POS tagging.

Chart showing performance of the different systems on the HKCanCor dataset. The highest are cantoseg and CKIP.
Performance on HKCanCor dataset

PyCantonese was trained on HKCanCor, so this is not a fair comparison; its numbers are reported here just for completeness. cantoseg and CKIP did the best, with pkuseg and jieba a step below, and spaCy another step below that. It would seem HKCanCor may be the most difficult test, and the one best able to distinguish which systems can really handle Cantonese.

Summary

PyCantonese and cantoseg, both built with Cantonese in mind, perform well on Cantonese word segmentation, and PyCantonese also does well on POS tagging. Note that what I'm calling good here is really only around 85%, whereas most systems developed for a language should be performing at 95%+. There is still a long way to go for Cantonese NLP systems.

What's interesting is that CKIP outperforms both of them in some situations. I haven't looked at the actual examples, but I suspect it does well on the general-knowledge side while PyCantonese and cantoseg do well on Cantonese-specific constructs. If this conjecture is true, there may be a case for fine-tuning CKIP with some Cantonese data, or for using these systems in an ensemble.
