ELECTRA from Hong Kong Data

出嚟食飯
5 min read · Apr 14, 2020


I had wanted to experiment with BERT for some time. These are cutting-edge language models, but they require a lot of memory and compute to train, so I had focused on simpler things like word embeddings. Recently, Stanford and Google Brain released a new model called ELECTRA. The main difference from other transformer-based models is that it is formulated as a discriminator: instead of predicting masked-out tokens, it learns to tell which tokens in the input have been replaced. It needs far fewer parameters and, as a result, less memory and compute to train. The small version is so small that it is realistic to train on a mid-range GPU. I thought I'd give it a try with the data I collected for Toasty News.
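
To make the discriminator idea concrete, here is a minimal sketch using the Hugging Face transformers library and Google's public English checkpoint (an illustration only, not any of the models discussed in this post): the model scores each token of a corrupted sentence as original or replaced.

```python
# Minimal sketch of ELECTRA's replaced-token-detection objective, using the
# public English checkpoint google/electra-small-discriminator from the
# Hugging Face transformers library (illustration only, not my model).
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(name)
tokenizer = ElectraTokenizerFast.from_pretrained(name)

corrupted = "the chef ate the meal"  # "cooked" was swapped for "ate" by a generator
inputs = tokenizer(corrupted, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one real/replaced logit per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, logits[0].tolist()):
    print(f"{token:>10s}  {'replaced' if score > 0 else 'original'}")
```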

ELECTRA was trained on English data only. Luckily, the Joint Laboratory of HIT and iFLYTEK Research (HFL) quickly published Chinese-ELECTRA. Like their BERT models, Chinese-ELECTRA was trained on a massive amount of data. This is the benchmark model I will compare my model against.

Why train a different model from Chinese-ELECTRA?

Hong Kong is a diglossic society: the written language is Standard Chinese, while the spoken language is Hong Kong Cantonese/Yue (Hongkongese). Although Hongkongese is mostly spoken, it is increasingly used for informal written communication in forums and blogs. In 2019, 5.1% of articles on Hong Kong Internet media sites were written in Hongkongese; the percentage is much higher in forums like LIHKG.

To give some perspective, in the OSCAR extraction of Common Crawl, the Chinese data is 249 GB, of which only about 30 MB is Hongkongese. I don't know the mix of the data used by HFL, but I would guess the Hongkongese proportion is similar. Intuitively, a model trained with a higher proportion of Hongkongese should understand Hongkongese better.
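
As a rough illustration of how the two varieties can be separated at all (OSCAR itself uses a fastText language classifier, and I don't know what filtering, if any, HFL did), a simple character-based heuristic goes a long way, since written Cantonese uses characters that are rare in Standard Chinese. The marker set below is my own assumption, not anyone's actual filter:

```python
# Toy heuristic for spotting written Cantonese (Hongkongese): characters like
# 嘅, 咗, 唔, 佢 are common in Cantonese writing but rare in Standard Chinese.
# The marker set is an assumption for illustration, not OSCAR's or HFL's filter.
CANTONESE_MARKERS = set("嘅咗喺嗰啲咁冇乜嘢佢哋唔")

def looks_like_hongkongese(text: str, threshold: float = 0.02) -> bool:
    """Return True if the share of Cantonese marker characters exceeds threshold."""
    if not text:
        return False
    hits = sum(1 for ch in text if ch in CANTONESE_MARKERS)
    return hits / len(text) >= threshold

print(looks_like_hongkongese("佢哋今日唔嚟食飯"))    # True  (Hongkongese)
print(looks_like_hongkongese("他們今天不來吃飯"))    # False (Standard Chinese)
```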

Training the model

I have an RTX 2070 Super with 8 GB of VRAM, so the model I could train was quite limited. Here are some key details:

  • Number of tokens — 362 million (about 7% of Chinese-ELECTRA)
  • Sources — LIHKG, The Encyclopedia of Virtual Communities in Hong Kong, Toasty News and Yue Wikipedia
  • Hongkongese proportion — 10%
  • Batch size — 96 (compared to 1024)
  • Max sequence length — 128 (compared to 512)

The reason I did not use purely Hongkongese data is that I don't have enough of it. The sources chosen are probably 99% written by Hongkongers, so the model should still pick up the Hongkonger way of writing.

Training the model was easy with the official code; it just took a lot of time. Completing one million training steps took about five days (I stopped it occasionally to play with intermediate checkpoints). For full details and downloads, check the repo on GitHub.
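
For reference, pretraining with the official google-research/electra scripts looks roughly like the sketch below. The paths, model name, and exact flag/hparam names are assumptions based on the repo's README (check build_pretraining_dataset.py and configure_pretraining.py for the authoritative names); only the numbers mirror the settings listed above.

```python
# Rough sketch of driving the official ELECTRA pretraining scripts from Python.
# Flag and hparam names are assumed from the google-research/electra README;
# verify against the repo before running.
import json
import subprocess

DATA_DIR = "data/electra_hk"  # holds vocab.txt and the generated tfrecords

# Mirrors the settings above: small model, batch size 96, max length 128, 1M steps.
hparams = {
    "model_size": "small",
    "train_batch_size": 96,
    "max_seq_length": 128,
    "num_train_steps": 1_000_000,
}

# Step 1: turn the raw text corpus into tfrecords.
subprocess.run([
    "python3", "build_pretraining_dataset.py",
    "--corpus-dir", "corpus/",
    "--vocab-file", f"{DATA_DIR}/vocab.txt",
    "--output-dir", f"{DATA_DIR}/pretrain_tfrecords",
    "--max-seq-length", "128",
], check=True)

# Step 2: pretrain (the part that took about five days on the 2070 Super).
subprocess.run([
    "python3", "run_pretraining.py",
    "--data-dir", DATA_DIR,
    "--model-name", "electra_hongkongese_small",
    "--hparams", json.dumps(hparams),
], check=True)
```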

Chinese Evaluation

HFL publishes benchmark scores comparing all their BERT variants. I hunted down the same data and ran the same suite. Here are the scores of my Hongkongese model (red) against Chinese-ELECTRA-small (blue).

Performance chart comparing five fine-tuning tasks, higher is better

Chinese-ELECTRA-small beats my model in every task. The classification-type tasks (bottom three) are pretty close. These tasks contain a mix of Traditional and Simplified Chinese text.

The results on the question answering tasks (top two) are farther apart. CMRC 2018 is in Simplified Chinese and DRCD is in Traditional Chinese. My Hongkongese model probably came closer on the Traditional Chinese task because Hongkongers write in the same character set. One thing to note is that the max sequence length of my model is only 128, which is sometimes not long enough to capture all the information needed to answer a question correctly.

Hongkongese Evaluation

How well does it do on purely Hongkongese tasks? This is difficult to answer because such tasks do not exist, so I created two classification tasks using scraped data.

OpenRice Sentiment Analysis

OpenRice is the biggest restaurant review site in Hong Kong. The task is to use the review subject and text to predict whether a review is positive, neutral, or negative. OpenRice shows a smiley, OK, or crying face on each review, which maps directly to these sentiments.

Example OpenRice review

I filtered the reviews to keep only those written in Hongkongese. As you can see in the example review, some words, like 懲罰 ("punishment"), are used in Hong Kong in a different sense than in mainland China. Systems without contextual understanding might be confused by this. You can get this data at this repo.
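
Fine-tuning for this three-class task is straightforward with the Hugging Face transformers library. The sketch below shows a single training step; the checkpoint name is the HFL benchmark model as I believe it appears on the model hub (verify it, or substitute the Hongkongese checkpoint from the GitHub repo), and the example review is made up.

```python
# Minimal sketch of one fine-tuning step for the 3-class OpenRice sentiment task.
# The hub name below is assumed; swap in whichever ELECTRA checkpoint you use.
import torch
from transformers import AutoTokenizer, ElectraForSequenceClassification

MODEL = "hfl/chinese-electra-small-discriminator"   # assumed hub name
LABELS = ["negative", "neutral", "positive"]        # crying / OK / smiley face

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = ElectraForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

# One toy example: review subject as segment A, review text as segment B.
subject = "抵食夾大件"                            # "cheap and generous portions"
review = "個廚師好用心，份量十足，一定會再嚟。"      # "attentive chef, big portions, will come again"
enc = tokenizer(subject, review, truncation=True, max_length=128, return_tensors="pt")
labels = torch.tensor([LABELS.index("positive")])   # smiley face

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
out = model(**enc, labels=labels)   # returns cross-entropy loss over the 3 classes
out.loss.backward()
optimizer.step()
print(float(out.loss))
```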

LIHKG Categorization

Example forum categories in LIHKG

LIHKG is one of the most popular forums in Hong Kong. It has different categories that users can post in. The task is to use the thread subject and the text of the first post to predict the category.

This can be a pretty difficult task if the system does not have enough general knowledge. For example, a forum thread could look like this:

Subject: S20 vs Pixel 4XL

First Post: Which one better ?

If the system does not know that the S20 and Pixel 4XL are phones, it cannot predict that this thread belongs to the Cell Phone category. This task requires broad general knowledge. You can get this data at this repo.
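
At prediction time, the subject and first post are fed to a fine-tuned classifier as a sentence pair, truncated to the model's maximum length. Here is a small sketch; the checkpoint path and the category list are placeholders, since the real label set ships with the dataset in the repo.

```python
# Sketch of predicting a LIHKG category from the thread subject and first post.
# The checkpoint path and category list are placeholders for illustration.
import torch
from transformers import AutoTokenizer, ElectraForSequenceClassification

CHECKPOINT = "path/to/finetuned-lihkg-classifier"   # a locally fine-tuned model
CATEGORIES = ["吹水台", "手機台", "飲食台"]             # chit-chat / cell phone / food (subset)

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = ElectraForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=len(CATEGORIES))
model.eval()

subject = "S20 vs Pixel 4XL"
first_post = "Which one better ?"

# Subject as segment A, first post as segment B, truncated to 128 tokens.
enc = tokenizer(subject, first_post, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(CATEGORIES[int(logits.argmax(dim=-1))])  # ideally the Cell Phone (手機台) category
```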

Hongkongese Results

Using the two Hongkongese tasks, here are the results.

Performance chart comparing two Hongkongese tasks

Chinese-ELECTRA-small still beats my model in these two tasks. Why???

I think one reason is that classification tasks in general do not require a strong understanding of grammar; the presence of certain keywords is usually sufficient to decide the class. The Chinese-ELECTRA models were also trained on 15 times more data, so they have much broader general knowledge than my model, whose data is mostly political news.

Conclusion

Chinese-ELECTRA is very good. It beats the Hongkongese model I trained in every task. It should be the starting point for anyone starting a project, even one targeting Hongkongese.

The good news is that I trained my model with far less data and compute and still achieved performance very close to Chinese-ELECTRA. It speaks to ELECTRA's ability to learn efficiently. One small complaint is that, being a discriminator, ELECTRA doesn't lend itself to much of the cool stuff, like generating a paragraph or filling in the blanks.

Some of my next steps are to collect more diverse Hongkongese data, create more difficult Hongkongese tasks, and maybe train a model capable of generating text like GPT-2 (just a tiny, tiny version).

Technical Resources

Pretrained models:

Hongkongese fine-tuning tasks:
