Making a Thai byte-level language model

When word and sub-word don’t do the job

Nick Doiron
CodeX
4 min read · Jul 4, 2021

A Thai BERT model that I’d adapted recently appeared in a paper on cross-language learning in biomedicine. The same model shows up in the spaCy docs, and people use it on GitHub and Kaggle. Like most NLP models, it relies on a tokenizer that splits text on spaces into words, then breaks those words into sub-word prefixes and suffixes. That assumption falls short in many global languages: Thai, for example, does not put spaces between most words.

WangchanBERTa, released by the AI Research Institute of Thailand in 2021, is a better model. The model card shows the steps needed for preprocessing with the PyThaiNLP library, and I’d recommend it over my own model. But coders are often looking for models they can drop into an existing pipeline. We could add preprocessing notes to the docs or support more tokenizers in HuggingFace, but then two new approaches caught my attention…
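
To make the space issue concrete, here is a minimal sketch of what the PyThaiNLP preprocessing step does (the example sentence is mine, and the exact segmentation depends on the engine and dictionary version):

```python
# Thai has no spaces between words, so whitespace splitting returns the whole
# sentence as one "word"; PyThaiNLP's dictionary-based tokenizer recovers boundaries.
# Requires: pip install pythainlp
from pythainlp.tokenize import word_tokenize

text = "ฉันชอบกินข้าวผัด"  # "I like to eat fried rice"

print(text.split())         # ['ฉันชอบกินข้าวผัด']  (no spaces to split on)
print(word_tokenize(text))  # roughly ['ฉัน', 'ชอบ', 'กิน', 'ข้าวผัด']
```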

Enter Charformer and ByT5

In June 2021, separate groups at Google Research posted papers about character- and byte-level transformer models. After computational linguists worked so hard on parsing text into meaningful morphemes and embeddings, the idea that ‘boba’ and ‘baboon’ should simply be split into individual letters or bytes, with the model left to derive its own rules, is a major break with tradition. The architecture of the first network layers has to change, too (from tens of thousands of word and sub-word embeddings to a few hundred byte values). But after digging into tokenizer issues on other models, maybe this could be an improvement?
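
To make the break concrete, here is a minimal sketch of what a model sees at the byte level (the ByT5 id offset is my reading of the released tokenizer, not a detail from the paper):

```python
# A byte-level model sees raw UTF-8 bytes, so its "vocabulary" is a few hundred
# symbols instead of tens of thousands of words and sub-words.
from transformers import AutoTokenizer

print(list("boba".encode("utf-8")))   # [98, 111, 98, 97]
print(len("ข้าวผัด".encode("utf-8")))  # 21 bytes for 7 Thai characters

# ByT5's tokenizer appears to be just this byte stream, shifted by 3
# to leave room for the pad/eos/unk special tokens:
tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(tok("boba").input_ids)          # roughly [101, 114, 101, 100, 1]
```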

[Embedded tweet from Colin Raffel, an author of the ByT5 paper]

[Embedded tweet from Yi Tay, an author of the Charformer paper]

My takeaway is that Google is involved in these models because a prescriptive token vocabulary is an obstacle for super-all-purpose tasks like Google Search. The word list for English alone would be so huge, and language diversity so challenging, that they want to let the model figure out a neural solution.

I should mention that byte-level tokenization is not new: Facebook Research explored it for translation in late 2019, wav2vec does something similar in speech-to-text, and another Google paper, CANINE, appeared earlier in 2021 (I liked that paper for grouping languages such as Arabic, Greenlandic, and Thai by how much they could benefit).

Pre-training and Fine-tuning ByT5

Lucky for us, ByT5 comes pre-trained on mC4, a multilingual dataset that includes Thai. If you want to pre-train on Thai only, it unfortunately takes major resources to read and split one language off from mC4 (Korean example here). If you want to pre-train on your own dataset or on a language left out of mC4, I started a notebook here. I hope to try this for Dhivehi in the near future.
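
If you only need to iterate over the Thai portion, streaming may be enough to avoid downloading and splitting the whole dump. A hedged sketch, assuming the mC4 dataset on the HuggingFace Hub exposes a “th” configuration:

```python
# Stream the Thai slice of mC4 instead of materializing the full multilingual dump.
from datasets import load_dataset

mc4_th = load_dataset("mc4", "th", split="train", streaming=True)

for i, example in enumerate(mc4_th):
    print(example["text"][:80])
    if i >= 2:
        break
```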

Most users might want to drop ByT5-large into their fine-tuning pipeline and call it a day. Unfortunately, most T5 code samples are designed for text-to-text generation (e.g. HuggingFace’s T5ForConditionalGeneration), and making a classifier requires writing some PyTorch; a sketch follows the list below. I borrowed heavily from Suraj Patil’s T5 notebooks here, with my updates:

  • accommodating changes to the T5 model
  • newer versions of dependencies
  • using HuggingFace’s datasets library in place of the original CSVs.
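
The sketch below is not the exact code from my notebook; it is one minimal way to bolt a classification head onto the ByT5 encoder (mean-pool the encoder states, then add a linear layer), assuming you want logits rather than generated label text:

```python
# A minimal ByT5 classifier: encoder only, mean-pooled hidden states, linear head.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class ByT5Classifier(nn.Module):
    def __init__(self, model_name="google/byt5-small", num_labels=4):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # mean-pool over byte positions, ignoring padding
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = ByT5Classifier(num_labels=4)  # Wisesight Sentiment has 4 classes
batch = tokenizer(["อร่อยมาก", "ไม่ชอบเลย"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```

Keeping T5’s text-to-text head and training it to generate the label as a string is the other common pattern, and is what most existing T5 notebooks do.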

Final notebook: after issues training on the IMDB and Wongnai datasets, I fine-tuned byt5-small on the Wisesight Sentiment dataset. On a Google Cloud A100 you can upgrade to byt5-base, but not byt5-large or bigger.
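
For reference, loading and byte-tokenizing the fine-tuning data looks roughly like this (the wisesight_sentiment column names are from memory, so check dataset.features before trusting them):

```python
# Load the Wisesight Sentiment dataset and encode it for ByT5.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wisesight_sentiment")
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

def preprocess(batch):
    # byte sequences run long, so budget max_length generously
    enc = tokenizer(batch["texts"], truncation=True, max_length=512)
    enc["labels"] = batch["category"]
    return enc

encoded = dataset.map(preprocess, batched=True)
print(encoded["train"][0].keys())
```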

Someone who reviewed my notebook recommended a less DIY T5 training library, which will support ByT5 soon: pypi.org/project/simplet5/

figure from ByT5 paper

Should we use byte models now? The ByT5 paper admits that mT5 (a multilingual model with a more standard tokenizer) still performed better on Thai XNLI.

languages with a higher SentencePiece token compression rate (e.g. Thai and Telugu) tend to favor mT5, whereas those with a lower compression rate… tend to favor ByT5
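
The “compression rate” here is roughly how many UTF-8 bytes each SentencePiece token covers; Thai script uses three bytes per character, so its sequences stretch out much more for ByT5 than for mT5. A quick, hedged way to see the effect (exact numbers depend on tokenizer versions):

```python
# Compare bytes per SentencePiece token for the mT5 tokenizer on Thai vs. English.
from transformers import AutoTokenizer

mt5_tok = AutoTokenizer.from_pretrained("google/mt5-small")

for text in ["ฉันชอบกินข้าวผัด", "I like to eat fried rice"]:
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(mt5_tok(text, add_special_tokens=False).input_ids)
    print(f"{n_bytes / n_tokens:.1f} bytes per token for: {text}")
```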

I wonder whether a Thai-exclusive or longer-trained model would show better results here, or whether each model simply performs better on different tasks.

Charformer

The code for Charformer was just published on GitHub on June 29th, so there isn’t a HuggingFace model available yet (issue).
The main development is a soft, gradient-based subword tokenization (GBST) module that is trained jointly with, and added to the start of, a larger transformer network. This video by Letitia Parcalabescu helps explain the concept.
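
As a rough illustration of the idea (a toy sketch, not the official Charformer code): pool candidate blocks of a few sizes at every byte position, score them, mix them with a softmax over block sizes, and downsample before the transformer stack.

```python
# Toy GBST: soft selection over candidate block sizes, then downsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGBST(nn.Module):
    def __init__(self, vocab_size=259, dim=128, block_sizes=(1, 2, 3, 4), downsample=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.scorer = nn.Linear(dim, 1)
        self.block_sizes = block_sizes
        self.downsample = downsample

    def forward(self, byte_ids):                 # (batch, seq_len)
        x = self.embed(byte_ids)                 # (batch, seq_len, dim)
        candidates, scores = [], []
        for b in self.block_sizes:
            # average-pool non-overlapping blocks of size b, stretch back to seq_len
            pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=b, stride=b, ceil_mode=True)
            stretched = pooled.repeat_interleave(b, dim=2)[:, :, : x.size(1)].transpose(1, 2)
            candidates.append(stretched)               # (batch, seq_len, dim)
            scores.append(self.scorer(stretched))      # (batch, seq_len, 1)
        weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
        mixed = sum(w.unsqueeze(-1) * c for w, c in zip(weights.unbind(-1), candidates))
        # downsample so the transformer on top sees a shorter sequence
        return F.avg_pool1d(mixed.transpose(1, 2), self.downsample, self.downsample,
                            ceil_mode=True).transpose(1, 2)

out = ToyGBST()(torch.randint(0, 259, (2, 32)))  # -> roughly (2, 8, 128)
```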

Searching GitHub for unofficial implementations, I found this text embedding library by Chenghao Mou. Everything is still under development and requires PyTorch Nightly, but the examples give an idea of how to incorporate Charformer embeddings into PyTorch Lightning modules.

Updates?

This article was posted in July 2021. For my latest recommendations, check the Thai NLP section of this GitHub Readme.
