3 subword algorithms to help improve your NLP model performance

An introduction to subwords

Photo by Edward Ma on Unsplash

Classic word representations cannot handle unseen or rare words well. Character embeddings are one solution to the out-of-vocabulary (OOV) problem, but they may be too fine-grained and miss important information. Subwords sit between words and characters: they are not too fine-grained, yet they can still handle unseen and rare words.

For example, we can split “subword” into “sub” and “word”. In other words, we use two vectors (i.e. “sub” and “word”) to represent “subword”. You may argue that this takes more resources to compute, but in reality it has a smaller footprint than a word-level representation.
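To make this concrete, here is a tiny sketch of the idea (the embedding table and the averaging rule below are assumptions for illustration only, not a specific library's behaviour): an unseen word can be represented by combining the vectors of its subwords.

import numpy as np

# Hypothetical learned subword vectors; real models learn these during training.
subword_vectors = {
    "sub": np.array([0.1, 0.3, -0.2]),
    "word": np.array([0.4, -0.1, 0.5]),
}

def embed(subwords):
    # One simple composition choice: average the subword vectors.
    return np.mean([subword_vectors[s] for s in subwords], axis=0)

print(embed(["sub", "word"]))  # a vector for the (possibly unseen) word "subword"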

This story discusses SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) and the different subword algorithms it builds on. The following will be covered:

  • Byte Pair Encoding (BPE)
  • WordPiece
  • Unigram Language Model
  • SentencePiece

Byte Pair Encoding (BPE)

Sennrich et al. (2016) proposed Byte Pair Encoding (BPE) for building a subword dictionary. Radford et al. adopted BPE to construct subword vectors when building GPT-2 in 2019.

Algorithm

  1. Prepare a large enough training data (i.e. corpus)
  2. Define the desired subword vocabulary size
  3. Split each word into a sequence of characters, append an end-of-word symbol and count the word frequency
  4. Generate a new subword by merging the pair of symbols with the highest frequency
  5. Repeat step 4 until the target vocabulary size is reached or the next highest frequency pair has a count of 1
Algorithm of BPE (Sennrich et al., 2015)

Example

Taking “low: 5”, “lower: 2”, “newest: 6” and “widest: 3” as an example, the highest frequency subword pair is “e” and “s”, because we get 6 counts from “newest” and 3 counts from “widest”, 9 in total. A new subword “es” is then formed, and it becomes a candidate in the next iteration.

In the second iteration, the next highest frequency subword pair is “es” (generated in the previous iteration) and “t”, again because we get 6 counts from “newest” and 3 counts from “widest”.

Keep iterating until the desired vocabulary size is reached or the next highest frequency pair has a count of 1.
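Below is a minimal Python sketch of this merge loop, closely following the pseudo-code in Sennrich et al.; the toy word counts are the ones from the example above, and the number of merges is arbitrary.

import re
import collections

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair across the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Merge the chosen pair into a single new symbol everywhere it occurs.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words split into characters plus an end-of-word marker, with their frequencies.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # arbitrary number of merges for this toy corpus
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # the first merge is ('e', 's') with a count of 9, as described above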

WordPiece

WordPiece is another word segmentation algorithm, similar to BPE. Schuster and Nakajima introduced WordPiece in 2012 while working on Japanese and Korean voice search. The main difference from BPE is that a new subword is formed based on likelihood rather than on the next highest frequency pair (a rough scoring sketch follows the algorithm below).

Algorithm

  1. Prepare a large enough training data (i.e. corpus)
  2. Define the desired subword vocabulary size
  3. Split each word into a sequence of characters
  4. Build a language model based on the data from step 3
  5. Choose the new subword unit (a combination of existing units) that most increases the likelihood of the training data when added to the model
  6. Repeat step 5 until the target vocabulary size is reached or the likelihood increase falls below a threshold
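As a rough sketch of the likelihood-based criterion, the snippet below scores a candidate pair by freq(ab) / (freq(a) × freq(b)), a common approximation of the likelihood gain under a unigram model; it is an illustration, not the exact procedure from Schuster and Nakajima.

import collections

def best_wordpiece_merge(vocab):
    # Pick the pair whose merge most increases corpus likelihood under a unigram model,
    # approximated here by freq(pair) / (freq(first) * freq(second)).
    pair_freq = collections.defaultdict(int)
    symbol_freq = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[a, b] += freq
    return max(pair_freq, key=lambda p: pair_freq[p] / (symbol_freq[p[0]] * symbol_freq[p[1]]))

vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
print(best_wordpiece_merge(vocab))  # may differ from the pair plain BPE would pick first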

Unigram Language Model

Kudo (2018) introduced the unigram language model as another algorithm for subword segmentation. It assumes that subword occurrences are independent, so the probability of a subword sequence is the product of the individual subword occurrence probabilities. Both WordPiece and the unigram language model leverage a language model to build the subword vocabulary.
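To make the independence assumption concrete, here is a small sketch (the probability table is invented for illustration) that scores a segmentation as the product of subword probabilities and picks the best one with a simple Viterbi-style search:

import math

# Hypothetical subword probabilities from a trained unigram language model.
probs = {'sub': 0.05, 'word': 0.04, 'subword': 0.001, 's': 0.01, 'u': 0.01, 'b': 0.01}

def best_segmentation(text):
    # best[i] holds (log-probability, pieces) of the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, []) for _ in text]
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

print(best_segmentation('subword'))  # 'sub' + 'word' beats the rarer single token 'subword'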

Algorithm

  1. Prepare a large enough training data (i.e. corpus)
  2. Define the desired subword vocabulary size
  3. Optimize the subword occurrence probabilities over the current vocabulary (e.g. with the EM algorithm)
  4. Compute the loss (drop in corpus likelihood) incurred by removing each subword
  5. Sort subwords by this loss and keep the top portion, always retaining single characters so that no word becomes out-of-vocabulary
  6. Repeat steps 3 to 5 until the target vocabulary size is reached

SentencePiece

So, is there an existing library we can leverage for our text processing? Kudo and Richardson implemented the SentencePiece library. You train the tokenizer on your own data so that you can encode and decode text for downstream tasks.

First of all, prepare a plain text file containing your data, then trigger the following API to train the model:

import sentencepiece as spm

# Trains a subword model on test/botchan.txt with a 1,000-token vocabulary;
# the output files are m.model and m.vocab (prefix set by --model_prefix).
spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')

Training is fast, and you can load the resulting model with:

# Load the trained model file (m.model) for encoding and decoding.
sp = spm.SentencePieceProcessor()
sp.Load("m.model")

To encode your text, you just need to call:

sp.EncodeAsIds("This is a test")
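Encoding into pieces and decoding back works the same way (the sample sentence is arbitrary; these are standard methods of the SentencePiece Python API):

pieces = sp.EncodeAsPieces("This is a test")  # subword strings, prefixed with the '▁' whitespace marker
ids = sp.EncodeAsIds("This is a test")        # the same segmentation as vocabulary ids
print(sp.DecodePieces(pieces))                # 'This is a test'
print(sp.DecodeIds(ids))                      # 'This is a test'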

For more examples and usage, you can check the SentencePiece repo.

Take Away

  • Subwords balance vocabulary size and footprint. In the extreme case, we could use only 26 tokens (i.e. characters) to represent all English words. A vocabulary of 16k or 32k subwords is a commonly recommended size for good results (see the short training sketch below).
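As a concrete example (the corpus file name is a placeholder), training with a 16k vocabulary only requires changing the flags; --model_type selects among unigram (the default), bpe, char and word:

import sentencepiece as spm

# Hypothetical corpus path; 16k subwords is within the commonly recommended range.
spm.SentencePieceTrainer.Train('--input=corpus.txt --model_prefix=m16k --vocab_size=16000 --model_type=bpe')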

Like to learn?

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related work. Feel free to connect with me on LinkedIn or follow me on Medium or Github.

Reference

  • Rico Sennrich, Barry Haddow and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. 2016.
  • Mike Schuster and Kaisuke Nakajima. Japanese and Korean Voice Search. 2012.
  • Taku Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. 2018.
  • Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. 2018.