Classic word representations cannot handle unseen or rare words well. Character embeddings are one solution to the out-of-vocabulary (OOV) problem. However, they may be too fine-grained and miss some important information. Subwords sit between words and characters: they are not too fine-grained, yet they can still handle unseen and rare words.
For example, we can split “subword” into “sub” and “word”. In other words, we use two vectors (i.e. “sub” and “word”) to represent “subword”. You may argue that this takes more resources to compute, but in practice the footprint is smaller than with word representations, because a small subword vocabulary can be recombined to cover many words.
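To make the idea concrete, here is a minimal sketch of how a handful of subwords can cover several words. The vocabulary `SUBWORDS` and the greedy longest-match helper `segment` are assumptions made up for illustration; they are not the procedure of any particular tokenizer.

```python
# Hypothetical toy vocabulary; real subword vocabularies are learned from a corpus.
SUBWORDS = {"sub", "word", "way", "marine"}


def segment(word, vocab):
    """Greedy longest-match split. Returns None if part of the word is not covered."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return None  # no subword in the vocabulary matches at this position
    return pieces


for w in ["subword", "subway", "submarine"]:
    print(w, segment(w, SUBWORDS))
# subword ['sub', 'word']
# subway ['sub', 'way']
# submarine ['sub', 'marine']
```

Four entries are enough to represent three different words here, whereas a word-level vocabulary would need one entry per word.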
This story discusses SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo and Richardson, 2018) and then goes through different subword algorithms. The following will be covered:
- Byte Pair Encoding (BPE)
- Unigram Language Model
Byte Pair Encoding (BPE)
Sennrich et al. (2016) proposed using Byte Pair Encoding (BPE) to build a subword dictionary. Radford et al. adopted BPE to construct the subword vectors used to build GPT-2 in 2019. The algorithm works as follows:
1. Prepare a large enough training corpus.
2. Define the desired subword vocabulary size.
3. Split each word into a sequence of characters, append the suffix “</w>” to the end of the word, and keep the word frequency. The basic unit at this stage is the character. For example, if the frequency of “low” is 5, we rewrite it as “l o w </w>”: 5.
4. Generate a new subword by merging the most frequent pair of adjacent units.
5. Repeat step 4 until the subword vocabulary size defined in step 2 is reached, or until the most frequent remaining pair occurs only once.
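The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation: the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are made up here, the toy word frequencies other than “low”: 5 are assumed values, and ties between equally frequent pairs are broken by whichever pair was seen first, which is also an assumption.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, vocab_size):
    """Return the learned merges and the resulting set of subword symbols."""
    # Step 3: split words into characters and append the end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    symbols = {s for word in vocab for s in word}
    merges = []
    # Step 5: stop at the target size or when no pair occurs more than once.
    while len(symbols) < vocab_size:
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins (assumption)
        if pairs[best] <= 1:
            break
        # Step 4: merge the most frequent pair into a new subword.
        vocab = merge_pair(best, vocab)
        symbols.add(best[0] + best[1])
        merges.append(best)
    return merges, symbols


# Toy corpus: "low": 5 comes from the text; the other counts are assumed.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, symbols = learn_bpe(word_freqs, vocab_size=14)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The toy corpus starts with 11 distinct symbols (10 characters plus “</w>”), so a target size of 14 allows exactly three merges, which build up the frequent suffix “est</w>”.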