Classic word representations cannot handle unseen or rare words well. Character embeddings are one solution to the out-of-vocabulary (OOV) problem. However, they may be too fine-grained and miss some important information. Subwords sit between words and characters: they are not too fine-grained, yet they can still handle unseen and rare words.
For example, we can split “subword” into “sub” and “word”. In other words, we use two vectors (i.e. “sub” and “word”) to represent “subword”. You may argue that this takes more resources to compute, but in practice the footprint is smaller than with word representations, because a relatively small set of subwords can be combined to cover a much larger set of words.
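To make the split concrete, here is a minimal sketch in Python. It assumes a tiny made-up vocabulary (`TOY_VOCAB`) and a hypothetical helper `segment` that does a greedy longest-match split. It only illustrates the idea of representing a word by its pieces and is not the segmentation procedure of any of the algorithms covered in this story.

```python
# Toy subword vocabulary. These entries are made up for illustration only.
TOY_VOCAB = {"sub", "word", "way", "s", "u", "b", "w", "o", "r", "d", "a", "y"}


def segment(word, vocab):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


print(segment("subword", TOY_VOCAB))  # two pieces: "sub" and "word"
print(segment("subway", TOY_VOCAB))   # reuses "sub" for a word not stored as a whole
```

Note that neither “subword” nor “subway” is stored in the vocabulary as a whole word. Both are built from shared pieces, which is where the smaller footprint comes from.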
This story discusses SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) and then walks through the different subword algorithms. The following will be covered:
- Byte Pair Encoding (BPE)
- WordPiece
- Unigram Language Model
- SentencePiece
Byte Pair Encoding (BPE)
Sennrich et al. (2016) proposed using Byte Pair Encoding (BPE) to build a subword dictionary. Radford et al. adopted BPE to…