Subword Techniques for Neural Machine Translation

Rashmini Naranpanawa · Published in Analytics Vidhya · Mar 12, 2021

Neural Machine Translation (NMT) is the current state-of-the-art machine translation technique and produces fluent translations. However, NMT models suffer from the out-of-vocabulary (OOV) and rare word problems, which degrade translation quality. OOV words are words that do not appear in the training corpus, and rare words are words that appear only a few times. When such unknown words are translated, they are replaced with UNK tokens. These meaningless tokens break the sentence structure and increase ambiguity, so the translations become worse.

Character segmentation is a technique used in machine translation to avoid the drawbacks of word-level translation. Its major advantage is that it can model any composition of characters, which enables better modelling of rare morphological variants. However, the improvements may not be significant, because the character level is so fine-grained that important information about word structure is lost.

To alleviate these issues, Sennrich et al. (2016) introduced the concept of segmenting words into sequences of subword units, which provides a more meaningful representation. As an example of subword segmentation, consider the word “looked”. It can be split into “look” and “ed”; in other words, two vectors are used to represent “looked”. Therefore, even if this word is unknown, the model can still translate it accurately by treating it as a sequence of subword units.

With the advancement of natural language processing, various subword segmentation algorithms have been proposed. The following subword techniques are described in this article.

  1. Byte Pair Encoding (BPE)
  2. Unigram Language Model
  3. Subword Sampling
  4. BPE-dropout

Byte Pair Encoding (BPE)

Sennrich et al. (2016) proposed this word segmentation technique, which is based on the Byte Pair Encoding compression algorithm. It is an effective approach for making an NMT model capable of translating rare and unknown words. It splits words into sequences of characters and iteratively merges the most frequent pair of symbols into a single new symbol.

The following are the steps of the BPE algorithm for obtaining subwords.

Step 1: Initialize the vocabulary

Step 2: For each word in the vocabulary, append an end-of-word token </w>

Step 3: Split the words into characters

Step 4: In each iteration, find the most frequent character pair, merge it into one token, and add this new token to the vocabulary

Step 5: Repeat Step 4 until the desired number of merge operations is completed or the desired vocabulary size is reached

Learn BPE operations (Sennrich et al., 2016)
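As a runnable illustration of these steps, here is a minimal Python sketch of the BPE merge loop, closely following the snippet given in Sennrich et al. (2016); the toy word-frequency dictionary and the number of merges are illustrative values.

```python
import collections
import re


def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the current vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}


# Toy corpus: words are split into characters, with the end-of-word token </w> appended.
vocab = {'l o o k e d </w>': 5, 'l o o k i n g </w>': 6,
         'w a l k e d </w>': 3, 'w a l k i n g </w>': 2}

num_merges = 10  # desired number of merge operations
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair in this iteration
    vocab = merge_vocab(best, vocab)
    print(best)  # each printed pair becomes a new token in the subword vocabulary
```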

Unigram Language Model

Kudo (2018) proposed the unigram language model based subword segmentation algorithm, which outputs multiple subword segmentations along with their probabilities. The model assumes that each subword occurs independently, so the probability of a subword sequence x = (x1, …, xM) is obtained by multiplying the subword occurrence probabilities p(xi).

Here, V is a predetermined vocabulary. The most probable segmentation x* for a sentence X is the candidate segmentation that maximizes this probability.

S(X) denotes the set of segmentation candidates built from the sentence X, and x* is found with the Viterbi algorithm.

Subword occurrence probabilities p(xi) are estimated using the expectation-maximization (EM) algorithm by maximizing the following likelihood L.
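For reference, the equations this section points to can be written out as follows (reconstructed from Kudo (2018); D denotes the set of training sentences):

```latex
% Probability of a subword sequence under the unigram language model
P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i),
  \qquad \forall i \;\; x_i \in \mathcal{V},
  \qquad \sum_{x \in \mathcal{V}} p(x) = 1

% Most probable segmentation of a sentence X
\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x} \in S(X)} P(\mathbf{x})

% Likelihood maximized by the EM algorithm
\mathcal{L} = \sum_{s=1}^{|D|} \log P\bigl(X^{(s)}\bigr)
            = \sum_{s=1}^{|D|} \log \Bigl( \sum_{\mathbf{x} \in S(X^{(s)})} P(\mathbf{x}) \Bigr)
```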

The following steps describe the procedure for obtaining a vocabulary V of the desired size.

Step 1: Initialize a reasonably big seed vocabulary.

Step 2: Define a desired vocabulary size.

Step 3: Optimize the subword occurrence probabilities using the EM algorithm by fixing the vocabulary.

Step 4: Compute the loss for each subword. The loss of a subword is the amount by which the likelihood L defined above decreases when that subword is removed from the vocabulary.

Step 5: Sort the subwords by loss and keep the top n% of subwords. Always keep subwords consisting of a single character to avoid the out-of-vocabulary problem.

Step 6: Repeat steps 3 to 5 until the vocabulary reaches the desired size defined in step 2.

The most common way to prepare the seed vocabulary is to use the most frequent substrings and characters in the corpus. The vocabulary produced by this unigram language model based subword segmentation therefore consists of characters, subwords and words.
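This procedure is implemented in the SentencePiece library [4]. A minimal sketch of training a unigram segmentation model with its Python API is shown below; the corpus file name and vocabulary size are placeholder values.

```python
import sentencepiece as spm

# Train a unigram language model segmenter on a raw text corpus.
# 'corpus.txt' and vocab_size=8000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='unigram',
    model_type='unigram',  # unigram language model based segmentation
    vocab_size=8000,
)

# Load the trained model and segment a sentence deterministically (Viterbi).
sp = spm.SentencePieceProcessor(model_file='unigram.model')
print(sp.encode('the model looked promising', out_type=str))
```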

Subword Sampling

In this technique, the models are trained with multiple subword segmentations based on a unigram language model, and these segmentations are sampled probabilistically during training. L-best segmentation is one approach to approximate sampling: first the l-best segmentations are obtained, and then one segmentation is sampled from them.

Subword regularization has two hyperparameters: the size of the sampling candidates (l) and the smoothing constant (α). Theoretically, setting l→∞ means considering all possible segmentations, but this is infeasible because the number of possible segmentations increases exponentially with the sentence length. Therefore, the Forward-Filtering and Backward-Sampling algorithm is used for sampling. Furthermore, if α is small, the distribution is more uniform, and if α is large, it tends towards the Viterbi segmentation.
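SentencePiece exposes these hyperparameters in its encoder, so a minimal sketch of on-the-fly subword sampling, assuming the unigram model trained above, looks like this; nbest_size=-1 requests sampling from all candidates via Forward-Filtering and Backward-Sampling, and alpha is the smoothing constant.

```python
import sentencepiece as spm

# Reuse the unigram model from the previous sketch (placeholder file name).
sp = spm.SentencePieceProcessor(model_file='unigram.model')

# With enable_sampling=True, each call may return a different segmentation.
# nbest_size=-1: sample from all candidates (Forward-Filtering and Backward-Sampling).
# alpha: smoothing constant; small values flatten the distribution,
# large values move it towards the Viterbi segmentation.
for _ in range(3):
    print(sp.encode('looked', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```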

BPE-dropout

BPE-dropout is an effective subword regularization method based on BPE, which produces multiple segmentations for a particular word. It keeps the original BPE vocabulary and merge table while changing the segmentation procedure: at each merge step, some merges are randomly dropped with probability p, which yields different segmentations of the same word. The following algorithm describes the procedure.

Algorithm of BPE-dropout (Provilkov et al., 2020)

If the dropout probability is zero, the segmentation is identical to the original BPE. If it is one, the segmentation is equal to character segmentation. Varying the probability between 0 and 1 produces multiple segmentations of various granularities. Because this method exposes the model to a variety of subword segmentations, it develops a better understanding of words and subwords. BPE-dropout is also simple to use: no segmentation model other than standard BPE needs to be trained, and inference uses the standard BPE.
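To make the procedure concrete, here is a minimal Python sketch of the idea (not the authors' reference implementation): a word is segmented with a toy merge table, and each candidate merge is skipped with probability p at every step.

```python
import random


def bpe_dropout_segment(word, merges, p=0.1):
    """Segment a word with BPE-dropout.

    `merges` maps a symbol pair to its priority (lower = learned earlier).
    At each step, every applicable merge is dropped with probability p;
    the highest-priority surviving merge is then applied.
    """
    symbols = list(word) + ['</w>']
    while True:
        candidates = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merges and random.random() >= p
        ]
        if not candidates:  # no applicable merges survived: stop
            break
        _, i = min(candidates)  # apply the highest-priority (lowest-rank) merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Illustrative merge table (pair -> priority), not learned from real data.
merges = {('l', 'o'): 0, ('lo', 'o'): 1, ('loo', 'k'): 2,
          ('e', 'd'): 3, ('ed', '</w>'): 4}
print(bpe_dropout_segment('looked', merges, p=0.1))
# e.g. ['look', 'ed</w>'] with p=0, finer splits as p grows, characters at p=1
```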

This article explored various subword techniques for improving Neural Machine Translation. A sample implementation of a Transformer-based NMT model, which applies BPE and unigram language model based subword sampling using the SentencePiece library, can be found here.

References

[1] R. Sennrich, B. Haddow, and A. Birch, Neural Machine Translation of Rare Words with Subword Units (2016), 54th Annual Meeting of the Association for Computational Linguistics

[2] T. Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (2018), 56th Annual Meeting of the Association for Computational Linguistics

[3] I. Provilkov, D. Emelianenko and E. Voita, BPE-Dropout: Simple and Effective Subword Regularization (2020), 58th Annual Meeting of the Association for Computational Linguistics

[4] T. Kudo and J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018), Conference on Empirical Methods in Natural Language Processing (System Demonstrations)
