Two minutes NLP — Most used Decoding Methods for Language Models
Greedy Search, Beam Search, Top-k Sampling, and Nucleus Sampling
In recent years, we have seen the rise of large transformer-based language models trained on millions of web pages, such as OpenAI’s GPT-3 model. The results that these models achieve on conditioned, open-ended language generation, and the applications built on top of them, are very impressive.
To generate high-quality text from these language models, several decoding methods have been proposed. In this article, you’ll see an overview of different decoding strategies: Greedy search, Beam search, Sampling, Top-K sampling, and Top-p (nucleus) sampling. You can use the Hugging Face transformers library to easily test these decoding methods.
Greedy Search
Greedy search selects the word with the highest probability from the language model as the next word. Suppose the language model is predicting a possible continuation of a sentence starting with “The”, choosing between the words “red” and “cat” with their associated probabilities.
Greedy search always chooses the word with the highest probability, which is “cat”.
Subsequently, greedy search chooses the next word with the highest probability, which is “is”.
The major drawback of greedy search is that it is not guaranteed to produce the highest-probability sentence, as it misses high-probability words hidden behind low-probability words. Indeed, in our example, the sentence with the highest probability is “The red fox” (0.4 * 0.9 = 0.36), as opposed to the greedily chosen “The cat is” (0.6 * 0.5 = 0.30).
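As a quick sketch of how to try this with the Hugging Face transformers library (GPT-2 is used here only as a small example model, and the prompt and generation length are arbitrary choices), greedy search is the default behavior of generate. The later snippets in this article reuse the model, tokenizer, and input_ids defined here.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is used here only as a small, convenient example model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The prompt from the toy example; any prompt works.
input_ids = tokenizer("The", return_tensors="pt").input_ids

# Greedy search is the default decoding method: no sampling, a single beam.
greedy_output = model.generate(input_ids, max_new_tokens=15, do_sample=False, num_beams=1)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```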
Beam Search
Beam search addresses this problem by keeping a fixed number of the most likely hypotheses (a.k.a. beams) at each time step and eventually choosing the hypothesis that has the overall highest probability.
Suppose we are performing a beam search with two beams. At the first timestep, beam search would consider both the words “red” and “cat”, along with their probabilities.
At the second timestep, beam search considers all the possible continuations of the two beams and keeps only the two with the highest probabilities. All the possible continuations with their probabilities are:
- The red cat: 0.4 * 0.1 = 0.01
- The red fox: 0.4 * 0.9 = 0.36 (highest)
- The cat drives: 0.6 * 0.2 = 0.12
- The cat drinks: 0.6 * 0.3 = 0.18
- The cat is: 0.6 * 0.5 = 0.30 (second highest)
Thus, the two beams at the next timestep will be “The red fox” and “The cat is”.
If we want to end the beam search here, our predicted result would be the beam with the highest probability, which is “The red fox”.
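To make the bookkeeping concrete, here is a tiny plain-Python sketch of this two-beam search over the toy probabilities above (the words and numbers are the made-up ones from the example, not the output of a real model):

```python
# Toy next-word probabilities from the example above (made up for illustration).
step1 = {"red": 0.4, "cat": 0.6}
step2 = {
    "red": {"cat": 0.1, "fox": 0.9},
    "cat": {"drives": 0.2, "drinks": 0.3, "is": 0.5},
}

num_beams = 2

# Timestep 1: keep the two most likely first words.
beams = sorted(step1.items(), key=lambda kv: kv[1], reverse=True)[:num_beams]

# Timestep 2: expand every beam, score each continuation by the product of
# its word probabilities, and keep the two best hypotheses overall.
candidates = [
    (f"The {word} {next_word}", prob * next_prob)
    for word, prob in beams
    for next_word, next_prob in step2[word].items()
]
beams = sorted(candidates, key=lambda kv: kv[1], reverse=True)[:num_beams]

print(beams)  # "The red fox" (~0.36) and "The cat is" (0.30)
```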
It has been observed that beam search results are more fluent than those of greedy search, but the output often includes repetitions of the same word sequences. A simple remedy is to penalize the generation of n-grams that already appear in the generated text. Nevertheless, n-gram penalties have to be used with care. For example, an article generated about the city “New York” should not use a 2-gram penalty, otherwise the name of the city would appear only once in the whole text.
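In transformers, reusing the model, tokenizer, and input_ids from the greedy snippet above, beam search and the n-gram penalty are controlled by the num_beams and no_repeat_ngram_size arguments (the values below are only illustrative):

```python
# Beam search with 5 beams; forbid any 2-gram from appearing twice.
beam_output = model.generate(
    input_ids,
    max_new_tokens=15,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```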
In general, it is hard to fine-tune n-gram penalties, and it turns out that high-quality human language does not follow a distribution of high-probability next words.
As a consequence, other decoding methods have been developed to address these issues.
Sampling
Sampling means randomly picking the next word according to the conditional word probability distribution extracted from the language model. As a consequence, with this decoding method, text generation is not deterministic.
Sampling directly from the probabilities extracted from the language model often leads to incoherent text. A common trick is to make the probability distribution sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by dividing the logits by a temperature parameter before applying the softmax: temperatures below 1 make the distribution sharper, while temperatures above 1 make it smoother. With this trick, the output is generally more coherent.
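Continuing with the model and input_ids from the greedy snippet, sampling is enabled with do_sample=True, and temperature controls the sharpness of the distribution (0.7 below is an arbitrary illustrative value):

```python
# Pure sampling from the full distribution, with a lowered temperature.
sample_output = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.7,
    top_k=0,  # disable the default top-k filter to sample from the whole vocabulary
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```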
Top-K Sampling
In Top-K sampling, the K most likely next words are kept and the next word is then sampled from these K words only.
The produced text often sounds more human than the text produced with the other methods seen so far. One concern with Top-K sampling, though, is that it does not dynamically adapt the number of words kept from the next-word probability distribution, as K is fixed. As a consequence, very unlikely words may end up among the K candidates if the distribution is very sharp, while reasonable words may be cut off if it is flat.
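With the same model and input_ids as before, Top-K sampling only needs the top_k argument of generate (50 below is just one common choice):

```python
# Top-K sampling: keep only the 50 most likely next words, then sample among them.
top_k_output = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    top_k=50,
)
print(tokenizer.decode(top_k_output[0], skip_special_tokens=True))
```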
To address this limitation, Top-p sampling was created.
Top-p (Nucleus) Sampling
Top-p sampling (or nucleus sampling) chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. This way, the number of words in the set can dynamically increase and decrease according to the next word probability distribution.
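Again with the same model and input_ids, Top-p sampling is controlled by the top_p argument (0.92 below is just an illustrative threshold; top_p and top_k can also be combined):

```python
# Top-p (nucleus) sampling: sample from the smallest set of words whose
# cumulative probability exceeds p.
top_p_output = model.generate(
    input_ids,
    max_new_tokens=15,
    do_sample=True,
    top_p=0.92,
    top_k=0,  # disable top-k so that only the nucleus filter is applied
)
print(tokenizer.decode(top_p_output[0], skip_special_tokens=True))
```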
While Top-p sampling seems more elegant than Top-K sampling, both methods work well in practice.
Conclusion
Top-p sampling and Top-K sampling seem to produce more fluent text than greedy search and beam search. However, there has recently been some evidence that greedy search and beam search perform better if the language model is trained with a different objective. It also looks like Top-K and Top-p sampling suffer from generating repetitive word sequences as well.
This suggests that we are still figuring out the best way to train language models and to produce text that is as human-like as possible.