Two minutes NLP — Most used Decoding Methods for Language Models

Greedy Search, Beam Search, Top-k Sampling, and Nucleus Sampling

Fabio Chiusano
NLPlanet



In recent years, we have seen the rise of large transformer-based language models trained on millions of web pages, such as OpenAI’s GPT-3. The results these models achieve on conditioned, open-ended language generation, and the applications built on top of them, are very impressive.

To generate high-quality text from these language models, several decoding methods have been proposed. In this article, you’ll see an overview of different decoding strategies: Greedy search, Beam search, Sampling, Top-K sampling, and Top-p (nucleus) sampling. You can use the Hugging Face transformers library to easily test these decoding methods.

Greedy Search

Greedy search selects from the language model the word with the highest probability as the next word. Suppose the language model is predicting a possible continuation to a sentence starting with “The”, choosing from the words “red” and “cat” with their associated probabilities.

Example of word probabilities predicted from a language model. Image by the author.

Greedy search always chooses the word with the highest probability, which is “cat”.

Example of word probabilities predicted from a language model, highlighting the choice made by greedy search at the first timestep. Image by the author.

Subsequently, greedy search chooses the next word with the highest probability, which is “is”.

Example of word probabilities predicted from a language model, highlighting the choice made by greedy search at the second timestep. Image by the author.

The major drawback of greedy search is that it can miss high-probability sentences, because a high-probability word may be hidden behind a low-probability one. Indeed, in our example, the sentence with the highest probability is “The red fox” (0.4 * 0.9 = 0.36), as opposed to “The cat is” (0.6 * 0.5 = 0.30).
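Greedy decoding can be sketched in a few lines of plain Python. The probability tables below are the hypothetical values from the running example, not the output of a real language model:

```python
# Hypothetical next-word probabilities from the running example (not a real model).
NEXT_WORD_PROBS = {
    "The": {"red": 0.4, "cat": 0.6},
    "The red": {"cat": 0.1, "fox": 0.9},
    "The cat": {"drives": 0.2, "drinks": 0.3, "is": 0.5},
}

def greedy_decode(prefix, steps):
    """At each timestep, append the single most probable next word."""
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[prefix]
        prefix += " " + max(candidates, key=candidates.get)
    return prefix

print(greedy_decode("The", 2))  # → "The cat is"
```

Note how the sketch commits to “cat” at the first step and can never recover the higher-probability “The red fox”.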

Beam Search

Beam search addresses this problem by keeping several of the most likely hypotheses (a.k.a. beams) at each timestep and eventually choosing the hypothesis that has the overall highest probability.

Suppose we are performing a beam search with two beams. At the first timestep, beam search would consider both the words “red” and “cat”, along with their probabilities.

Example of word probabilities predicted from a language model, highlighting the two beams produced by beam search at the first timestep. Image by the author.

At the second timestep, beam search will consider all the possible continuations of the two beams and keep only the two of them with the highest probabilities. All the possible continuations with their probabilities are:

  • The red cat: 0.4 * 0.1 = 0.04
  • The red fox: 0.4 * 0.9 = 0.36 (highest)
  • The cat drives: 0.6 * 0.2 = 0.12
  • The cat drinks: 0.6 * 0.3 = 0.18
  • The cat is: 0.6 * 0.5 = 0.30 (second highest)

Thus, the two beams at the next timestep will be “The red fox” and “The cat is”.

Example of word probabilities predicted from a language model, highlighting the two beams produced by beam search at the second timestep. Image by the author.

If we want to end the beam search here, our predicted result would be the beam with the highest probability, which is “The red fox”.

Example of word probabilities predicted from a language model, highlighting the final prediction made by beam search. Image by the author.
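The same toy example can illustrate beam search. This is a minimal sketch with the hypothetical probabilities from the figures, not the implementation used by real libraries:

```python
# Hypothetical next-word probabilities from the running example (not a real model).
NEXT_WORD_PROBS = {
    "The": {"red": 0.4, "cat": 0.6},
    "The red": {"cat": 0.1, "fox": 0.9},
    "The cat": {"drives": 0.2, "drinks": 0.3, "is": 0.5},
}

def beam_search(prefix, steps, num_beams=2):
    """Keep the num_beams most probable hypotheses at each timestep."""
    beams = [(prefix, 1.0)]
    for _ in range(steps):
        # Expand every beam with every possible next word.
        candidates = [
            (f"{text} {word}", p * wp)
            for text, p in beams
            for word, wp in NEXT_WORD_PROBS.get(text, {}).items()
        ]
        # Keep only the num_beams hypotheses with the highest probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

best, runner_up = beam_search("The", 2)
print(best[0])  # the most probable hypothesis: "The red fox"
```

Unlike greedy search, the sketch keeps “The red” alive as a second beam, which lets it find “The red fox” (probability 0.36) at the next step.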

It has been observed that beam search results are more fluent than those of greedy search, but the output often includes repetitions of the same word sequences. A simple remedy is to penalize results whenever they contain n-grams already present in the predicted text. Nevertheless, n-gram penalties have to be used with care. For example, an article generated about the city “New York” should not use a 2-gram penalty, otherwise the name of the city would only appear once in the whole text.

In general, n-gram penalties are hard to tune, and it turns out that high-quality human language does not simply follow a distribution of high-probability next words.

As a consequence, other decoding methods have been developed to address these issues.

Sampling

Sampling means randomly picking the next word according to the conditional probability distribution produced by the language model. As a consequence, text generation with this decoding method is not deterministic.

Directly sampling from the probabilities extracted from the language model often leads to incoherent text. A common trick is to make the probability distribution sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by dividing the logits by a temperature parameter below 1 before applying the softmax; temperatures above 1 make the distribution smoother instead. With this trick, the output is generally more coherent.
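The temperature trick can be sketched as follows. The logits here are made-up numbers for three candidate words, not real model outputs:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens
    the distribution, temperature > 1 smooths it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for three candidate words
print(softmax_with_temperature(logits, temperature=0.5))  # sharper distribution
print(softmax_with_temperature(logits, temperature=2.0))  # smoother distribution
```

Sampling then draws the next word from the resulting distribution, for example with `random.choices`.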

Top-K Sampling

In Top-K sampling, only the K most likely next words are kept, and the next word is sampled among these K words only.
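The filtering step can be sketched as follows, using a made-up next-word distribution:

```python
import random

def top_k_filter(word_probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

# Made-up next-word distribution (not a real model's output).
word_probs = {"fox": 0.5, "cat": 0.2, "dog": 0.15, "tree": 0.1, "pizza": 0.05}
filtered = top_k_filter(word_probs, k=3)
print(filtered)  # only "fox", "cat", "dog" remain, renormalized
next_word = random.choices(list(filtered), weights=list(filtered.values()))[0]
```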

The produced text is often more human-sounding than the text produced with the other methods seen so far. One concern though with Top-K sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution, as K is fixed. As a consequence, very unlikely words may be selected among these K words if the next word probability distribution is very sharp.

To address this, Top-p sampling was introduced.

Top-p (nucleus) sampling

Top-p sampling (or nucleus sampling) chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. This way, the number of words in the set can dynamically increase and decrease according to the next word probability distribution.
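A minimal sketch of the nucleus filtering step, with two made-up distributions that show how the size of the kept set adapts:

```python
def top_p_filter(word_probs, p):
    """Keep the smallest set of most probable words whose cumulative
    probability reaches p, then renormalize."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {word: q / total for word, q in kept.items()}

# Made-up distributions: one sharp, one flat (not real model outputs).
sharp = {"fox": 0.9, "cat": 0.05, "dog": 0.03, "tree": 0.02}
flat = {"fox": 0.3, "cat": 0.3, "dog": 0.25, "tree": 0.15}
print(len(top_p_filter(sharp, 0.9)))  # 1: a single word already covers p
print(len(top_p_filter(flat, 0.9)))   # 4: many words are needed to cover p
```

With a sharp distribution the nucleus shrinks to a single word, while with a flat distribution it grows, which is exactly the adaptivity that fixed-K filtering lacks.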

While Top-p sampling seems more elegant than Top-K sampling, both methods work well in practice.

Conclusion

Top-p sampling and top-K sampling seem to produce more fluent text than greedy search and beam search. However, there has recently been some evidence that greedy and beam search perform better if the language model is trained with a different objective. Also, it looks as though top-K and top-p sampling also suffer from generating repetitive word sequences.

This suggests that we are still figuring out the best way to train language models and to produce text that is as human-like as possible.
