Decoding strategies in language modelling

Thomas
3 min read · Jan 28, 2022


Language models can calculate the probability that any given word in their vocabulary will come after the input sequence. But how do they choose which one to actually output?

How does the decoder choose the words “please”, “come” and “here” given the probability of every word in its vocabulary? Source: GPT-3, RNNs and all that: A deep-dive into language modelling.

At a high level, modern language models work like this: you input a sequence of words, and an “encoder network” processes them one by one. The output of that encoder network is then passed to a “decoder network”, which outputs a sequence of words. At each step, the decoder network assigns a probability to each of the words in the vocabulary it knows. But given those probabilities, the model still needs a rule for picking which word to actually output. That rule is known as a decoding strategy.
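To make this concrete, here is a minimal sketch of the step the decoder performs before any strategy is applied: turning its raw scores (logits) into a probability distribution over the vocabulary with a softmax. The words and scores below are made up purely for illustration.

```python
import math

def softmax(logits):
    """Turn raw decoder scores into a probability distribution over words."""
    m = max(logits.values())
    # Subtract the max score before exponentiating, for numerical stability.
    exps = {w: math.exp(s - m) for w, s in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical decoder scores for the next word, given some input sequence.
logits = {"please": 2.1, "come": 1.3, "here": 0.4, "banana": -3.0}
probs = softmax(logits)  # probabilities over the vocabulary, summing to 1
```

Every decoding strategy below starts from a distribution like `probs` and differs only in how it picks a word from it.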

The simplest one would be a greedy strategy: at each step, simply output the most likely word.

So at the first green node, the decoder RNN would assign a probability to each word in its vocabulary and simply take the most likely one. This is a fine strategy, and it tends to produce coherent output. But if the application is something like a chatbot, you may find that it produces quite boring conversation. For example:

Input: What do you think about this book?
Greedy strategy output: Yeah it’s fine.
Better output: I found it super interesting.
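As a rough sketch, greedy decoding is just an argmax at each step. Here `model` is a hypothetical stand-in for the decoder network: any function that maps the sequence so far to a dict of next-word probabilities, like the softmax output above.

```python
def greedy_decode(model, prefix, length):
    """At each step, append the single most likely next word."""
    output = list(prefix)
    for _ in range(length):
        probs = model(output)                   # next-word distribution
        best_word = max(probs, key=probs.get)   # argmax over the vocabulary
        output.append(best_word)
    return output
```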

A simple variation on greedy search is top-k sampling: pick a number k, take the k most likely words, renormalise their probabilities so they sum to 1, and sample the next word from that distribution. So instead of always taking the single most likely word at each step, you choose among the k most likely words, in proportion to their renormalised probabilities. For example, if k=5 and the renormalised probabilities are [0.45, 0.25, 0.2, 0.05, 0.05], you would pick the most likely word 45% of the time, the second most likely 25% of the time, and so on. This is what GPT-J, EleutherAI’s open-source alternative to GPT-3, does.
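Here is a small sketch of top-k sampling, with the same made-up numbers. `random.choices` does the weighted draw once the top-k probabilities have been renormalised.

```python
import random

def top_k_sample(probs, k):
    """Sample the next word from the k most likely candidates."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, p = zip(*top)
    total = sum(p)
    weights = [x / total for x in p]  # renormalised so they sum to 1
    return random.choices(words, weights=weights)[0]

# With k=5 and renormalised weights [0.45, 0.25, 0.2, 0.05, 0.05],
# the most likely word is chosen 45% of the time, and so on.
```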

Ideally we would compute the likelihood of every single possible output sentence (sequence of words). This is exhaustive search: if you want an output of five words, enumerate all possible 5-word sentences, assign each a probability, and choose the most likely one. The problem is that if there are N words in your vocabulary, you need to consider N⁵ possible outputs. A typical value of N is 10,000, so that is 10²⁰ candidates: far too many to evaluate.
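For a toy vocabulary, exhaustive search can be written directly. The sketch below scores every candidate continuation by multiplying the step-by-step probabilities (the chain rule), using the same hypothetical `model` function as above; with a real 10,000-word vocabulary this loop would never finish.

```python
from itertools import product

def exhaustive_search(model, prefix, vocab, length):
    """Score every possible `length`-word continuation and keep the best.

    Feasible only for toy vocabularies: len(vocab) ** length candidates.
    """
    best_seq, best_prob = None, 0.0
    for candidate in product(vocab, repeat=length):
        prob, seq = 1.0, list(prefix)
        for word in candidate:
            prob *= model(seq)[word]  # chain rule: multiply conditionals
            seq.append(word)
        if prob > best_prob:
            best_seq, best_prob = list(candidate), prob
    return best_seq
```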

Beam search with beam width B=3. For each possible word in the vocabulary {e, a, b} we consider the three most likely continuations. Edited from source.

One way to take this idea and make it computationally tractable is beam search, which combines ideas from the strategies above. Pick a beam width B. At each step, keep only the B most likely partial outputs: extend each of the current B partial outputs with its B most likely next words, compute the probability of each extended candidate, and prune back down to the best B. Repeat until the output reaches the desired length, then return the most likely of the B complete outputs. This way the number of candidates under consideration stays fixed at B per step instead of growing exponentially.
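A sketch of beam search under the same assumptions, keeping only the B most likely partial sequences after each step:

```python
def beam_search(model, prefix, length, B):
    """Track the B most likely partial sequences at each step."""
    # Each beam is a (sequence, probability) pair.
    beams = [(list(prefix), 1.0)]
    for _ in range(length):
        candidates = []
        for seq, prob in beams:
            probs = model(seq)
            # Extend each beam with its B most likely next words.
            for word in sorted(probs, key=probs.get, reverse=True)[:B]:
                candidates.append((seq + [word], prob * probs[word]))
        # Prune back down to the B best candidates overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]  # most likely complete sequence
```

With B=1 this reduces to greedy search, and as B approaches the vocabulary size it approaches exhaustive search, so the beam width trades output quality against compute.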

In summary:

  • Decoding strategies are how language models choose which words to output once they know the probabilities.
  • Popular decoding strategies include greedy search, top-k sampling, exhaustive search, and beam search.


