Decoding Strategies for Decoder-only Models (GPT)

Shravan Kumar
11 min read · Dec 10, 2023


In my previous blogs we covered an overall view of the Generative Pretrained Transformer, as well as Generative Pretrained Transformer (GPT) — Pre-training, Fine Tuning & Different Use Case Applications. Now let us look at the decoding strategies for decoder-only models.

Decoding Strategies

In the previous blog we looked at the transformer as a function that takes inputs and generates the next tokens autoregressively, i.e., at every step it feeds its own outputs back in as inputs in order to generate the next output.

Training worked in the same manner: we showed the model some text for which we knew the next word, asked it to predict the next token, and backpropagated the loss based on the probability assigned to the correct token. This idea of next-token prediction can be applied iteratively to generate as many tokens as we want, perhaps even full stories.

Take, for example, the prompt "Write a story which starts with once upon a time". This entire prompt becomes the first k tokens given to the model, and from that timestep onwards the model generates tokens until we are satisfied or until it reaches the end-of-sequence token <eos>.

Given that the model has been trained to predict the next token, we do something additional called "instruction fine-tuning". We now want the model to work in scenarios where we give it certain inputs and it continues from there with an answer: whatever problem is given, it should solve it, and if a paragraph is given with a request to summarize, it should summarize.

The initial fine-tuning tasks, such as predicting sentiment or deciding whether two sentences are similar, are a little easier than what we see in modern LLM applications, which are more creative (write a poem, write a resume, build a website, and so on). These are the things that have surprised us.

At present we do not fully understand how these advanced LLMs (large language models) produce such precise and creative output, but what we are looking at here is the decoding part: how the next word is predicted. One thing we do know is that if we always select the maximum-probability token, we will get the same output every time, since that is a deterministic procedure. So let us look at a few decoding strategies: deterministic ones, which always give the same output, and stochastic ones, which can produce a different, more creative output each time.

Exhaustive Search:

Suppose that we want to generate a sequence of 5 words with the vocabulary {cold, coffee, I, like, water, <stop>}.

Exhaustive search enumerates all possible sequences, computes their associated probabilities, and outputs the sequence with the highest probability. Some example sequences:

  • I like cold water
  • I like cold coffee
  • coffee like cold coffee
  • I like I like
  • coffee coffee coffee coffee

So for each of these sentences, the output probability is given by the chain rule:

P(x1, x2, …, xn) = P(x1) · P(x2 | x1) · … · P(xn | x1, x2, …, xn−1)

As this is an exhaustive search, we find every possible sequence during the decoding process: at each timestep we pass all the words in the vocabulary through the model.

As there are 6 words here, we get a distribution over these 6 words at every step, as shown below.

If one of the sampled sequences is "I like cold coffee <stop>",

the total probability of that sequence equals

P(I) · P(like | I) · P(cold | I, like) · P(coffee | I, like, cold) · P(<stop> | I, like, cold, coffee)

Other combinations of sequences follow the same pattern, and we output the one with the maximum probability; this probability calculation is done over all the tokens at every time step.

So, based on this exhaustive search, let us assume these are the probabilities in the search space.

Assuming one sequence has the highest probability among all |V|⁵ sequences: in this case, if "I like cold coffee" is the sequence with the highest probability, then that outcome is highlighted.

With exhaustive search, no matter how many times we run this, we get the same answer for the same input; there is no creative output. This falls under the deterministic strategies. A final illustration with the full tree of outputs is shown below.

Of these 9 possibilities, whichever has the maximum probability becomes the output at timestep = 2. If we had timestep = 3, we would have 27 sequences with probabilities, and we would take the maximum score over all 27 sequences.

If |V| = 40,000, then we need to run the decoder 40,000 times in parallel at every step.
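To make the procedure concrete, here is a minimal Python sketch of exhaustive search. The bigram-style probability table is a made-up stand-in for the decoder (a real GPT would compute P(next token | full prefix) with a forward pass), and all names here are hypothetical.

```python
import itertools

# Toy vocabulary from the example above.
VOCAB = ["cold", "coffee", "I", "like", "water", "<stop>"]

# Hypothetical bigram-style table standing in for the decoder.
BIGRAM = {
    None:   {"I": 0.5, "coffee": 0.2, "cold": 0.1, "like": 0.1, "water": 0.05, "<stop>": 0.05},
    "I":    {"like": 0.55, "coffee": 0.15, "cold": 0.1, "water": 0.1, "I": 0.05, "<stop>": 0.05},
    "like": {"cold": 0.45, "coffee": 0.2, "water": 0.2, "I": 0.05, "like": 0.05, "<stop>": 0.05},
    "cold": {"coffee": 0.35, "water": 0.35, "<stop>": 0.1, "I": 0.1, "like": 0.05, "cold": 0.05},
}

def next_token_probs(prefix):
    """Return P(next token | prefix); in this toy model only the last token matters."""
    last = prefix[-1] if prefix else None
    return BIGRAM.get(last, {w: 1.0 / len(VOCAB) for w in VOCAB})

def exhaustive_search(length):
    """Score every |V|^length sequence and return the most probable one."""
    best_seq, best_prob = None, 0.0
    for seq in itertools.product(VOCAB, repeat=length):
        prob = 1.0
        for t in range(length):
            # Chain rule: multiply P(x_t | x_1 ... x_{t-1}) at every step.
            prob *= next_token_probs(seq[:t])[seq[t]]
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq, best_prob

print(exhaustive_search(4))  # scores all 6^4 = 1296 candidate sequences
```

Even on this toy setup the cost is visible: the number of scored sequences grows as |V|^T, which is why exhaustive search is infeasible for a real vocabulary of 40,000 tokens.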

Greedy Search:

With greedy search, at each time step we always output the token with the highest probability (hence "greedy").

p(w1 = I) = 0.5

p(w2 = like | w1 = I) = 0.35

p(w3 = cold | w1, w2) = 0.45

p(w4 = coffee | w1, w2, w3) = 0.35

p(w5 = <stop> | w1, w2, w3, w4) = 0.5

Then the probability of the generated sequence is

p(w1, w2, w3, w4, w5) = 0.5 × 0.35 × 0.45 × 0.35 × 0.5 ≈ 0.014

Some limitations!

Is this the most likely sequence?

What if we want to get a variety of sequences of the same length?

If the starting token is the word “I”, then it will always end up producing the same sequence: I like cold coffee.

What if we picked the second most probable token in the first time step?

Then the conditional distribution at the subsequent time steps changes, and the probability of the generated sequence becomes

p(w1, w2, w3, w4, w5) = 0.25 × 0.55 × 0.65 × 0.8 × 0.5 ≈ 0.036

We could output this sequence instead of the one generated by greedy search. Note, however, that this will still produce the same output whenever we send the same input tokens. Also, greedily selecting the maximum-probability token at every time step does not always give the sequence with the maximum overall probability.
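Here is a minimal greedy-decoding sketch, reusing the hypothetical next_token_probs stand-in from the exhaustive-search sketch above.

```python
def greedy_decode(prefix=(), max_steps=10):
    """At every step, emit the single most probable next token (argmax)."""
    tokens = list(prefix)
    for _ in range(max_steps):
        probs = next_token_probs(tuple(tokens))
        best = max(probs, key=probs.get)  # deterministic: always the argmax
        tokens.append(best)
        if best == "<stop>":
            break
    return tokens

# Same input, same output, every single run.
print(greedy_decode())
```

With this toy table, greedy decoding even gets stuck repeating "cold coffee", a small illustration of the repetitive behaviour discussed under beam search below.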

Beam search:

Instead of considering the probabilities of all the tokens at every time step (as in exhaustive search), consider only the top-k tokens.

Suppose k = 2. At timestep = 2 we keep two tokens, e.g., "I" and "cold", and we will have 12 such candidate sequences.

Now we have to choose the tokens that maximize the probability of the sequence. This requires k × |V| computations at each timestep. At the second timestep, the 2 × 6 = 12 candidates are ranked, and we select the highest-probability sequences.

Let us select the top 2 from the above probability scores.

Following similar calculations for timestep = 3, we extend each kept sequence and again choose the top words or tokens.

We will have k sequences at the end of time step T, and we output the sequence with the highest probability.

The parameter k is called the beam size. Beam search is an approximation to exhaustive search: if k = 1 it is equal to greedy search, if k > 1 we are doing beam search, and if k = |V| we are doing exhaustive search.

Now let's take an example with k = 2 and a vocabulary of size |V|.

Of the above 2 × |V| values, we again take the top 2 probabilities.

We will get more sequences like that, but again only 2 sequences go forward at each step, so finally our flowchart looks like the tree shown below.
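In code, a minimal beam-search sketch (again reusing the hypothetical next_token_probs stand-in from above) looks like this:

```python
def beam_search(beam_size=2, max_steps=4):
    """Keep the k most probable partial sequences at every step.
    k = 1 reduces to greedy search; k = |V| is exhaustive search."""
    beams = [([], 1.0)]  # (sequence, probability) pairs
    for _ in range(max_steps):
        candidates = []
        for seq, prob in beams:
            if seq and seq[-1] == "<stop>":
                candidates.append((seq, prob))  # finished beams carry over
                continue
            # k x |V| candidate extensions per step, as described above.
            for token, p in next_token_probs(tuple(seq)).items():
                candidates.append((seq + [token], prob * p))
        # Rank all candidates and keep only the top-k of them.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return max(beams, key=lambda b: b[1])

print(beam_search(beam_size=2))  # deterministic, just like greedy search
```

Like greedy search, this is deterministic: the same input always yields the same ranked beams.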

  • Both greedy search and beam search are prone to degeneration, i.e., they can be repetitive, without any creativity.
  • Latency for greedy search is lower than for beam search.
  • Neither greedy search nor beam search results in creative outputs.
  • Note, however, that the beam search strategy is highly suitable for tasks like translation and summarization.

Basically, we want some surprises, some creative answers or outputs; hence we need sampling-based strategies rather than greedy or beam search.

Sampling Strategies — Top-K

Here, at every time step, we consider only the top-k tokens from the probability distribution.

We then sample a token from these top-k tokens. Say k = 2.

The probabilities of the top-k tokens are renormalized before sampling a token: P(I) = 0.4/(0.25 + 0.4) ≈ 0.61 and P(coffee) = 0.25/(0.25 + 0.4) ≈ 0.39.

Now assume a random number generator that produces a value between 0 and 1: rand(0,1). If the number obtained is ~0.7, then "coffee" becomes the input token at timestep 2; if instead the random number is ~0.2, then "I" becomes the input at timestep 2.
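A minimal sketch of this mechanic, using the renormalized probabilities from the example above (the helper name is hypothetical):

```python
import random

def sample_with_rand(probs):
    """Map a single rand(0,1) draw to a token via cumulative probability."""
    r, cumulative = random.random(), 0.0
    for token, p in probs.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # guard against floating-point rounding at the boundary

# Renormalized top-2 distribution from the example above:
# "I" owns the interval [0, 0.61) and "coffee" owns [0.61, 1.0),
# so r ~ 0.2 picks "I" while r ~ 0.7 picks "coffee".
print(sample_with_rand({"I": 0.61, "coffee": 0.39}))
```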

Continuing the generation with top-k sampling, suppose the top 2 tokens at the next step are

like <stop>

The normalized probabilities are 0.15/(0.55 + 0.15) ≈ 0.21 for "like" and 0.55/(0.55 + 0.15) ≈ 0.79 for "<stop>".

Now we run the rand function again to generate a number between 0 and 1. If the value is, say, 0.9, then "<stop>" is sampled and the generation process stops there; on another run the random number may select "like" instead, and generation continues. Through this random generation we get different outputs on different runs: maybe the first time, after "I", "<stop>" is generated, while other runs produce different outcomes, as shown below.

The surprise is an outcome of randomness. How does beam search compare with human prediction at every time step?

Beam search produces outputs with very high probability, so we do not see any surprises. But if humans are asked to fill in the sentences, we get different, sometimes low-probability outcomes: human predictions have high variance, whereas beam search predictions have low variance. Giving other highly probable tokens a chance leads to variety in the generated sequences.

Suppose we have the top 5 words from the 40K vocabulary (I, go, where, now, then) with probabilities (0.3, 0.2, 0.1, 0.1, 0.3) respectively.

The random generator produces a number between 0 and 1, and based on that value we sample a word or token. Remember that we are not picking randomly from the whole 40K vocabulary: we already have the top 5 words from the 40K vocabulary, and we build sequences only from this subset. It is random, but a controlled random selection of sequences, as in the sketch below.
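A minimal top-k sampling sketch under these assumptions; the five words and their probabilities are taken from the example above.

```python
import random

def top_k_sample(probs, k=5):
    """Keep the k most probable tokens, renormalize, then sample one.
    Random, but controlled: tokens outside the top-k can never appear."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]
    return random.choices(tokens, weights=weights)[0]

# The five candidate words and probabilities from the example above.
top5 = {"I": 0.3, "go": 0.2, "where": 0.1, "now": 0.1, "then": 0.3}
print(top_k_sample(top5, k=5))  # different runs can return different words
```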

Sampling Strategies — Top-P

What should the optimal value of k be?

Let us take 2 examples: a flat and a peaky distribution, respectively.

Example-1: (flat distribution)

Example-2: (peaky distribution)

Depending on the type of distribution, the right value of k varies: with a peaky distribution, a high value of k does not help, whereas with a flat distribution we can afford a somewhat higher value of k.

If we fix the value of k, say k = 5, then we miss out on other, equally probable tokens in the flat distribution.

The model will then fail to generate a variety of sentences (less creative).

For a peaked distribution, using the same value of k = 5, we might end up generating meaningless sentences.

Solution — 1: Low-temperature sampling

For temperature T = 1 the distribution is just the normal softmax. Given the logits u1, …, u|V| and a temperature parameter T, compute the probabilities as

P(wi) = exp(ui / T) / Σj exp(uj / T)

If we decrease the value of T, we get a peakier distribution.

  • Low temperature = peakier distribution = less creativity
  • High temperature = flatter distribution = more creativity
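A minimal sketch of the temperature-scaled softmax above; the logits are made-up values for four tokens, just to show how T reshapes the distribution.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P(i) = exp(u_i / T) / sum_j exp(u_j / T); T = 1 is the plain softmax."""
    scaled = [u / T for u in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical logits for four tokens
print(softmax_with_temperature(logits, T=1.0))  # baseline softmax
print(softmax_with_temperature(logits, T=0.2))  # peaky: less creative
print(softmax_with_temperature(logits, T=2.0))  # flat: more creative
```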

Solution — 2: Top-P (Nucleus) sampling

Let us again consider the above 2 examples.

  • Sort the probabilities in descending order
  • Set a value for the parameter p, 0 < p < 1
  • Sum the probabilities of the tokens, starting from the top token
  • Once the sum exceeds p, stop and sample a token from the selected set
  • It is similar to top-k with k being dynamic. Suppose we set p = 0.6 as the threshold.

For the example-1 (flat) distribution: the model will sample from the tokens (thought, knew, had, saw, said)

For the example-2 (peaky) distribution: the model will sample from the tokens (hot, cooling)

Based on the random values generated, different word tokens get selected for the sequence, as in the sketch below.
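A minimal nucleus-sampling sketch; the exact probabilities in the peaky distribution are assumed values for illustration, chosen so that with p = 0.6 the nucleus is just {hot, cooling}, as in example 2.

```python
import random

def top_p_sample(probs, p=0.6):
    """Nucleus sampling: take the smallest top set of tokens whose
    cumulative probability reaches p, renormalize, then sample one."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # the cutoff adapts to the distribution's shape
            break
    total = sum(pr for _, pr in nucleus)
    tokens = [t for t, _ in nucleus]
    weights = [pr / total for _, pr in nucleus]
    return random.choices(tokens, weights=weights)[0]

# Illustrative peaky distribution (example 2): 0.45 + 0.20 >= 0.6,
# so only "hot" and "cooling" make it into the nucleus.
peaky = {"hot": 0.45, "cooling": 0.20, "warm": 0.15, "cool": 0.10, "on": 0.10}
print(top_p_sample(peaky, p=0.6))
```

Unlike a fixed k, the nucleus shrinks to a couple of tokens on a peaky distribution and widens on a flat one, which is exactly the adaptivity motivated above.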

This is a wrap on all the decoding strategies for decoder-only models, i.e., GPT, where we looked at deterministic and stochastic approaches. The stochastic strategies ensure that even though the transformer's computation is deterministic, the sampling function we add at the end samples different tokens every time and hence generates different sequences.

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra
