How to set sampling temperature for GPT models

Isaac Misri
3 min read · Nov 9, 2021

GPT (Generative Pre-trained Transformer) models have several parameters that control text generation, and understanding them is crucial. Plenty has been said about the top_k and top_p parameters, but there still seems to be some confusion about the temperature parameter. Here, I hope to clarify how this parameter affects your model’s output.

Suppose we’ve trained a model on our life story and we start a sequence with “I like red”. The model will then look at all possible words and sample from their probability distribution to predict the next word. Now suppose that our model’s vocabulary isn’t very large and it only considers four words: “onions”, “pants”, “shoes”, and “apples”. Say “apples” has the highest probability of being chosen (p = 0.644) and “onions” the lowest (p = 0.032).
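To make that concrete, here’s a toy sketch of the sampling step. Only the probabilities for “apples” and “onions” are given above; the values for “pants” and “shoes” are made up so that everything sums to one.

```python
import numpy as np

# Toy vocabulary from the example. The "apples" and "onions" probabilities come
# from the text above; the other two are filler values chosen so the total is 1.
vocab = ["onions", "pants", "shoes", "apples"]
probs = np.array([0.032, 0.144, 0.180, 0.644])

# The model predicts the next word by sampling from this distribution.
next_word = np.random.choice(vocab, p=probs)
print(next_word)  # "apples" about 64% of the time, "onions" only about 3%
```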

In this example it is clear that “apples” will most likely be chosen. What if we want to make our model less deterministic and jazz up the word sequence? Or what if we want an even more predictable and more deterministic output sequence? This is where temperature kicks in.

Temperature is related to the Boltzmann distribution (https://en.wikipedia.org/wiki/Boltzmann_distribution), which assigns a state with energy ε_i the probability

p_i = exp(−ε_i / (kT)) / Σ_j exp(−ε_j / (kT))

Hopefully, this looks somewhat familiar. That’s because it is essentially our softmax function with an added temperature (T) parameter: the logits are divided by the temperature value and then passed to the softmax function to produce a new probability distribution. In the “I like red ___” example above, our temperature value was essentially 1. Let’s see what happens when we modify this value.
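In code, that boils down to dividing the logits by T before applying softmax. Here’s a minimal sketch (the function name and interface are mine, not from any particular library):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by T, then softmax them into a probability distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```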

Recomputing the distribution at a few different temperatures makes the effect clear: as we decrease the temperature, the model becomes more and more likely to choose “apples”. At the other extreme, as we increase the temperature the distribution approaches a uniform one (as T goes to infinity). At T = 50, choosing the word “onions” is almost as likely as choosing “apples”.
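Here’s a quick demo of that behaviour. The logits are picked so that T = 1 reproduces the example distribution; everything else is illustrative.

```python
import numpy as np

vocab = ["onions", "pants", "shoes", "apples"]
logits = np.log([0.032, 0.144, 0.180, 0.644])   # chosen so T = 1 matches the example

for T in (0.1, 1.0, 5.0, 50.0):
    scaled = logits / T
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    print(f"T={T:>4}: " + "  ".join(f"{w}={p:.3f}" for w, p in zip(vocab, probs)))
```

At T = 0.1 the model picks “apples” essentially every time, while at T = 50 all four words end up close to 0.25.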

Oftentimes, temperature is associated with the model’s “creativity”. But that isn’t really what’s going on: temperature simply reshapes the probability distribution over the next word. At lower temperatures the model is more deterministic, and at higher temperatures it’s less so.

Ultimately, you’re going to have to play around with this parameter for your specific task and see what works best. Generally speaking, though, it looks like T values between 0.7 and 0.9 are used for creative text generation, while values above 1 tend to “derail” your model’s train of thought. For more on temperature, check out this and this post.
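As one concrete example of where this knob actually lives, here’s roughly what setting the temperature looks like with the Hugging Face transformers library and GPT-2; the model, prompt, and values here are just for illustration, not a recommendation:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("I like red", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # sample from the distribution instead of greedy decoding
    temperature=0.8,     # somewhere in the 0.7-0.9 range discussed above
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```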
