The Science of Control: How Temperature, Top_p, and Top_k Shape Large Language Models

A practical example of how the temperature, top_k, and top_p hyperparameters work in models such as ChatGPT or Llama2

Daniel Puente Viejo
5 min read · Nov 1, 2023

In the age of artificial intelligence, language models have become indispensable tools for a wide array of applications, from chatbots and automated content generation to machine translation and text completion. These large language models (LLMs) are incredibly powerful, but with great power comes the need for precise control. In the realm of text generation, understanding and mastering hyperparameters like temperature, top_p, and top_k is the key to harnessing the full potential of these AI behemoths.

In this article, we embark on an exciting journey to demystify these three essential hyperparameters that govern text generation and creativity in large language models. We will delve deep into the intricacies of temperature, explore the subtleties of top_p, and understand the nuances of top_k. With each step, you’ll gain insights into how these settings work, how they affect the output, and how you can wield them to tailor your AI-generated text to your precise needs.

Let’s dive in!

1. Temperature

Temperature is a crucial hyperparameter for fine-tuning the output of large language models (LLMs) like GPT-3. It plays a vital role in controlling the randomness and creativity of the generated text. These models produce output as a function of word probabilities: to generate the next word, a probability is assigned to each and every word in the dictionary and, based on this distribution, it is determined how to proceed. The main idea of this hyperparameter is to adjust these probabilities to force randomness or determinism.

The generation of probabilities for each word in the dictionary is done in the last layer by applying softmax as an activation function. Recall that softmax acts on the logits to transform them into probabilities. And this is precisely where the temperature comes into play.
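
In the temperature-scaled softmax, the probability assigned to token i is

p_i = \frac{e^{x_i / T}}{\sum_{j} e^{x_j / T}}

where x_i is the logit of token i and T is the temperature (T = 1 recovers the standard softmax).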

As can be seen from the above formula, softmax exponentiates each logit (x) and then divides each exponentiated value by the sum of all the exponentiated values. This step ensures that the output is a probability distribution, meaning that the values are between 0 and 1 and sum up to 1. The temperature hyperparameter is the value “T” by which each logit is divided before exponentiation; low temperatures skew the probabilities much more toward the extremes.

Let’s imagine we have the following sentence:
- Yesterday I went to the cinema to see a ___

The idea is to predict the next word. The neural network will determine a probability for each of the words in the dictionary; in this case, we will use only 5 words to simplify the process.
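
As a minimal sketch (the five candidate words and their logits below are made up purely for illustration), this is how the temperature-scaled softmax turns logits into probabilities:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits into probabilities, scaling them by a temperature T."""
    scaled = np.array(logits, dtype=float) / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical candidates for "Yesterday I went to the cinema to see a ___"
words = ["movie", "film", "documentary", "play", "dog"]
logits = [3.2, 2.9, 1.5, 0.8, 0.1]  # made-up logits, for illustration only

probs = softmax_with_temperature(logits, temperature=1.0)
for word, p in zip(words, probs):
    print(f"{word:12s} {p:.3f}")
```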

From the probabilities generated, the next word is then sampled at random from this distribution. In the example above, normal softmax (temperature = 1) has been applied. In the following, we will see the result for extreme values of this hyperparameter.

As the temperature approaches 0, the higher probabilities increase even further, making the most likely word almost certain to be selected. Conversely, when the temperature gets much higher, the probabilities are softened, making more unexpected words more likely to be selected.
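
Reusing softmax_with_temperature and the made-up logits from the sketch above, we can compare a low, a neutral, and a high temperature (the values are arbitrary):

```python
for t in (0.2, 1.0, 2.0):
    probs_t = softmax_with_temperature(logits, temperature=t)
    print(f"T = {t}:", {w: round(p, 3) for w, p in zip(words, probs_t)})
```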

Therefore, when the temperature is equal to 0, generation becomes deterministic. The one exception is a tie: if there are 2 words with the same logit value, and hence the same probability, a temperature of 0 splits the probability mass equally between them (each gets 0.5, adding up to 1), so either may be selected.

2. Top_p & Top_k

These 2 hyperparameters serve the same purpose as temperature but achieve it in a different way.

  1. Top_p (Nucleus Sampling): It selects the most likely tokens from a probability distribution, considering the cumulative probability until it reaches a predefined threshold “p”. This limits the number of choices and helps avoid overly diverse or nonsensical outputs.
  2. Top_k (Top-k Sampling): It restricts the selection of tokens to the “k” most likely options, based on their probabilities. This prevents the model from considering tokens with very low probabilities, making the output more focused and coherent.

Top_p works as follows: the probabilities go through a sorting process, followed by an accumulation of the probabilities. A threshold “p” is then applied to this cumulative sum, the words that fall within it are selected, and finally the probabilities are recalculated. In the event that the “p” value is less than the largest probability, the most likely word is simply selected.
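
A minimal sketch of this filtering step, using the same hypothetical five words (the probabilities are made up for illustration):

```python
import numpy as np

words = ["movie", "film", "documentary", "play", "dog"]   # hypothetical candidates
probs = np.array([0.45, 0.33, 0.13, 0.06, 0.03])          # made-up probabilities

def top_p_filter(words, probs, p=0.8):
    """Keep the smallest set of words whose cumulative probability reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]               # sort by probability, descending
    cumulative = np.cumsum(np.array(probs)[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix whose sum reaches p
    kept = order[:cutoff]
    new_probs = np.array(probs)[kept]
    return [words[i] for i in kept], new_probs / new_probs.sum()  # recalculated probabilities

kept_words, kept_probs = top_p_filter(words, probs, p=0.8)
print(kept_words, kept_probs.round(3))  # ['movie', 'film', 'documentary'] [0.495 0.363 0.143]
```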

Top_k works in the same way, but it keeps only the “k” most probable words. As in the top_p process, the probabilities are sorted, but in this case there is no need to accumulate them. Once sorted, only the top “k” words are kept and, finally, the probabilities are recalculated.
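
And an equally small sketch of top_k, reusing the same hypothetical words and probs (with “k” = 3 chosen arbitrarily):

```python
def top_k_filter(words, probs, k=3):
    """Keep only the k most probable words, then renormalize."""
    order = np.argsort(probs)[::-1][:k]   # sort descending and keep the first k
    new_probs = np.array(probs)[order]
    return [words[i] for i in order], new_probs / new_probs.sum()  # recalculated probabilities

kept_words, kept_probs = top_k_filter(words, probs, k=3)
print(kept_words, kept_probs.round(3))  # ['movie', 'film', 'documentary'] [0.495 0.363 0.143]
```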

Thanks for Reading!

Thank you very much for reading the article. If you liked it, don’t hesitate to follow me on LinkedIn.

#deeplearning #python #machinelearning #llm
