Demystifying LLM Parameters for Inference

Praveen kanna
4 min read · Nov 4, 2023


Have you ever wondered how configurations like temperature and top_p work? Every LLM has its own set of configuration controls, but certain ones are common or similar across models. Let’s explore them in detail.

It’s important to clarify that the configurations we are talking about are distinct from the parameters learned during the training phase. The configurations I’m talking about control the creativity of the output and the maximum number of tokens a model can generate.

Encoder-decoder transformer architecture

The encoder-decoder transformer architecture diagram below is a reference to help you understand the softmax layer, which is what we are going to control. The architecture is similar for decoder-only models like GPT.

1. Max new tokens

When we pass input words or sentences to the model, they are first converted to text embeddings and passed through several self-attention layers, then through the feed-forward network, and finally to a softmax layer. At this point we get our first token (I know this is an oversimplification, but most LLM architectures work roughly like this). That first token is passed back to the model in a loop, goes through all the layers again, and out comes the next token. This process continues until the model predicts an end-of-sequence token. At that point, the final sequence of tokens can be detokenized into words, and you have your output.

Now that you understand how each token is generated: when you set max_new_tokens, you control the number of times the model goes through this next-token selection process. This is the maximum number of tokens (roughly, words) the model is allowed to generate.
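As a minimal sketch of that loop in Python (the `model`, `tokenizer`, and `select_next_token` names are hypothetical stand-ins, not a specific library’s API):

```python
def generate(model, tokenizer, prompt, max_new_tokens):
    token_ids = tokenizer.encode(prompt)           # text -> token ids
    for _ in range(max_new_tokens):                # max_new_tokens caps this loop
        probs = model(token_ids)                   # softmax output over the whole vocabulary
        next_id = select_next_token(probs)         # greedy or sampling, covered below
        if next_id == tokenizer.eos_token_id:      # stop at the end-of-sequence token
            break
        token_ids.append(next_id)                  # feed the new token back in
    return tokenizer.decode(token_ids)             # detokenize back into words
```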

Note: Every model has a limit on how much context it can handle. Say the model can handle a 4000-token context, your input is 3000 tokens, and you set max_new_tokens to 1500; you will receive an error indicating that the model can only handle a 4000-token context. In short, make sure your max_new_tokens satisfies the formula below.

max_new_tokens ≤ model_context_length - input_tokens
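Plugging in the numbers from the note above (all values hypothetical):

```python
model_context_length = 4000
input_tokens = 3000
max_new_tokens = 1500

if max_new_tokens > model_context_length - input_tokens:
    # Raised here: 1500 > 1000, so the request would exceed the 4000-token context.
    raise ValueError("max_new_tokens exceeds the remaining context window")
```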

Softmax Layer

The output from the transformer’s softmax layer is a probability distribution across the entire vocabulary the model uses. In the screenshot you can see the word ‘fallen’. The output of the softmax is not the token ‘fallen’ itself; it is a list of words, each with a probability score next to it. A word is then selected based on those probability scores, and there are multiple approaches for doing this. The following are the common ones.
a) Greedy Decoding
This is the simplest form of next-word prediction: the model chooses the word with the highest probability. It can work very well for short generations but is susceptible to repeated words or repeated sequences of words. If you want text that is more natural, more creative, and avoids repetition, you need to use other controls.
b) Random Sampling
Random sampling is the easiest way to introduce some variability. Instead of selecting the most probable word every time, the model chooses an output word at random, using the probability distribution to weight the selection. Two settings, top_p and top_k, are sampling techniques commonly used to limit random sampling and increase the chance that the output will be sensible. Both approaches are sketched in the example below.
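Here is a toy comparison of the two approaches, with a made-up four-word vocabulary and made-up probabilities standing in for the softmax output:

```python
import numpy as np

# Toy stand-in for the softmax output (made-up vocabulary and probabilities).
vocab = np.array(["cake", "donut", "banana", "apple"])
probs = np.array([0.20, 0.45, 0.25, 0.10])

# a) Greedy decoding: always take the single most probable word.
greedy_word = vocab[np.argmax(probs)]            # -> "donut", every time

# b) Random sampling: draw a word using the probabilities as weights,
#    so less likely words still get picked occasionally.
sampled_word = np.random.choice(vocab, p=probs)  # usually "donut", sometimes others
```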

2. Top-K

Say we set K to 3; then only the top 3 probabilities are sampled from, and one of those words is selected at random. In this case it is ‘donut’.
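As a rough sketch (same made-up vocabulary and probabilities as above), top-K with K = 3 keeps the three most probable words, renormalizes them, and samples from just that set:

```python
import numpy as np

# Made-up softmax output for illustration.
vocab = np.array(["cake", "donut", "banana", "apple"])
probs = np.array([0.20, 0.45, 0.25, 0.10])

k = 3
top_idx = np.argsort(probs)[-k:]                      # indices of the k most probable words
top_probs = probs[top_idx] / probs[top_idx].sum()     # renormalize so they sum to 1
word = np.random.choice(vocab[top_idx], p=top_probs)  # "apple" can never be chosen here
```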

3. Top-P

Say we choose a p value of 0.30; then only the top words whose probabilities sum to 30% are kept, and one word is chosen at random from that set.
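A rough sketch of top-p (nucleus) sampling under the same made-up distribution: keep the smallest set of words whose probabilities add up to at least p, renormalize, and sample from that set.

```python
import numpy as np

# Made-up softmax output for illustration.
vocab = np.array(["cake", "donut", "banana", "apple"])
probs = np.array([0.20, 0.45, 0.25, 0.10])

p = 0.30
order = np.argsort(probs)[::-1]                           # word indices by probability, descending
cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1  # smallest set whose mass reaches p
nucleus = order[:cutoff]                                  # here only "donut" (0.45 already >= 0.30)
nucleus_probs = probs[nucleus] / probs[nucleus].sum()
word = np.random.choice(vocab[nucleus], p=nucleus_probs)
```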

4. Temperature

The most commonly used configuration is the temperature.
• The higher the temperature → the higher the randomness → the more creative the output.
• The lower the temperature → the lower the randomness → the generated words more closely follow the word sequences the model learned during training.

The temperature parameter influences the shape of the probability distribution the model calculates for the next token. It is a scaling factor applied within the final softmax layer of the model, and it determines how peaked or flat that distribution is.

Here, when the temperature is set to a lower value, the probability score of the word ‘cake’ is sharply peaked. At a higher temperature the probability distribution is broader and flatter. Random sampling is then applied on top of this distribution to select the word.

If the temperature is set to 1, the softmax function produces the unaltered probability distribution.
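As a small sketch of what that scaling looks like (the logit values here are made up), the logits are divided by the temperature before the softmax is applied, which sharpens or flattens the resulting distribution:

```python
import numpy as np

# Made-up logits (pre-softmax scores) for "cake", "donut", "banana", "apple".
logits = np.array([2.0, 1.0, 0.5, 0.1])

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature            # T < 1 sharpens, T > 1 flattens
    exps = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    return exps / exps.sum()

print(softmax_with_temperature(logits, 0.5))  # peaked: most of the mass on the top word
print(softmax_with_temperature(logits, 1.0))  # unaltered softmax distribution
print(softmax_with_temperature(logits, 2.0))  # broader, flatter distribution
```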

Note: The OpenAI documentation recommends altering either temperature or top_p, but not both.
