Understanding the controllable parameters for running inference with your Large Language Model

Explore With Yasir

This article explains the parameters and settings you can control while running inference with your Large Language Model.

During inference or text generation in large language models, some settings and techniques can be controlled to influence the output. These settings are specific to the inference phase and do not affect the model’s training. Here are a few examples:

  1. Max Length: This setting determines the maximum length of the generated output. It allows you to limit the number of tokens generated to avoid excessively long responses.
  2. Top-k Sampling: Top-k sampling is a technique that constrains the next token selection to the top-k most likely tokens at each step. It helps control the diversity and randomness of the generated text by narrowing down the options.
  3. Top-p (Nucleus) Sampling: Top-p or nucleus sampling limits the selection of tokens to the smallest subset of the vocabulary whose cumulative probability mass reaches a threshold value “p”. It helps in controlling the diversity of the generated output (covered in depth below).
  4. Repetition Penalty: Repetition penalty is a technique that penalizes or reduces the probability of generating tokens that have recently appeared in the generated text. It encourages the model to generate more diverse and non-repetitive output.
  5. Context Prompting: By providing a specific context prompt or input, you can guide the model to generate text that aligns with that context. This can help ensure that the generated output is relevant and coherent within the given context.
  6. Temperature Scaling: Temperature scaling controls the randomness and diversity of the generated output (covered in depth below). By adjusting the temperature, you can influence the trade-off between exploration and exploitation during text generation.
  7. Post-processing: After generating the text, you can apply post-processing techniques to refine the output, remove unwanted artifacts, or improve the overall quality and coherence of the generated text.

These settings can be adjusted and fine-tuned based on the specific requirements of your application or use case. The choice of settings will depend on factors such as desired output diversity, coherence, relevance, and the nature of the task you are performing with the large language model during inference.
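
As a concrete starting point, here is a minimal sketch of how these settings map onto generation arguments in the Hugging Face transformers library (the model name, prompt, and parameter values are illustrative assumptions, not recommendations):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; substitute any causal language model you have access to.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Context prompting (5): the prompt steers what the model generates.
prompt = "The key settings for LLM inference are"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,       # (1) cap on the number of newly generated tokens
    do_sample=True,          # enable sampling instead of greedy decoding
    top_k=50,                # (2) keep only the 50 most likely tokens
    top_p=0.9,               # (3) nucleus sampling threshold
    repetition_penalty=1.2,  # (4) penalize tokens that already appeared
    temperature=0.8,         # (6) temperature scaling of the logits
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Post-processing (7) would happen after decoding, for example trimming whitespace or filtering out unwanted strings.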

Below is a more in-depth look at each of the settings mentioned above.

Max length / Max new tokens

“Max new tokens” is a parameter or setting used during large language model inference that limits the number of new tokens that can be generated during text generation.

In language models, text generation involves sequentially predicting and generating tokens based on a given input prompt or context. The “Max new tokens” parameter allows you to set an upper limit on the number of tokens generated in addition to the input tokens.

By setting a maximum limit, you can control the length of the generated output and prevent the model from generating excessively long or verbose responses. This can be useful in various scenarios, such as when generating short answers, tweets, or summaries.

The value for “Max new tokens” is typically specified as an integer, representing the maximum number of additional tokens beyond the input tokens that the model should generate. The actual number of tokens in the generated output may be lower if the model encounters a special token indicating the end of the text or if it reaches the limit before generating the specified number of tokens.
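
In the Hugging Face transformers library, for example, “max_new_tokens” counts only the generated tokens, while the older “max_length” parameter counts the prompt tokens as well. A short sketch of the difference, reusing the model and inputs from the earlier example:

```python
# Assumes `model`, `tokenizer`, and `inputs` as in the earlier sketch.

# max_new_tokens: up to 20 tokens are generated *beyond* the prompt.
out_new = model.generate(**inputs, max_new_tokens=20)

# max_length: the prompt tokens count toward the 20-token budget,
# so fewer (or even zero) new tokens may be produced.
out_total = model.generate(**inputs, max_length=20)

prompt_len = inputs["input_ids"].shape[1]
print("new tokens generated:", out_new.shape[1] - prompt_len)
print("total length (capped):", out_total.shape[1])
```

In both cases, generation may also stop earlier if the model emits an end-of-text token.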

By adjusting the “Max new tokens” parameter, you can control the length and verbosity of the model’s generated responses to align with the desired requirements of your application or use case.

Top-p (nucleus sampling)

“Top-p” or “nucleus sampling” is a probabilistic sampling technique used in large language models like GPT (Generative Pre-trained Transformer) during text generation. It helps control the diversity and randomness of the generated output.

During text generation, the model predicts the probability distribution of the next token based on the preceding context. Top-p sampling selects the next token from the “nucleus”: the smallest subset of the vocabulary whose most likely tokens together account for a cumulative probability mass of at least “p”. The value of “p” represents a cumulative probability threshold.

Here’s how Top-p or nucleus sampling works:

  1. Calculate the cumulative probabilities of the tokens predicted by the model, sorted in descending order.
  2. Keep adding the probabilities until the cumulative probability surpasses the threshold “p”.
  3. Consider only the tokens that contribute to the cumulative probability up to the threshold “p” (the nucleus).
  4. Randomly sample from the nucleus after renormalizing the probabilities of the remaining tokens, so tokens with a larger share of the cumulative probability mass remain more likely (see the sketch below).
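
To make these steps concrete, here is a minimal NumPy sketch of nucleus sampling over a toy distribution (the probability values are made up for illustration):

```python
import numpy as np

def nucleus_sample(probs, p, rng=None):
    """Sample a token index from `probs` using top-p (nucleus) sampling."""
    if rng is None:
        rng = np.random.default_rng()
    # Step 1: sort token probabilities in descending order.
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    # Step 2: accumulate probabilities until the threshold p is surpassed.
    cumulative = np.cumsum(sorted_probs)
    # Step 3: the nucleus is the smallest set of tokens whose cumulative
    # probability mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    # Step 4: renormalize within the nucleus and sample proportionally.
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return rng.choice(nucleus, p=nucleus_probs)

# Toy distribution over a 5-token vocabulary.
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(nucleus_sample(probs, p=0.9))  # samples one of tokens 0-3
```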

By using Top-p sampling, you allow for a more controlled and diverse generation process. It avoids overly repetitive or deterministic outputs that can arise from always selecting the most likely token. Instead, the model has a chance to select from a wider range of options within the nucleus, which promotes more varied and creative text generation.

The value of “p” determines the diversity of the generated output. Higher values of “p” result in more randomness and diversity, as a larger number of tokens are considered for sampling. Lower values of “p” lead to more focused and deterministic output, as sampling is restricted to only the most likely tokens.

Top-p or nucleus sampling is an effective technique to balance exploration and exploitation in text generation, allowing for controlled creativity and output diversity while maintaining coherence and quality.

Repetition penalty

“Repetition penalty” is a technique used in large language models during text generation to discourage repetitive or redundant output. It is designed to address the tendency of language models to produce repeated phrases, sentences, or patterns.

When applying repetition penalty, the model assigns a penalty or reduces the probability of generating tokens that have appeared recently in the generated text. This penalty helps promote more diverse and varied output by encouraging the model to generate new and different content instead of repeating itself.

The specific implementation of repetition penalty can vary depending on the model or framework being used. Common approaches include:

  1. Token-Based Penalty: A penalty is applied to tokens based on their frequency of occurrence in the recent context or generated output. Tokens that have appeared more frequently are penalized more, reducing their probability of being generated again (see the sketch after this list).
  2. N-gram Penalty: The model considers sequences of tokens (n-grams) and applies a penalty based on the repetition of n-grams in the generated output. Higher penalties are assigned to n-grams that have occurred more frequently.
  3. Temperature Scaling: Temperature scaling can be combined with repetition penalty to control the trade-off between exploration and exploitation during text generation. A higher temperature increases randomness and exploration, while a lower temperature promotes more focused and deterministic output.
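
As an illustration of the token-based approach, here is a sketch in the style of the penalty introduced in the CTRL paper (which is also what the repetition_penalty argument in Hugging Face transformers is based on): the logits of previously generated tokens are rescaled before sampling.

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens that already appear in the output."""
    logits = logits.copy()
    for token_id in set(generated_ids):
        # Dividing a positive logit (or multiplying a negative one) by a
        # penalty > 1 always reduces that token's final probability.
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

# Toy logits over a 5-token vocabulary; token 2 was just generated.
logits = np.array([1.0, 0.5, 2.0, -0.5, 0.1])
print(apply_repetition_penalty(logits, generated_ids=[2]))
# Token 2's logit drops from 2.0 to ~1.67, making repetition less likely.
```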

Repetition penalty is a useful technique for improving the diversity and coherence of generated text. By penalizing repeated content, it encourages the model to generate more novel and varied responses, enhancing the quality and naturalness of the output.

The specific implementation and tuning of repetition penalty may vary depending on the language model and the desired level of repetition control. Experimentation and fine-tuning are often necessary to find the optimal balance between avoiding excessive repetition and maintaining the overall coherence and relevance of the generated text.

Temperature Scaling

The “temperature” setting is a parameter used in large language models during text generation to control the randomness and diversity of the generated output. It helps balance the exploration and exploitation trade-off during the sampling process.

When generating text, language models predict the probability distribution of the next token based on the preceding context. The temperature parameter adjusts the logits or probabilities of the predicted tokens before sampling.

A higher temperature value (> 1.0) increases the randomness and diversity of the generated output. It makes the model assign more equal probabilities to a wider range of tokens, allowing for more exploration and creative variations in the generated text. This can result in more unexpected and diverse output, but it may also introduce more noise or less coherent responses.

On the other hand, a lower temperature value (< 1.0) reduces the randomness and encourages the model to focus on the most probable tokens. It makes the distribution peakier, with higher probabilities assigned to the most likely tokens. This can lead to more deterministic and conservative output, with less variation and more coherent responses.
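
A small NumPy sketch of how temperature reshapes the next-token distribution (the logit values are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# Lower T makes the distribution peakier; higher T flattens it.
```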

Choosing the appropriate temperature value depends on the desired balance between randomness and coherence in the generated text. It is often a matter of experimentation and adjusting the temperature setting to achieve the desired level of creativity, diversity, and coherence based on the specific use case or application.

Higher temperature values encourage exploration and can be useful when generating creative or speculative text, while lower temperature values promote more focused and controlled output, suitable for generating more specific or precise responses.

Note: You may play with these parameters for a LLaMA model here
