LLM — Configurations

What are the configuration parameters that can influence the model’s output during inference?

Pelin Balci
6 min read · Jul 22, 2023

You’ve decided to use a large language model directly. Great! But you can still tweak a few settings to get different outputs. Don’t forget to watch my YouTube video for this content 🎥

Let’s dive in!✨

The configuration parameters differ slightly between platforms; you may use SageMaker, Hugging Face, or Azure. Here is a screenshot from Azure OpenAI Studio:

Here is a link for Hugging Face and Amazon Sagemaker: https://huggingface.co/blog/sagemaker-huggingface-llm

Max Response shows the maximum number of tokens counted across both the prompt and the completion. Remember that the prompt includes your message as well as your few-shot examples, and the completion is the output of the model.

If you don’t know what a token is, just take a look at this website: https://platform.openai.com/tokenizer

Screenshot from OpenAI Website

On the other hand, SageMaker uses the term max_new_tokens, and it limits the output (completion) only.
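If you want to see what max_new_tokens does in practice, here is a minimal sketch. The prompt text is just a made-up example; the model is the same flan-t5 model we will load later in this post.

# A quick sketch: max_new_tokens caps the completion only, not the prompt.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

inputs = tokenizer("Summarize: The train leaves at nine-thirty.", return_tensors='pt')

# The completion is cut off after at most 10 new tokens.
config = GenerationConfig(max_new_tokens=10)
output_ids = model.generate(inputs["input_ids"], generation_config=config)[0]
print(tokenizer.decode(output_ids, skip_special_tokens=True))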

Remember that the last layer of a large language model is a softmax. It gives us probabilities for all the words in the vocabulary. (You can read my previous post here.)

After the softmax layer produces these probabilities, there are two main methods to select the next token: greedy decoding and random weighted sampling. Greedy simply selects the word with the highest probability. Top-p and top-k sampling techniques help limit random weighted sampling; when we use them, the output tends to be more sensible.
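Here is a tiny sketch of the difference. The four words and their probabilities are completely made up; imagine they came out of the softmax layer.

import numpy as np

# Made-up next-token probabilities from the softmax layer.
vocab = ["cake", "donut", "banana", "apple"]
probs = np.array([0.50, 0.30, 0.15, 0.05])

# Greedy: always pick the most probable token.
greedy_choice = vocab[int(np.argmax(probs))]   # -> "cake", every single time

# Random weighted sampling: pick a token with probability proportional to its score.
rng = np.random.default_rng()
sampled_choice = rng.choice(vocab, p=probs)    # -> usually "cake", sometimes "donut", rarely "apple"

print(greedy_choice, sampled_choice)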

Top p: “Similar to temperature, this controls randomness but uses a different method. Lowering Top P will narrow the model’s token selection to likelier tokens. Increasing Top P will let the model choose from tokens with both high and low likelihood. Try adjusting temperature or Top P but not both.” [Azure Open AI]

SageMaker’s explanation is: “The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling, default to null” [Sagemaker]

Top k: The number of highest probability vocabulary tokens to keep for top-k-filtering. Default value is null, which disables top-k-filtering. [Sagemaker]
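To make top-k and top-p more concrete, here is a small sketch that applies both filters to the same made-up distribution from above. This only illustrates the idea; it is not the library’s internal implementation.

import numpy as np

# Made-up next-token distribution for illustration.
vocab = np.array(["cake", "donut", "banana", "apple"])
probs = np.array([0.50, 0.30, 0.15, 0.05])

# Top-k: keep only the k most probable tokens, renormalize, then sample among them.
k = 2
top_k_idx = np.argsort(probs)[::-1][:k]
top_k_probs = probs[top_k_idx] / probs[top_k_idx].sum()   # cake: 0.625, donut: 0.375

# Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= p.
p = 0.9
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, p)) + 1
top_p_idx = order[:cutoff]
top_p_probs = probs[top_p_idx] / probs[top_p_idx].sum()   # keeps cake, donut, banana

print(vocab[top_k_idx], top_k_probs)
print(vocab[top_p_idx], top_p_probs)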

Temperature: “Controls randomness. Lowering the temperature means that the model will produce more repetitive and deterministic responses. Increasing the temperature will result in more unexpected or creative responses. Try adjusting temperature or Top P but not both.” [Azure OpenAI] It is the same in Sagemaker.

If you want a stable model and don’t want any random answers, you can set the temperature to 0.

I’m bored of reading too. Let’s go through some examples.

First, I would like to thank all the lecturers of the Generative AI with Large Language Models course on Coursera! These examples are similar to those described in the lessons.

The softmax layer gives a probability for every word, but let’s focus on just four candidate words:

Random sampling gives us a chance to be more creative.

You can limit creativity by choosing among the top k words.

Another way is to use the top-p method.

Temperature is a bit different: it scales the scores going into the softmax layer, so the output probabilities themselves change based on your choice of temperature.
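Here is a small sketch of that effect. The logits (the raw scores before the softmax) are made up; what matters is how the distribution changes with the temperature.

import numpy as np

# Made-up logits (raw scores before the softmax) for four candidate tokens.
logits = np.array([4.0, 3.0, 1.5, 0.5])

def softmax_with_temperature(logits, temperature):
    # Temperature rescales the logits before the softmax is applied.
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

print(softmax_with_temperature(logits, 0.5))  # sharper: almost all the mass on the top token
print(softmax_with_temperature(logits, 1.0))  # the original distribution
print(softmax_with_temperature(logits, 2.0))  # flatter: other tokens get a real chance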

Ready to see some code?

Here is a Colab link for inference. I use the google/flan-t5-base model from the Hugging Face Transformers library, and the task is summarization.

First, we need to import the libraries and load the model and the tokenizer:

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

# Load Dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

# Load Model & Tokenizer
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Let’s look at the data:

example_indices = [40, 200]
dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print("Ex", i + 1)
    print(dash_line)
    print('Input')
    print(dataset['test'][index]['dialogue'])
    print('Human summary')
    print(dataset['test'][index]['summary'])
OUTPUT: 
---------------------------------------------------------------------------------------------------
Ex 1
---------------------------------------------------------------------------------------------------
Input
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
Human summary
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------
Ex 2
---------------------------------------------------------------------------------------------------
Input
#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
Human summary
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

How can we use a tokenizer?

sentence = "What time is it?"
sentence_encoded = tokenizer(sentence, return_tensors='pt')

# sentence_encoded: {'input_ids': tensor([[363, 97, 19, 34, 58, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

sentence_decoded = tokenizer.decode(sentence_encoded["input_ids"][0],
                                    skip_special_tokens=True)

# sentence_decoded: What time is it?

We are ready to run inference. Let’s select an example from our dataset. (We could write any prompt ourselves, but using one of the dataset examples is more practical.) Each example has the dialogue and the human summary, so we can compare the model’s result with the actual summary.

# select an example
example_index = [40]

# get the dialogue
dialogue = dataset['test'][example_index]['dialogue']

# get the human summary
summary = dataset['test'][example_index]['summary']

Here I set the temperature to 0.7 and enable sampling:

# Configurations
generation_config = GenerationConfig(max_new_tokens=50,
                                     do_sample=True,
                                     temperature=0.7)

# Encode input
inputs_encoded = tokenizer(dialogue, return_tensors='pt')

# Model Output
model_output = model.generate(inputs_encoded["input_ids"], generation_config=generation_config)[0]

# Decode the output
output = tokenizer.decode(model_output, skip_special_tokens=True)

Let’s see the output!

Input:  ["#Person1#: What time is it, Tom?\n#Person2#: Just a minute. It's ten to nine by my watch.\n#Person1#: Is it? I had no idea it was so late. I must be off now.\n#Person2#: What's the hurry?\n#Person1#: I must catch the nine-thirty train.\n#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there."]
---------------------------------------------------------------------------------------------------
Human summary: ['#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.']
---------------------------------------------------------------------------------------------------
Model Output: #Person1#: I'm sorry, Tom. The train leaves at 10 and I'm on my way home.

:) The model output isn’t great, is it? This is zero-shot inference. We could probably make it better with few-shot examples, and that is prompt engineering. We will cover it in another post.
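If you want to experiment a bit before we wrap up, here is a minimal sketch that reuses the model, tokenizer, and inputs_encoded from above and compares a deterministic run with a more random one. The exact outputs will vary from run to run when sampling is on.

# Deterministic: no sampling, the model always picks the highest-probability token.
greedy_config = GenerationConfig(max_new_tokens=50, do_sample=False)

# More random: sampling with a higher temperature plus nucleus (top-p) filtering.
creative_config = GenerationConfig(max_new_tokens=50, do_sample=True,
                                   temperature=1.2, top_p=0.9)

for name, config in [("greedy", greedy_config), ("creative", creative_config)]:
    output_ids = model.generate(inputs_encoded["input_ids"], generation_config=config)[0]
    print(name, "->", tokenizer.decode(output_ids, skip_special_tokens=True))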

Happy learning! 🎉:)
