Understanding LLM Parameters in Layman Terms

Sakshi
4 min read · Sep 19, 2024

Temperature, Top-P, Top-K

Introduction

When you use a Large Language Model (LLM) to generate text, there are a few settings you can adjust to control how the model behaves. The three most important ones are Temperature, Top-P, and Top-K. These settings help you balance how creative and unpredictable the model’s output is. Let’s break down each one in simple terms.
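If you want to see where these knobs actually live, here is a minimal sketch using the Hugging Face transformers library. The model name ("gpt2") and the values are just placeholders for illustration; most LLM APIs expose similarly named parameters. Don't worry if the names mean nothing yet, the rest of the article unpacks them one by one.

```python
# Minimal sketch: passing temperature, top_p, and top_k to a text-generation call.
# "gpt2" is only a small example model; any causal LLM from the Hub works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample from the probability distribution instead of always picking the top word
    temperature=0.7,   # the creativity dial
    top_p=0.9,         # nucleus sampling: keep the most likely words covering 90% of the probability
    top_k=50,          # never consider more than the 50 most likely words
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```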

Temperature

Think of temperature as the creativity level of a story.

Temperature is like a dial that controls how “random” the model’s output is.

If you set it to a high number, the model gets more creative and comes up with unexpected words and phrases; set it too high and the output can become nonsensical. If you set it to a low number, the model plays it safe and sticks to the most common, predictable words. Under the hood, temperature rescales the model's word scores before they are turned into probabilities: low values sharpen the distribution toward the likeliest words, while high values flatten it so less likely words get a real chance. The sweet spot is usually somewhere in the middle, around 0.7 to 1.0. (A small code sketch follows the examples below.)

Example:

  • Low Temperature (e.g., 0.2): Imagine you’re telling a bedtime story to a child. You stick to classic fairy tales and familiar plots. The story is predictable and safe.
    Example: “Once upon a time, a princess lived in a castle…”
  • High Temperature (e.g., 0.8): Now, imagine you’re improvising a wild story. You throw in unexpected twists and unusual characters. It becomes more exciting but less structured.
    Example: “Once upon a time, a dragon fell in love with a toaster and together they baked the world’s largest pancake…”
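Here is what that dial does under the hood, in a small, simplified sketch. The vocabulary and scores are made up purely for illustration:

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7):
    """Divide the model's word scores by the temperature, then sample a word."""
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - np.max(scaled))   # softmax (max subtracted for numerical stability)
    probs /= probs.sum()
    # Low temperature -> sharp distribution, the favourite word wins almost every time.
    # High temperature -> flatter distribution, unlikely words get a real chance.
    return np.random.choice(len(probs), p=probs)

# A toy four-word vocabulary with made-up scores.
vocab = ["princess", "castle", "dragon", "toaster"]
logits = [4.0, 3.0, 1.0, 0.2]

print(vocab[sample_with_temperature(logits, temperature=0.2)])  # almost always "princess"
print(vocab[sample_with_temperature(logits, temperature=1.5)])  # sometimes "dragon" or even "toaster"
```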

Top-P (Nucleus Sampling)

Think of top-p as selecting from a variety of options based on popularity.

Top-P is like a filter that limits the number of possible words the model can choose from based on probability.

It looks at the probability of each candidate word and keeps only the smallest group of most-likely words whose probabilities add up to the Top-P value. For example, if you set Top-P to 0.9, the model only chooses among the words that together cover the top 90% of the probability distribution and ignores the long tail of unlikely words. This helps the model stay focused on the most relevant and coherent words. (There is a short code sketch after the examples below.)

Example:

  • Low Top-p (e.g., 0.2): Imagine you’re at a concert where only the top 20% of songs (the most popular hits) are played. You get a familiar experience, but it might be a bit boring because you know what to expect.
    Example: The model chooses from the most likely words, resulting in safe and standard sentences.
  • High Top-p (e.g., 0.9): Now imagine a concert where the setlist covers 90% of what the crowd might ask for: the popular hits plus some deeper cuts. You get a mix of familiar tunes and surprising new favorites, making the experience more exciting.
    Example: The model selects from a wider range of words, leading to more diverse and interesting sentences.
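In code, Top-P boils down to sorting the candidate words by probability, keeping the smallest group that covers the Top-P share, and renormalising. A simplified sketch, with made-up probabilities:

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of words whose probabilities add up to at least top_p."""
    order = np.argsort(probs)[::-1]                    # most likely words first
    cumulative = np.cumsum(probs[order])
    keep_count = np.searchsorted(cumulative, top_p) + 1
    keep = order[:keep_count]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                   # renormalise over the surviving words

# Toy next-word probabilities for five candidate words.
probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
print(top_p_filter(probs, top_p=0.9))   # only the first three survive (0.50 + 0.30 + 0.15 covers 90%)
```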

Top-K

Think of top-k as picking from a top 5 playlist of your favorite songs.

Top-K is similar to Top-P, but instead of a percentage, it’s a fixed number.

It tells the model to only consider the K most likely words at each step. For example, if you set Top-K to 50, the model will only look at the 50 most probable words. This helps the model stay on track and avoid generating completely random or irrelevant words. Many people prefer Top-P over Top-K, though, because Top-P adapts to the shape of the probability distribution instead of using a fixed cut-off. (A code sketch follows the examples below.)

Example:

  • Low Top-k (e.g., 5): You’re only allowed to pick from your top 5 favorite songs. This keeps your music choice safe and familiar.
    Example: The model only uses the five most likely words at each step, resulting in straightforward and repetitive sentences.

  • High Top-k (e.g., 50): You have access to a much larger playlist. You can choose from 50 songs, including some hidden gems and unique tracks.
    Example: The model can explore a wider range of words, resulting in more varied and colorful sentences.
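Top-K is even simpler to sketch: keep the K most likely words, zero out the rest, and renormalise. Again, the probabilities below are made up for illustration:

```python
import numpy as np

def top_k_filter(probs, top_k=3):
    """Keep only the top_k most likely words and renormalise."""
    keep = np.argsort(probs)[::-1][:top_k]   # indices of the k highest-probability words
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy next-word probabilities for seven candidate words.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
print(top_k_filter(probs, top_k=3))   # only the three most likely words keep any probability
```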
