Is a Zero Temperature Deterministic?

Karl Weinmeister
Google Cloud - Community
5 min read · Jul 17, 2024

Learn more about a crucial LLM model parameter, and how to configure it on Gemini Pro with Vertex AI

Temperature is one of the most important parameters of large language models (LLMs). It directly influences the creativity of generated text.

While higher temperatures encourage more diverse outputs, lower temperatures lead to predictable results. What happens at the extreme end, when the temperature is set to zero? Let’s explore this question.

Generating output tokens

LLMs generate text by predicting which token is likely to come next in a sequence. The model first expresses these predictions as logits, which are raw scores assigned to each potential token.

Let’s say we have the following phrase as our starting point: “I am hungry for”. Our goal is to predict the next token. The model outputs the following tokens and logits:

Top tokens and logits produced by model output

To make the article more readable, I’m using full words. In practice, you would see tokens like “dump” instead of “dumpling,” as they generally average around 4 characters.

Transforming logits into probabilities

Logits are relative confidence scores, and you can’t use them directly as probabilities.

To transform these logits into interpretable probabilities, we use the softmax function. This function normalizes the logits, ensuring they sum to 1 and can be interpreted as probabilities:

Softmax function: apply the exponential function, normalized by the sum of all exponentials
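
In symbols, the function described in this caption can be written as follows. The notation is assumed here for illustration: z_i is the logit of token i, K is the number of candidate tokens, and p_i is the resulting probability.

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```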

Softmax applies the exponential function to each logit, and normalizes by the sum of the exponentials. The final result is that each logit is mapped to a probability in the range [0,1], with all probabilities summing to 1. Let’s compare before and after softmax is applied:

Token outputs before and after softmax is applied
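
As a rough sketch of the same computation in code, here is a small softmax in plain Python. The token list and logit values are hypothetical and chosen only for illustration; they are not the numbers from the figure.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow and leaves the result unchanged.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tokens and logits (assumed values, not taken from the article).
tokens = ["pizza", "tacos", "adventure", "dumplings"]
logits = [2.0, 1.5, 0.8, 0.2]

probs = softmax(logits)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:>10}  logit={logit:4.1f}  probability={prob:.3f}")
```

The ordering of the tokens is preserved: the largest logit always receives the largest probability.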

Introduction to sampling

Once we have applied softmax, the next step is to “sample” or select from the probability distribution.

Let’s take a closer look at the distribution, and how 4 different popular sampling approaches would work:

Token probabilities
  • Greedy sampling simply selects the highest-probability token. In our example, it would choose “pizza,” which has the highest probability at 0.47.
  • Temperature sampling uses the probabilities after applying a temperature scaling factor. Let’s say we select a high temperature. In this case, “tacos” or “adventure” would have a higher chance than their original probabilities suggest. A high temperature makes the sampling more random, so less likely options become more possible.
  • Top-k sampling constrains the choices to the top k most probable tokens. If k=2, then the choice would be between “pizza” and “tacos.”
  • Top-p (or nucleus) sampling constrains the choices to the most likely tokens within a given cumulative probability. For example, when set to 0.9, the model would sample from “pizza,” “tacos,” and “adventure.”
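
The greedy, top-k, and top-p strategies above could be sketched as follows. Only the 0.47 probability for “pizza” comes from the article; the other probabilities and all function names are assumptions, picked so that the examples in the list hold. Temperature sampling is left out of this sketch.

```python
import random

# Hypothetical distribution: only "pizza" = 0.47 is from the article;
# the remaining values are assumed for illustration.
probs = {"pizza": 0.47, "tacos": 0.30, "adventure": 0.15, "dumplings": 0.08}

def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def top_k_candidates(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = [], 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

def sample(probs, candidates, rng):
    """Draw one token from the candidates, weighted by their probabilities."""
    weights = [probs[token] for token in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random()
print("greedy:", greedy(probs))
print("top-k (k=2):", top_k_candidates(probs, 2))
print("top-p (p=0.9):", top_p_candidates(probs, 0.9))
print("one top-p draw:", sample(probs, top_p_candidates(probs, 0.9), rng))
```

Note that top-k and top-p only narrow the candidate set; a random draw still happens afterwards, whereas greedy selection involves no randomness at all.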

A closer look at temperature

You can think about temperature as a generalization of softmax. We can apply a scaling factor, represented by T below, to the softmax function.

Temperature is a scaling function applied to softmax inputs
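
Written out, the scaled version divides every logit by T before the exponential is applied. The notation is assumed here for illustration: z_i is the logit of token i and K is the number of candidate tokens.

```latex
p_i(T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}, \qquad T > 0
```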

When T=1, the result is exactly the same as the standard softmax function.

Softmax and a temperature of 1 are the same

Now, let’s try something more interesting. By setting a lower temperature, e.g. T=0.5, we can sharpen the distribution. This will favor tokens with already high probabilities, leading to more predictable outputs.

A smaller temperature results in a sharper distribution
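
To see the sharpening effect numerically, here is a minimal sketch of softmax with a temperature parameter. The logits are hypothetical values chosen for illustration, and the function name is an assumption rather than part of any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (assumed values, not taken from the article).
logits = [2.0, 1.5, 0.8, 0.2]

results = {t: softmax_with_temperature(logits, t) for t in (2.0, 1.0, 0.5)}
for t, probs in results.items():
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Lowering T from 2.0 to 0.5 moves probability mass toward the token with the largest logit, while raising T flattens the distribution.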

When temperature is set to zero

You might notice that when T=0, we’d divide by 0! So, this scenario is actually a special case:

When the temperature is 0, all values are 0 except for the maximum value, which is set to 1

The good news is that, as the temperature approaches 0, we see the maximum value approach 1, and the others approach 0.

The probability of “pizza” approaches 1 as temperature approaches 0

In essence, setting temperature to 0 is an all-or-nothing proposition. The tokens and their probabilities in our example would be:

Probability distribution when Temperature = 0

There is only one choice to sample from, and therefore a temperature of 0 should be deterministic.
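
The limiting behavior can be checked with a short sketch. The logits are hypothetical, the function name is an assumption, and treating T=0 as “pick the first maximum” is one possible convention for the special case, not necessarily what any particular model server does.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax with temperature, treating T=0 as a one-hot argmax."""
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    if temperature == 0:
        # Special case: all probability goes to the largest logit.
        # Ties go to the first maximum here (an assumed convention).
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (assumed values, not taken from the article).
logits = [2.0, 1.5, 0.8, 0.2]

for t in (1.0, 0.5, 0.1, 0.01, 0.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"T={t}: probability of the top token = {top:.6f}")
```

As T shrinks, the top token’s probability climbs toward 1, and the explicit T=0 branch avoids the division by zero.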

The reality: more determinism

In practice, the answer is more nuanced. A temperature of 0 leads to more determinism, but not necessarily full determinism. Let’s look at a few of the reasons why.

Computers use finite precision to represent numbers and operations, known as floating-point arithmetic. This can lead to rounding errors that can cascade through calculations.

These numerical deviations can be influenced by the hardware (CPU/GPU) that the model is hosted on. On top of that, data and task parallelism may be implemented differently, potentially impacting results. For a more detailed treatment, see this helpful 5-minute video, Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks.
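
A tiny example shows why the order of operations matters. Floating-point addition is not associative, so summing the same three numbers in a different grouping can give a different result. Parallel hardware may group a sum differently from one run to the next, which is one way such deviations can arise; the snippet below only illustrates the arithmetic, not any specific framework.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A difference this small is usually harmless, but it can change which of two nearly equal logits ends up being the largest.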

Likewise, the software stack can also be a factor, introducing the possibility of optimizations and randomness outside of sampling. For example, let’s say two tokens have identical probabilities. The model needs a mechanism to break the tie, which could be non-deterministic. Individual frameworks may provide guidance on how to improve determinism, such as this PyTorch reproducibility guide.
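
The tie-breaking point can be illustrated with a short sketch. Both functions and the logit values are hypothetical; they contrast a fixed tie-break rule with a random one and are not taken from any framework.

```python
import random

def argmax_first(logits):
    """Deterministic tie-break: always return the lowest index among the maxima."""
    return logits.index(max(logits))

def argmax_random(logits, rng):
    """Non-deterministic tie-break: pick one of the maxima at random."""
    best = max(logits)
    candidates = [i for i, z in enumerate(logits) if z == best]
    return rng.choice(candidates)

# Hypothetical logits where the first two tokens are exactly tied.
tokens = ["pizza", "tacos", "adventure"]
logits = [2.0, 2.0, 0.8]

rng = random.Random()  # unseeded, so results can vary between runs
print("first-index rule:", tokens[argmax_first(logits)])
print("random rule:", [tokens[argmax_random(logits, rng)] for _ in range(5)])
```

With the first-index rule the output is always “pizza,” while the random rule can return either tied token, even though no temperature-based sampling is involved.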

Configuring temperature in Google Cloud Vertex AI

Despite the potential for variations, setting temperature to 0 is still the first step in increasing deterministic behavior in LLMs.

Here’s how to configure temperature in Google Cloud Vertex AI:

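import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Placeholder: replace with your own Google Cloud project ID.
project_id = "your-project-id"
vertexai.init(project=project_id, location="us-central1")

model = GenerativeModel(model_name="gemini-1.5-pro-001")

# The prompt is passed as the first (contents) argument.
response = model.generate_content(
    "When is the next total solar eclipse in US?",
    generation_config=GenerationConfig(temperature=0.0),
)
print(response.text)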

By explicitly setting temperature=0.0 within the GenerationConfig, you instruct the model to aim for maximal determinism.

For developers and data scientists working with LLMs, understanding the nuances of temperature is crucial to achieving the desired results. By carefully tuning this parameter, you can control the balance between creativity and determinism.

For more information, see the Vertex AI Generative AI documentation, and try out use cases with the hundreds of notebooks in the Generative AI repository.

Published in Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Karl Weinmeister

Head of Product Developer Relations, Google Cloud
