How Did We Get Stuck With Temperature and Frequency Penalty?

Andrew Nguonly
6 min read · Dec 28, 2023
ChatGPT 4 prompt: “Generate an image of a frustrated software developer trying to figure out how to set LLM parameters such as temperature, frequency penalty, presence penalty, etc. The image should be in a comic strip style with dialogue balloons and popping onomatopoeias.”

Top Temperature Penalty!? 🌡️😵‍💫

Temperature, top P, frequency penalty, presence penalty — does anyone actually know what the optimal values are for these LLM parameters? What difference does temperature=0.0 vs temperature=0.2 make in a response? It would take an infinite number of tests to evaluate each prompt against every combination of parameter values.
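For concreteness, here’s where these knobs actually live in a request. The following is a minimal sketch (not from the original post), assuming the OpenAI Python client’s v1-style chat completions API:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's 1+1?"}],
    temperature=0.2,        # sampling randomness: 0.0 is near-deterministic
    top_p=1.0,              # nucleus sampling cutoff
    frequency_penalty=0.0,  # penalize tokens in proportion to how often they already appeared
    presence_penalty=0.0,   # penalize tokens that appeared at all
)
print(response.choices[0].message.content)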

To make things a little more ambiguous, the interfaces defined by LLM and infrastructure providers (e.g. OpenAI, LangChain) are unopinionated about whether these parameters are meant to be fixed or dynamic across the lifecycle of an LLM application. In LangChain’s ChatOpenAI class, the temperature is set when initializing the chat model object. Should the value be fixed?

model = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.5,
)

However, it can also be made configurable using the class’s configurable_fields() method. Or is it dynamic?

model = ChatOpenAI(temperature=0).configurable_fields(
    temperature=ConfigurableField(
        id="llm_temperature",
        name="LLM Temperature",
        description="The temperature of the LLM",
    )
)
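Once the field is declared, it can be overridden per call. The following is a minimal sketch (not from the original post), assuming LangChain’s with_config(configurable=...) mechanism and the llm_temperature id defined above:

# override the configurable temperature for a single invocation
creative_model = model.with_config(configurable={"llm_temperature": 0.9})
creative_model.invoke("Write a two-line poem about rain.")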

As developers, we appreciate flexibility, but we yearn for best practices. Even for senior developers, it’s not perfectly clear how to tune a model’s configuration to get the best results. How did we end up stuck with setting these parameters?

LLMs Tuning LLMs 🎸🎵🎸

In theory, every individual prompt should have an optimal model configuration. As an experiment, I asked ChatGPT to classify prompts into 3 types, each requiring a different balance of factual accuracy and creativity depending on the context of the prompt. Each prompt could be categorized as…

  1. requiring a factually correct response
  2. requiring an open-ended creative response
  3. requiring a mix of both factual accuracy and creativity
(Screenshots: ChatGPT classifying example prompts as factual accuracy, creativity, and mixed.)
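To make the three categories concrete, here are a few hypothetical example prompts (illustrative only, not the prompts used in the experiment):

# hypothetical example prompts for each category (illustrative only)
examples = {
    "accuracy": "What year did the Apollo 11 mission land on the Moon?",
    "creativity": "Write a short story about a robot learning to paint.",
    "mix": "Explain quantum entanglement using a cooking metaphor.",
}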

The results of this experiment were the motivation for ChatDynamicParams. ChatDynamicParams is a LangChain chat model abstraction that demonstrates the ability to dynamically set a model’s temperature based on the context of the current prompt. The abstraction is powered by Ollama, a platform for running local LLMs. A local LLM is used to classify the prompt and determine an optimal temperature value. The implementation can easily be extended to support dynamically setting other LLM parameters (e.g. top P, frequency penalty, presence penalty).

ChatDynamicParams ⚡⚙️

Initializing ChatDynamicParams only requires setting the chat model that it wraps. Setting a minimum temperature (temp_min) and maximum temperature (temp_max) is optional but ensures that the final temperature value stays within the desired range.

chat_dynamic_params = ChatDynamicParams(
    model=ChatOpenAI(model="gpt-3.5-turbo"),
    temp_min=0.0,
    temp_max=1.0,
)

The ChatDynamicParams object can be used in a chain just like any other chat model object.

# specify parameter constraints for ChatDynamicParams
chat_dynamic_params = ChatDynamicParams(
    model=ChatOpenAI(model="gpt-3.5-turbo"),
    temp_max=1.0,
)

# prompts
prompt_accuracy = "What's 1+1?"

# create chat prompt
chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", f"{prompt_accuracy}"),
])

# construct chain
chain = chat_prompt | chat_dynamic_params | StrOutputParser()

# prompt models
print(chain.invoke({}))
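For contrast, a hypothetical creative prompt (not from the original post) can be sent through the same wrapped model; the local classifier should steer the temperature toward temp_max instead of temp_min:

# hypothetical creative prompt (illustrative only)
prompt_creativity = "Write a haiku about debugging at midnight."

creative_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", f"{prompt_creativity}"),
])

creative_chain = creative_prompt | chat_dynamic_params | StrOutputParser()
print(creative_chain.invoke({}))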

Ollama + Mistral 🦙🌞

The implementation of ChatDynamicParams leverages Ollama + Mistral to classify a prompt. The entire workflow of classifying the prompt and translating the classification into a temperature value is performed in a single function.

def _get_temperature(self, prompt: str) -> float:
    """Return optimal temperature based on prompt."""
    local_model_prompt = (
        "Classify the following LLM prompt by determining if it requires "
        "a factually correct response, if it requires a creative response, "
        "or if it requires a mix of both factual accuracy and creativity. "
        "Return only one of the following responses without any formatting: "
        "`accuracy`, `creativity`, or `mix`\n"
        "\n"
        f"Prompt: `{prompt}`"
    )
    response = self._local_model(local_model_prompt)

    # Retrieve the first token from the response. This is typically the
    # classification value.
    first_token = response.split()[0].lower()

    # convert classification to temperature
    if "accuracy" in first_token:
        return self.temp_min
    elif "creativity" in first_token:
        return self.temp_max
    elif "mix" in first_token:
        return (self.temp_min + self.temp_max) / 2
    else:
        # default to original model temperature
        return getattr(self.model, "temperature")

A prompt that is classified as requiring a factually correct response translates to the minimum temperature value. A prompt requiring a creative response translates to the maximum temperature value. Finally, a prompt requiring a mix of both translates to a temperature value right in the middle. Simple. If the classification comes back as anything unexpected, the model’s original temperature is returned.
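The local classifier itself (self._local_model) isn’t shown above. The following is a minimal sketch of how it might be initialized, assuming LangChain’s community Ollama wrapper and a locally pulled mistral model (both assumptions, not taken from the original implementation):

from langchain_community.llms import Ollama

# hypothetical setup: a local Mistral model served by Ollama acts as the
# prompt classifier; calling it with a string returns the raw completion
local_model = Ollama(model="mistral", temperature=0.0)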

Override _generate()

The class’s _generate() method is implemented to set the model’s temperature right before invoking the LLM.

def _generate(
    self,
    messages: List[BaseMessage],
    stop: Optional[List[str]] = None,
    run_manager: Optional[CallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> ChatResult:
    """Reset parameters of model based on messages."""
    prompt = self._get_prompt(messages)

    if hasattr(self.model, "temperature"):
        new_temp = self._get_temperature(prompt)
        logger.info(
            "Changing model temperature from "
            f"{getattr(self.model, 'temperature')} to {new_temp}"
        )
        setattr(self.model, "temperature", new_temp)

    return self.model._generate(
        messages=messages,
        stop=stop,
        run_manager=run_manager,
        **kwargs,
    )
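The _get_prompt() helper referenced above isn’t shown in the post. The following is a sketch of what it might do, assuming it simply concatenates the string contents of the incoming messages for the classifier (an assumption, not the actual implementation):

def _get_prompt(self, messages: List[BaseMessage]) -> str:
    """Sketch only: join the string contents of the messages into a
    single prompt for classification by the local model."""
    return "\n".join(
        message.content
        for message in messages
        if isinstance(message.content, str)
    )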

Design Tradeoffs 📈📉

The final design of ChatDynamicParams is a balance between several competing constraints: ease of use, extensibility, abstraction simplicity, and cost. Existing implementations for dynamically setting temperature were also considered.

Local LLM vs LLM Provider

LangChain’s Router example demonstrates how to classify a prompt based on its topic. The dirty little secret of this example is that using OpenAI’s GPT (or any other external LLM provider) for classification effectively doubles your cost ($): every prompt is processed twice. For this reason, I felt that the implementation needed to leverage a local LLM or some other effectively free approach. Using Ollama seemed like a natural fit in this scenario.

Note: Even though the cost doubling is avoided, the latency of the overall workflow is still doubled in the worst case.

Llama2 vs Mistral

In my very limited testing, Mistral seemed to outperform Llama2 in returning the expected classification responses and did so more consistently. Even the format of the responses from Mistral was more reliable.

Do your own testing. 💪

Exposing Configuration

The current interface for ChatDynamicParams is relatively limited. There’s an opportunity to expose more configuration (e.g. local model, prompts, etc) and give more control to the developer. However, I concluded that it would be counterintuitive to expose anything more complicated than temperature itself. If a developer is required to specify local_model_name and classification_prompt, then they might as well just set the temperature to begin with.

Abstract vs External Workflow

LangChain’s Router example (again) demonstrates a simple method for classifying a prompt and passing the classification results through a chain using a prompt branch. I considered a similar approach. However, two LangChain primitives make it impractical: because of prompt templates and prompt branches, the final prompt isn’t known until the entire chain is executed, so producing it in an external workflow could result in duplicative, less maintainable code. In the short term, it seemed preferable to simply abstract everything behind the chat model.

Wishes For Next Year 🎄🎁

We’re still in the early innings of LLM development. In the future, we should expect LLM configurations to become simpler, more intuitive, and more opinionated. Infrastructure ergonomics should improve and abstraction levels should be raised. Maybe there’s a world where temperature and top P simply go away completely. In the meantime, dynamically setting these parameters is worth exploring. The results from this proof-of-concept are not definitive one way or the other, but they do show that we don’t have to be stuck anymore.

References

  1. ChatDynamicParams (GitHub)
