Comparing LLMs with MLflow

Compare LLM inputs, outputs, and generation parameters with mlflow.evaluate()

Daniel Liden
9 min read · Jul 14, 2023
mlflow.evaluate() lets you compare LLMs on the same inputs, and the Artifact View in the MLflow UI provides a user-friendly way to explore the evaluation results.

Comparing models is just as important in Large Language Model Ops (LLMOps) as it is in MLOps, but the process for doing so is a little less clear. In “classical” machine learning, it usually suffices to compare models on a set of clear numerical metrics; the model with the better score wins. This is not typically the case with LLMs (though there are plenty of benchmarks that capture and quantify various aspects of model performance).

Selecting the best LLM can depend on less-tangible (or at least less-quantifiable) model characteristics. Some practitioners refer to taking “vibe checks” of a model. Models might differ in the tone or detail of their responses, or in their correctness in various niche areas.

A fairly simple requirement of any LLMOps platform, then, is the ability to straightforwardly compare the outputs of different models on the same prompts. This is one of the features enabled by using MLflow for LLMOps.

In this post, we’ll walk through the process of comparing a few small (<1B parameter) open-source text-generation models with one of MLflow’s core LLMOps capabilities, the mlflow.evaluate() function. Using small models makes it easier to work through the examples without worrying about provisioning sufficient cloud resources. Note, however, that the outputs from these small models aren’t always very coherent or relevant.

You can find all of the example code in this notebook.

The Models

We’ll compare some of the most-downloaded text-generation models on Hugging Face with fewer than 1 billion parameters: gpt2-large (774M parameters), bloom-560m (560M parameters), and distilgpt2 (82M parameters).

Defining the Models

First, we’ll install the required Python packages (in this case, in a fresh virtual environment).

pip install transformers accelerate torch mlflow xformers

Then we’ll set up each of the models as a 🤗 transformers pipeline wrapped in a pyfunc-compatible model wrapper. This step is necessary because mlflow.evaluate, the primary MLflow LLMOps tool we’ll be using, expects a pyfunc model (or a URI referring to one) as its model argument. At this stage, we’ll also pass some generation configuration parameters to the pyfunc models, which apply them to the underlying transformers pipelines to control the number of new tokens generated, whether to use sampling, sampling parameters, and more. We can also pass some examples for few-shot prompting when setting up each model.

import mlflow
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    GenerationConfig,
)


class PyfuncTransformer(mlflow.pyfunc.PythonModel):
    """PyfuncTransformer is a class that extends the mlflow.pyfunc.PythonModel class
    and is used to create a custom MLflow model for text generation using Transformers.
    """

    def __init__(self, model_name, gen_config_dict=None, examples=""):
        """
        Initializes a new instance of the PyfuncTransformer class.

        Args:
            model_name (str): The name of the pre-trained Transformer model to use.
            gen_config_dict (dict): A dictionary of generation configuration parameters.
            examples: examples for few-shot prompting, prepended to the input.
        """
        self.model_name = model_name
        self.gen_config_dict = (
            gen_config_dict if gen_config_dict is not None else {}
        )
        self.examples = examples
        super().__init__()

    def load_context(self, context):
        """
        Loads the model and tokenizer using the specified model_name.

        Args:
            context: The MLflow context.
        """
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            device_map="cpu",  # or device_map="auto" to use a GPU if available
        )

        # Create a custom GenerationConfig
        gcfg = GenerationConfig.from_model_config(model.config)
        for key, value in self.gen_config_dict.items():
            if hasattr(gcfg, key):
                setattr(gcfg, key, value)

        # Apply the GenerationConfig to the model's config
        model.config.update(gcfg.to_dict())

        self.model = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            return_full_text=False,
        )

    def predict(self, context, model_input):
        """
        Generates text based on the provided model_input using the loaded model.

        Args:
            context: The MLflow context.
            model_input: The input used for generating the text.

        Returns:
            list: A list of generated texts.
        """
        if isinstance(model_input, pd.DataFrame):
            model_input = model_input.values.flatten().tolist()
        elif not isinstance(model_input, list):
            model_input = [model_input]

        generated_text = []
        for input_text in model_input:
            output = self.model(
                self.examples + input_text, return_full_text=False
            )
            generated_text.append(output[0]["generated_text"])

        return generated_text

Now we can instantiate the models:

gcfg = {
    "max_length": 180,
    "max_new_tokens": 10,
    "do_sample": False,
}

example = (
    "Q: Are elephants larger than mice?\nA: Yes.\n\n"
    "Q: Are mice carnivorous?\nA: No, mice are typically omnivores.\n\n"
    "Q: What is the average lifespan of an elephant?\nA: The average lifespan of an elephant in the wild is about 60 to 70 years.\n\n"
    "Q: Is Mount Everest the highest mountain in the world?\nA: Yes.\n\n"
    "Q: Which city is known as the 'City of Love'?\nA: Paris is often referred to as the 'City of Love'.\n\n"
    "Q: What is the capital of Australia?\nA: The capital of Australia is Canberra.\n\n"
    "Q: Who wrote the novel '1984'?\nA: The novel '1984' was written by George Orwell.\n\n"
)

bloom560 = PyfuncTransformer(
    "bigscience/bloom-560m",
    gen_config_dict=gcfg,
    examples=example,
)
gpt2large = PyfuncTransformer(
    "gpt2-large",
    gen_config_dict=gcfg,
    examples=example,
)
distilgpt2 = PyfuncTransformer(
    "distilgpt2",
    gen_config_dict=gcfg,
    examples=example,
)

Logging the Models in MLflow

Next, we log the models in MLflow. Model logging in MLflow is essentially version control for machine learning models: it supports reproducibility by recording model and environment details. Model logging also allows us to track (and compare) model versions, so we can refer back to older models and see what effects changes to the models have.

To log the models we defined above, we’ll work through the following steps:

  1. Set up an MLflow experiment. Experiments are useful for organizing groups of related runs. In this case, since we’re directly comparing a group of models, it makes sense to include all of those models under the same experiment.
  2. Log each model in a separate MLflow run. Make sure to record the run IDs and artifact paths, as we’ll need them for comparing the models later.

mlflow.set_experiment(experiment_name="compare_small_models")
run_ids = []
artifact_paths = []
model_names = ["bloom560", "gpt2large", "distilgpt2"]

for model, name in zip([bloom560, gpt2large, distilgpt2], model_names):
    with mlflow.start_run(run_name=f"log_model_{name}"):
        pyfunc_model = model
        artifact_path = f"models/{name}"
        mlflow.pyfunc.log_model(
            artifact_path=artifact_path,
            python_model=pyfunc_model,
            input_example="Q: What color is the sky?\nA:",
        )
        run_ids.append(mlflow.active_run().info.run_id)
        artifact_paths.append(artifact_path)

The code above creates an experiment called compare_small_models and logs each model in its own MLflow run. At this point, you can inspect the models in the MLflow UI, which you can start with the `mlflow ui` command.
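
If you’re working locally, you can launch the UI from the same environment (this assumes the default local ./mlruns tracking directory; adjust the tracking/backend store settings if you’ve configured them differently):

mlflow ui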

We now have easy access to our models and their environment details within MLflow, and we can compare these models with mlflow.evaluate().

Comparing LLMs with MLflow

With the models logged in MLflow, we can proceed with the comparison. First, we need an evaluation dataset made up of the inputs on which we want to compare these models. We’ll define a Pandas DataFrame with these inputs (the prompts we’ll send to each of the language models).

At this point, it’s worth thinking about what we actually want to compare in each of these cases. Are we more interested in which model(s) get the answers right? Or in general coherency, maybe to get a sense of which model provides the best starting point for fine-tuning or domain adaptation? We’ll form our evaluation dataset as a series of questions across a few different domains, with the aim of seeing which model returns the most coherent answers.

eval_df = pd.DataFrame(
    {
        "question": [
            "Q: What color is the sky?\nA:",
            "Q: Are trees plants or animals?\nA:",
            "Q: What is 2+2?\nA:",
            "Q: Who is Darth Vader?\nA:",
            "Q: What is your favorite color?\nA:",
        ]
    }
)
print(eval_df)

Now that we’ve logged our models and set up our dataset, it’s time to evaluate! We will again loop through the models. This time, we’ll re-open the run to which each model was logged and evaluate the model in the same run. This approach is not absolutely required, but it works well organizationally and ensures all of the elements appear in the UI in a clean format.

for i in range(3):
    # Reopen the run with the stored run ID
    with mlflow.start_run(run_id=run_ids[i]):
        evaluation_results = mlflow.evaluate(
            model=f"runs:/{run_ids[i]}/{artifact_paths[i]}",
            model_type="text",
            data=eval_df,
        )

Now we can directly compare the different models’ outputs. We can load the results as a Pandas DataFrame with mlflow.load_table('eval_results_table.json') or see the results in the UI by navigating to the Artifact View.
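
As a minimal sketch, here’s one way to pull the logged tables into a single DataFrame for a side-by-side look. This assumes the run_ids list from the logging loop above is still available, and that you’re on a recent MLflow version that includes mlflow.load_table:

all_results = mlflow.load_table(
    "eval_results_table.json",
    run_ids=run_ids,  # limit the search to our three evaluation runs
    extra_columns=["run_id"],  # record which run produced each row
)
print(all_results.head())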

Remember: these are very small models (relatively speaking), so the outputs aren’t always the best! But the Artifact View in MLflow makes it easy to look at the results of the model comparison.

Recording Generation Parameters with mlflow.evaluate()

There are many different ways to use evaluation in MLflow for LLMOps. We can modify the approach above to accept generation configuration parameters at inference time, so we can compare the same inputs under different generation configurations and track those configurations in the evaluation table. Doing so just requires that we change the structure of the pyfunc model inputs. Instead of accepting only a prompt, we’ll modify the model to accept a prompt, a set of examples (so we can try out different few-shot prompting examples), and a set of generation configuration parameters.

import json


class PyfuncTransformerWithParams(mlflow.pyfunc.PythonModel):
    """PyfuncTransformerWithParams is a class that extends the mlflow.pyfunc.PythonModel
    class and is used to create a custom MLflow model for text generation using
    Transformers, with generation parameters supplied at inference time.
    """

    def __init__(self, model_name):
        """
        Initializes a new instance of the PyfuncTransformerWithParams class.

        Args:
            model_name (str): The name of the pre-trained Transformer model to use.
        """
        self.model_name = model_name
        super().__init__()

    def load_context(self, context):
        """
        Loads the model and tokenizer using the specified model_name.

        Args:
            context: The MLflow context.
        """
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModelForCausalLM.from_pretrained(
            self.model_name, device_map="auto"
        )

        self.model = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            return_full_text=False,
        )

    def predict(self, context, model_input):
        """
        Generates text based on the provided model_input using the loaded model.

        Args:
            context: The MLflow context.
            model_input: The input used for generating the text. Each record contains
                the input text, few-shot examples, and a JSON-encoded generation config.

        Returns:
            list: A list of generated texts.
        """
        if isinstance(model_input, pd.DataFrame):
            model_input = model_input.to_dict(orient="records")
        elif not isinstance(model_input, list):
            model_input = [model_input]

        generated_text = []
        for record in model_input:
            input_text = record["input_text"]
            few_shot_examples = record["few_shot_examples"]
            config_dict = record["config_dict"]

            # Update the GenerationConfig attributes with the provided config_dict
            gcfg = GenerationConfig.from_model_config(self.model.model.config)
            for key, value in json.loads(config_dict).items():
                if hasattr(gcfg, key):
                    setattr(gcfg, key, value)

            output = self.model(
                few_shot_examples + input_text,
                generation_config=gcfg,
                return_full_text=False,
            )
            generated_text.append(output[0]["generated_text"])

        return generated_text

Then, in a new experiment, we’ll consider just one of the models with a range of different generation parameters and examples. Here’s the data we’ll be using:

few_shot_examples_1 = (
    "Q: Are elephants larger than mice?\nA: Yes.\n\n"
    "Q: Are mice carnivorous?\nA: No, mice are typically omnivores.\n\n"
    "Q: What is the average lifespan of an elephant?\nA: The average lifespan of an elephant in the wild is about 60 to 70 years.\n\n"
)

few_shot_examples_2 = (
    "Q: Is Mount Everest the highest mountain in the world?\nA: Yes.\n\n"
    "Q: Which city is known as the 'City of Love'?\nA: Paris is often referred to as the 'City of Love'.\n\n"
    "Q: What is the capital of Australia?\nA: The capital of Australia is Canberra.\n\n"
    "Q: Who wrote the novel '1984'?\nA: The novel '1984' was written by George Orwell.\n\n"
)

config_dict1 = {
    "do_sample": True,
    "top_k": 10,
    "max_length": 180,
    "max_new_tokens": 10,
}
config_dict2 = {"do_sample": False, "max_length": 180, "max_new_tokens": 10}

few_shot_examples = [few_shot_examples_1, few_shot_examples_2]
config_dicts = [config_dict1, config_dict2]

questions = [
    "Q: What color is the sky?\nA:",
    "Q: Are trees plants or animals?\nA:",
    "Q: What is 2+2?\nA:",
    "Q: Who is Darth Vader?\nA:",
    "Q: What is your favorite color?\nA:",
]

data = {
    "input_text": questions * len(few_shot_examples),
    "few_shot_examples": [
        example for example in few_shot_examples for _ in range(len(questions))
    ],
    "config_dict": [
        json.dumps(config) for config in config_dicts for _ in range(len(questions))
    ],
}

eval_df = pd.DataFrame(data)

We can then evaluate our model on this dataset using the same process as before: defining the model, logging the model, and then running mlflow.evaluate. In this case, because the dataset includes more model inputs, we’ll see more fields in the evaluation results.

# Define the pyfunc model
bloom560_with_params = PyfuncTransformerWithParams(
    "bigscience/bloom-560m",
)

mlflow.set_experiment(experiment_name="compare_generation_params")
model_name = "bloom560"

with mlflow.start_run(run_name=f"log_model_{model_name}"):
    # Define an input example (one row; values wrapped in lists)
    input_example = pd.DataFrame(
        {
            "input_text": ["Q: What color is the sky?\nA:"],
            # 'example' (defined earlier) contains the few-shot prompts
            "few_shot_examples": [example],
            # JSON-encoded empty dict: no generation-parameter overrides
            "config_dict": [json.dumps({})],
        }
    )

    # Define the artifact_path
    artifact_path = f"models/{model_name}"

    # Log the evaluation data as an MLflow Dataset
    eval_data = mlflow.data.from_pandas(eval_df, name="evaluate_configurations")

    # Log the model
    mod = mlflow.pyfunc.log_model(
        artifact_path=artifact_path,
        python_model=bloom560_with_params,
        input_example=input_example,
    )

    # Define the model_uri
    model_uri = f"runs:/{mlflow.active_run().info.run_id}/{artifact_path}"

    # Evaluate the model
    mlflow.evaluate(model=model_uri, model_type="text", data=eval_data)

You’ll notice one other change we made this time around: we saved the evaluation data as an MLflow Dataset with eval_data = mlflow.data.from_pandas(eval_df, name="evaluate_configurations") and then referred to this dataset in our evaluate() call, explicitly associating the dataset with the evaluation. We can retrieve the dataset information from the run in the future if needed, ensuring that we don’t lose track of the data used in the evaluation.
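
Here’s a rough sketch of looking that dataset information up later. It assumes you captured the run ID (e.g., via mlflow.active_run().info.run_id inside the with block) in a hypothetical variable called logged_run_id, and that your MLflow version records evaluation datasets as run inputs:

run = mlflow.get_run(logged_run_id)  # logged_run_id: run ID saved earlier (assumed)
for dataset_input in run.inputs.dataset_inputs:
    ds = dataset_input.dataset
    print(ds.name, ds.digest, ds.source_type)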

In this version, we passed some additional parameters (few-shot examples and a generation config) to the model at inference time and tracked them in our evaluation. Additionally, notice that the dataset, `evaluate_configurations`, is now linked to the evaluation.

Conclusion

This post walked you through the basics of comparing LLM outputs with mlflow.evaluate(..., model_type="text"). mlflow.evaluate provides a powerful framework for comparing LLMs and LLM configurations, and for keeping track of those experiments over time. It helps to make prompt engineering and LLM selection less haphazard and more evidence-based.

Future posts will expand on the functionality of mlflow.evaluate, including the ability to compute custom metrics during the evaluation process; to incorporate language models from other sources (including OpenAI) in the comparisons; and to better manage evaluation datasets with dataset tracking.
