Some best practices in LLM-based solution development using MLflow

How to track and compare LLM artifacts

An Truong
TotalEnergies Digital Factory
9 min read · Sep 19, 2023


Image from Shutterstock

LLMs are undoubtedly powerful, fun and exciting. However, because developing an LLM-based solution is iterative and empirical by nature, one can quickly drown in a swarm of models, configurations, datasets, etc. It is therefore important to have a system that keeps track of the development process: the why, what and how of each experiment need to be documented and stored in a suitable format for review and reproducibility. In principle, one would like a system that can:

  • Track the configuration of each run (e.g., hyperparameters, data version, etc.)
  • Track the performance of each run (validation metrics, etc.)
  • Track the datasets used for each run (train, test, validation, etc.)
  • Track the code used for each run, to ensure reproducibility.
  • Compare the performance or configuration of different runs, and compare them against human evaluation.
  • Track model versions, especially for large or very large models, without wasting storage space or degrading performance.
Fig 1: Important features for robust LLM development (Photo by Author)

For classical ML, most of these features can be found in popular machine learning platforms such as AzureML, SageMaker, Neptune, etc. For tracking LLM development, MLflow, a popular open source ML project, stands out as one of the best options. Thanks to very active contributions to the MLflow project, many helpful functionalities for LLM development are publicly available. Yet, at the time of writing, a comprehensive guide to ease their adoption is lacking. This post aims to fill that gap.

Simple demo app

Let’s check out some recent best practices by iteratively building a very simple LLM-based application. Everyone likes good songs, so why not build an app on that topic? Our application will be a “song detector” that suggests a song’s name given some known lyrics.

Fig 2: Simple `Song Detector` demo App (Photo by Author)

The focus of the following sections is not how to build the app but how to properly track the development process. The examples in this guide are written for the Databricks platform, but the principles and most of the code can easily be adapted to vanilla MLflow or to other platforms. The sections are organized as follows:

  • Minimal setup for authentication
  • Simple model creation
  • Log models and configurations
  • Log multiple runs or evaluations
  • Compare runs
  • Compare to human evaluation
  • Log datasets

Minimal setup for authentication

First, we need to obtain the openai_api_key and openai_url from the OpenAI instance.

It is advised to create and store those secrets in a key vault. For the sake of simplicity, and without compromising security, we will create the secrets and the secret scope directly with the Databricks CLI. If you don’t have the Databricks CLI installed, please check the official installation guide. The code below has been tested with Databricks CLI v0.200.2; minor syntax modifications might be required for earlier versions.

1. Create a secret scope: databricks secrets create-scope openai <-p your_databricks_profile>

2. Create a secret: databricks secrets put-secret openai openai_api_key <-p your_databricks_profile>

3. Enter the secret value (for openai_api_key): xxxxxxx

4. Repeat steps 2 and 3 for the openai_url secret.
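
Once the scope and secrets exist, they can be read back from a Databricks notebook with dbutils.secrets.get(), which is exactly what the evaluation code later in this guide does:

# Retrieve the secrets created above from a Databricks notebook
openai_api_key = dbutils.secrets.get(scope="openai", key="openai_api_key")
openai_url = dbutils.secrets.get(scope="openai", key="openai_url")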

If your company has a private OpenAI instance, you should use that instance’s openai_url and openai_api_key to keep your data secure.

For local development, we can create a .env file on the local machine to store the secrets. The content of the .env file should be as follows:

OPENAI_API_KEY=xxxxx
OPENAI_API_URL=xxxxx

Note: for VSCode users, depending on your version, you might need to write the variables with an export prefix in the .env file, as shown below.
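
For example, the export-prefixed variant of the same file would look like:

export OPENAI_API_KEY=xxxxx
export OPENAI_API_URL=xxxxx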

Model initiation

In the backend, we can create a simple model based on the OpenAI API. This model will return the name and the band of a song given part of its lyrics.

import openai


def find_song(lyrics: str):
    response = openai.ChatCompletion.create(
        engine="gpt-35",  # replace this value with the deployment name you chose when you deployed the associated model
        messages=[
            {"role": "system", "content": "You are an expert on english song, you will answer only the song's name and nothing else, and say sorry if the lyrics is not existed."},
            {"role": "user", "content": f"`{lyrics}` is lyrics of the song: "},
        ],
        temperature=0,
        max_tokens=3,  # don't be alarmed by this value, it's for demo purposes only
    )

    return response["choices"][0]["message"]["content"]

We can locally test the function:

import os
import openai

from dotenv import load_dotenv, find_dotenv
from backend import find_song

load_dotenv(find_dotenv())


openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_URL")
openai.api_key = os.getenv("OPENAI_API_KEY")

lyrics = "Hey Jude, don't make it bad. Take a sad song and make it better."
print("Lyrics: ", lyrics)
print("Answer: ", find_song(lyrics))

The output is:

Not too bad, but the result is not completely correct. With some tuning, we can improve the performance of the function.

You might have already noticed that I deliberately introduced some problems into the model’s parameters, so it is not a big deal to improve the model’s performance here. In real projects, the iterations toward a working solution are a long journey with a lot of trial and error. Thus, a more relevant question is: how do we track the development process? We will look at this using the built-in MLflow functionalities in Databricks.

Log models and configurations

To log a model, we can use the mlflow.pyfunc.log_model() function as follows:

import mlflow

experiment_name = "/Shared/llm-song-finder/"
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        python_model=find_song,
        artifact_path="model",
        pip_requirements=["openai"],
    )
Fig 3: Example of a logged model

The model is logged under the model artifact path and can be loaded later using the mlflow.pyfunc.load_model() function.

model = mlflow.pyfunc.load_model(model_info.model_uri)
lyrics = "Hey Jude, don't make it bad. Take a sad song and make it better."
print("Lyrics: ", lyrics)
print("Answer: ", model.predict(lyrics))

Log multiple runs or evaluations

As our model still needs to be improved, we will try different combinations of parameters to see which one works best. For the sake of simplicity, only a few combinations will be evaluated. To keep track of these trials, we will log each run in MLflow.

First, we would need to modify our find_song() function to take the parameters as input.

import openai

openai.api_key = dbutils.secrets.get(scope="openai", key="openai_api_key")
openai.api_base = dbutils.secrets.get(scope="openai", key="openai_url")
openai.api_type = "azure"
openai.api_version = "2023-05-15"  # this may change in the future


def find_song(
    lyrics: str,
    temperature: float = 0.5,
    max_tokens: int = 256,
):
    response = openai.ChatCompletion.create(
        engine="gpt-35",  # replace this value with the deployment name you chose when you deployed the associated model
        messages=[
            {"role": "system", "content": "You are an expert on english song, you will answer only the song's name and nothing else, and say sorry if the lyrics is not existed."},
            {"role": "user", "content": f"`{lyrics}` is lyrics of the song: "},
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )

    return response["choices"][0]["message"]["content"]

Given this simple function find_song(), let’s evaluate it with different prompts and parameters.

import mlflow
import pandas as pd
import itertools


def generate_input(
    lyrics: str,
    temperature: float = 0.5,
    max_tokens: int = 256,
):
    return {
        "lyrics": lyrics,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


TEMPERATURE = [0, 1]
MAX_TOKENS = [3, 10]

param_list = [TEMPERATURE, MAX_TOKENS]

combinations = [p for p in itertools.product(*param_list)]  # generates 4 combinations

# data = pd.read_csv("queries.csv")
# lyrics_df (with a "Lyrics" column) is the ground-truth DataFrame loaded in the "Log inputs" section
data = pd.DataFrame({"Queries": lyrics_df["Lyrics"]})

experiment_name = "/Shared/llm-song-finder/"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(nested=True):
    for idx, comb in enumerate(combinations):
        with mlflow.start_run(run_name="EVALUATE_PROMPT_" + str(idx), nested=True):
            mlflow.log_params({"temperature": comb[0], "max_tokens": comb[1], "llm_model": "gpt-3.5-turbo"})
            data["input"] = data["Queries"].apply(lambda x: generate_input(x, comb[0], comb[1]))
            data["temperature"] = comb[0]
            data["max_tokens"] = comb[1]
            data["result"] = data["input"].apply(lambda x: find_song(**x))
            mlflow.log_table(data, artifact_file="eval_results.json")

For each run, we obtain a table with the results as follows:

The results of each run will be logged in an eval_results.json file as an artifact.
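
These logged tables are not locked into the UI: since they were written with mlflow.log_table(), they can be pulled back into a pandas DataFrame with mlflow.load_table(). A minimal sketch, assuming the experiment is the one set above:

import mlflow

mlflow.set_experiment("/Shared/llm-song-finder/")

# Collect the eval_results.json tables from all runs of the experiment
# into a single DataFrame; extra_columns keeps track of the originating run
eval_tables = mlflow.load_table(
    artifact_file="eval_results.json",
    extra_columns=["run_id"],
)
print(eval_tables.head())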

We can find all logged runs in the experiment page:

Compare runs

The whole point of logging all runs is being able to compare them, and `mlflow` provides a very convenient way to do so.

By selecting all the runs of interest and using the Evaluation tab in the experiment panel, we can compare their performance.

The comparison can be done on the metrics, parameters, or artifacts using a common grouping key, e.g. Queries. Below is an example:

Fig 4: Compare different runs to choose the best

As illustrated, we can use the Compare dropdown menu to select the metric we want to compare. In our case, we can see that max_tokens seems to be the factor that explains the truncation of the text, and (not shown here) the `temperature` determines whether the song’s name is wrapped in quotes or not.
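
The same comparison can also be done programmatically: mlflow.search_runs() returns the runs of an experiment as a pandas DataFrame, with one column per logged parameter, which becomes handy when the number of combinations grows. A short sketch:

import mlflow

# Fetch all runs of the experiment as a DataFrame (one row per run)
runs = mlflow.search_runs(experiment_names=["/Shared/llm-song-finder/"])

# Keep only the columns relevant to our small parameter grid
print(runs[["run_id", "tags.mlflow.runName", "params.temperature", "params.max_tokens"]])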

Compare to human evaluation

One of the most common practices in the development of LLM-based solutions is to compare the model’s predictions with a ground truth, e.g. human evaluation. This is a very important step to understand the model’s performance and to improve the model. At the time of writing, I am not aware of a dedicated functionality in mlflow for this purpose. However, we can use mlflow flexibly to do it.

experiment_name = "/Shared/llm-song-finder/"
mlflow.set_experiment(experiment_name)

data = pd.DataFrame({"Queries": lyrics_df["Lyrics"]})

with mlflow.start_run(run_name="human-answer"):
    mlflow.log_params({"llm_model": "human-answer"})
    data["result"] = lyrics_df["Songs"]
    mlflow.log_table(data, artifact_file="eval_results.json")

The tip here is to treat the ground truth (human labeling) as just another model’s prediction (a “human model”). By logging it the same way we log the model, we can log the “human model” and then compare the two sets of predictions using mlflow’s run comparison capabilities.

Fig 5: Compare outputs of the model’s and the human’s evaluations
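
Beyond the visual comparison in the UI, we can also do a quick quantitative check. The sketch below is hypothetical: it assumes we have noted the run IDs of one model run and of the “human-answer” run, loads both logged tables, and computes a naive exact-match score:

import mlflow

# Hypothetical run IDs, copied from the experiment page
model_run_id = "xxxxxxxx"
human_run_id = "xxxxxxxx"

model_table = mlflow.load_table("eval_results.json", run_ids=[model_run_id])
human_table = mlflow.load_table("eval_results.json", run_ids=[human_run_id])

# Align on the query and compute a naive exact-match rate
merged = model_table.merge(human_table, on="Queries", suffixes=("_model", "_human"))
accuracy = (
    merged["result_model"].str.strip().str.lower()
    == merged["result_human"].str.strip().str.lower()
).mean()
print(f"Exact-match accuracy vs human answers: {accuracy:.0%}")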

Log inputs

For reproducibility, it is important to keep track of the datasets used for each run. The challenge is to keep track of large datasets and avoid storing them in the mlflow tracking server.

Since version 2.4.0, MLflow has provided a dataset tracking API that offers a convenient way to log datasets.

import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset


dataset_source_path = "/dbfs/FileStore/tables/ground_truth_llm_lyrics_demo.csv"
lyrics_df = pd.read_csv(dataset_source_path, sep=";")

# Construct an MLflow PandasDataset from the Pandas DataFrame, and specify the path
# as the source
dataset: PandasDataset = mlflow.data.from_pandas(lyrics_df, source=dataset_source_path)

experiment_name = "/Shared/llmops_experiment/"
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name="log-input"):
    # Log the dataset to the MLflow Run. Specify the "validating" context to indicate
    # that the dataset is used for model validation
    mlflow.log_input(dataset, context="validating")

The pointer (URL, path, etc.) to the dataset and its profile are logged to the MLflow tracking server and can be retrieved later.

# Retrieve the run, including dataset information
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

Now, when we go to a given run, we can see the dataset used for that run:

Fig 6: Example of a logged validation dataset.

I find this feature very handy, but as stated in the mlflow documentation, the dataset logging API is still experimental and subject to change. Any development based on this feature should take this into account.

Conclusion

Through an oversimplified iteration on a small song detector app built on the OpenAI API, we have checked out recent functionalities in mlflow that can help improve the traceability of LLM-based solution development. This guide is intended to whet your appetite for the subject; many other interesting features remain to be explored.

In the next guide, we will discuss different approaches to evaluating LLM pipelines and how to couple them with an effective monitoring and alerting system.

Special thanks to LutzOfficial for the constructive comments!


An Truong
TotalEnergies Digital Factory

Senior data scientist with a passion for code. Follow me for more practical data science tips from the industry.