Evaluating & Tracking LLMs using MLflow Model Evaluation & Phoenix - Part 2

In the previous article, we saw how to implement the MLflow AI Gateway feature, which gives us the flexibility of managing LLMs and their configurations in one place. In today's article, we will see how to implement the evaluation and tracking features for LLMs.

Image credits: the original MLflow documentation and internet sources

Large Language Models (LLMs) have demonstrated remarkable text-generation capabilities across domains like question answering, translation, and text summarization. Assessing the performance of LLMs presents unique challenges, since there often isn't a single definitive ground truth to compare against. MLflow addresses this by offering the mlflow.evaluate() API for LLM evaluation.

MLflow’s LLM evaluation functionality comprises three key components:

  1. The model to be evaluated: This could be an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model, such as a HuggingFace text-summarization pipeline (a quick sketch of the callable option follows this list).
  2. Metrics: LLM evaluate employs LLM-specific metrics to compute performance.
  3. Evaluation data: This refers to the dataset against which your model's performance is assessed. It can be in the form of a pandas DataFrame, a Python list, a NumPy array, or an instance of mlflow.data.dataset.Dataset.
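To make the "Python callable" option concrete, here is a minimal sketch; dummy_qa_model and its canned answer are hypothetical stand-ins used only to show the shape of the API, not part of the evaluation we run below.

import mlflow
import pandas as pd

def dummy_qa_model(inputs: pd.DataFrame) -> pd.Series:
    # A hypothetical model: returns one placeholder answer per input row.
    return pd.Series(["MLflow is an open-source MLOps platform."] * len(inputs))

data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": ["MLflow is an open-source platform for managing the ML lifecycle."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model=dummy_qa_model,
        data=data,
        targets="ground_truth",
        model_type="question-answering",
    )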

Without much delay, let's jump into the practical implementation of LLM evaluation. First, let's prepare a DataFrame of questions and ground truth as below.

import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": [
            "Explain ReactJS and its main use cases?",
            "Explain Node.js and its main use cases?",
        ],
        "ground_truth": [
            "ReactJS, commonly known as React, is an open-source JavaScript library used for building user interfaces. "
            "Developed and maintained by Facebook, React allows developers to create interactive and dynamic user interface components for web applications. "
            "It is highly popular for its component-based architecture, which makes it easy to build reusable UI elements. "
            "React provides a virtual DOM (Document Object Model) that enhances the performance of web applications. "
            "It allows React to efficiently update only the parts of the actual DOM that have changed, resulting in faster rendering and improved user experience.",
            "Node.js, often referred to as Node, is an open-source, cross-platform runtime environment that executes JavaScript code outside of a web browser. "
            "Node.js is commonly used for building server-side applications and network applications. "
            "Its main use cases include developing web servers, real-time applications like chat applications, and building APIs (Application Programming Interfaces). "
            "Node.js's non-blocking I/O and event-driven architecture make it well-suited for handling concurrent and data-intensive operations. "
            "Node.js is built on the V8 JavaScript engine and is known for its asynchronous, event-driven architecture. This architecture enables Node.js to handle numerous connections concurrently without blocking the execution of other code.",
        ],
    }
)

Now let's evaluate the "GPT-4" model as below.


import mlflow
import openai
from mlflow.metrics.genai import answer_similarity, answer_relevance

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Wrap "gpt-4" as an MLflow model.
    logged_model = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.ChatCompletion,
        artifact_path="models",  # where the model artifact is stored within the run
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    print(logged_model.model_uri)

    # Evaluate the logged model against the questions and ground truth prepared above.
    results = mlflow.evaluate(
        logged_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity(), answer_relevance()],
    )

Once you run the code, it will generate answers for the provided questions and compare them with the ground truth to produce the evaluation metrics. When the execution is complete, launch the MLflow UI from your source directory in a terminal:
mlflow ui --host 127.0.0.1 --port 3000

UI showing the mlflow run with evaluation metrics.

MLflow’s Large Language Model (LLM) evaluation framework simplifies the process of evaluating LLMs by providing default metric collections for specific tasks, such as “question-answering.” These predefined metric sets streamline the evaluation process, making it more efficient. When using these default metrics, you can specify the model_type argument within the mlflow.evaluate() function, as demonstrated in the above example.

For instance, if you’re evaluating a question-answering model, you can use the default metrics collection for this specific task, which includes metrics like “exact-match,” “toxicity,” “ARI grade level,” and “Flesch-Kincaid grade level.” Similarly, other supported model types, such as “text-summarization” and “text”, have their own predefined metric sets tailored to their respective tasks. By specifying the appropriate model_type when invoking mlflow.evaluate(), you can easily assess an LLM's performance according to the metrics relevant to the use case at hand.
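Once mlflow.evaluate() finishes, you don't have to rely on the UI alone: the returned EvaluationResult exposes the scores programmatically. A minimal sketch, reusing the results variable from the GPT-4 example above ("eval_results_table" is the default table name used by MLflow's evaluator):

# Aggregated metrics, e.g. toxicity, ARI grade level, Flesch-Kincaid grade level
print(results.metrics)

# Per-row results: the model's answers alongside the per-row metric scores
eval_table = results.tables["eval_results_table"]
print(eval_table.head())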

Using an LLM as a judge:

In MLflow, there’s a set of pre-canned metrics that leverage Large Language Models (LLMs) as judges for evaluating model performance. These metrics are particularly useful when assessing the quality of LLM-generated outputs against ground_truth information. The first metric, mlflow.metrics.genai.answer_similarity, measures how similar the model’s output is to the ground_truth. Higher scores indicate that the model’s output closely aligns with the ground_truth, while lower scores suggest discrepancies.

The second metric, mlflow.metrics.genai.answer_correctness, builds upon answer_similarity and assesses the factual correctness of the model’s output relative to the ground_truth. Higher scores here indicate not only similarity but also factual accuracy.

The mlflow.metrics.genai.answer_relevance metric focuses on the relevance of the model’s output to the input, disregarding the context. It evaluates whether the model’s outputs are on-topic or not.

The mlflow.metrics.genai.relevance metric delves further, considering both the input and context’s relevance to the model’s output. It assesses whether the model has comprehended the context and extracted relevant information from it.

Lastly, mlflow.metrics.genai.faithfulness evaluates the faithfulness of the model’s output concerning the provided context. High scores indicate that the model’s output aligns well with the context, while low scores suggest disagreements.

These metrics provide a comprehensive way to gauge LLM performance by considering various aspects like similarity, correctness, relevance, and faithfulness to ground_truth and context, making them valuable tools for model evaluation.
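Each of these judge metrics is a factory function, so you can also choose which LLM acts as the judge when you build the metric. A minimal sketch, reusing logged_model and eval_df from the earlier example; the "openai:/gpt-4" judge URI follows MLflow's genai metric documentation, but the default judge and the URI scheme can differ between MLflow versions:

import mlflow
from mlflow.metrics.genai import answer_similarity, answer_correctness

# Explicitly pick GPT-4 as the judge for each metric
similarity = answer_similarity(model="openai:/gpt-4")
correctness = answer_correctness(model="openai:/gpt-4")

results = mlflow.evaluate(
    logged_model.model_uri,  # logged model from the earlier example
    eval_df,                 # questions and ground truth prepared earlier
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[similarity, correctness],
)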

LLM Tracing:

In this article we will use Arize Phoenix as our LLM observability tool, as it offers a hassle-free observability solution for Large Language Model Operations (LLMOps), providing quick insights without the need for extensive configuration. With a primary focus on a notebook-centric experience, Phoenix allows you to monitor your models and Large Language Model (LLM) Applications effectively. It achieves this by delivering two key functionalities:

  1. LLM Traces: This feature enables you to take a deep dive into the inner workings of your LLM application. It provides detailed insights by tracing the execution of your application, making it easier to diagnose issues related to retrieval and tool execution.
  2. LLM Evals: Phoenix harnesses the capabilities of large language models to assess the relevance and toxicity of your generative models or applications. This functionality empowers you to evaluate the quality and performance of your models efficiently.

Let’s look into the practical implementation of the observability configuration.

from dotenv import load_dotenv, find_dotenv
from llama_index.callbacks import CallbackManager
from llama_index.chat_engine.types import ChatMode
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, LLMPredictor, OpenAIEmbedding
from phoenix.trace.llama_index import OpenInferenceTraceCallbackHandler
import phoenix as px

load_dotenv(find_dotenv())  # loads OPENAI_API_KEY and friends from a .env file

# Phoenix callback handler that records the LlamaIndex traces
callback_handler = OpenInferenceTraceCallbackHandler()

# Documents to index ("./data" is an example path; point it at your own data)
documents = SimpleDirectoryReader("./data").load_data()

def load_gpt35_turbo():
    llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
    llm_predictor = LLMPredictor(llm=llm)
    embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
    service_context = ServiceContext.from_defaults(
        llm_predictor=llm_predictor,
        embed_model=embed_model,
        callback_manager=CallbackManager(handlers=[callback_handler]),
        chunk_size=500,
        chunk_overlap=100,
    )
    return VectorStoreIndex.from_documents(documents=documents, service_context=service_context)


# View the traces in the Phoenix UI
session = px.launch_app()
print(f'please launch your phoenix ui at {session.url}')
payload = "what are all the areas where generative ai can be used?"
query_engine = load_gpt35_turbo().as_chat_engine(verbose=False, chat_mode=ChatMode.REACT)
print(payload)
result = query_engine.chat(payload)

When I fire the above query as a standalone program, or from Postman by wrapping it in an API, below is the response.

LLM’s response to the query

But what exactly happened in the background? How many chunks were created from the data we provided? What context was extracted, and from which pages? What kind of prompt template was used? All of these questions are answered by LLM observability, or tracing, as shown below.

Summary of traces of the queries the model received
In depth: checking what context is extracted and passed to the model
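If you also want the traces outside the UI, for example to store them or run your own analysis, the Phoenix session can export the recorded spans as a pandas DataFrame. A minimal sketch, assuming the session started with px.launch_app() above is still active (the export API may vary slightly between Phoenix versions):

import phoenix as px

# Export the spans recorded so far from the active Phoenix session
spans_df = px.active_session().get_spans_dataframe()

# Inspect what was captured: prompts, retrieved context, token counts, etc.
print(spans_df.columns)
print(spans_df.head())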

Conclusion:
In this article we covered different LLM concepts: using an LLM as a judge via LLM evaluation metrics, and LLM traceability using Arize Phoenix. These two features will help us create LLM-based applications that serve customer needs with greater accuracy and give us more insight into what customers actually need.

Support:

  1. 20 claps if you feel the article is informative.
  2. 40 claps if you feel the article is different and can be used in a number of ways.
  3. 60 claps if you feel the article is out-of-the-box research.
  4. 100 claps if you feel the article is extraordinary.
  5. +1 additional clap for the code & content being made generally available and contributed to the community.
