LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

Best Practices When Evaluating Fine-Tuned LLMs

How to evaluate a custom fine-tuned model, leveraging GPT3.5-Turbo and custom qualitative evaluation templates, while monitoring prompts and chains using Comet ML LLM.

Alex Razvant
Decoding ML

--

→ the 8th out of 12 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM.

Why is this course different?

By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Why should you care? 🫵

→ No more isolated scripts or Notebooks!
Learn production ML by building and deploying an end-to-end production-grade LLM system.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment.

You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.

The end goal? Build and deploy your own LLM twin.

What is an LLM Twin? It is an AI character that learns to write like somebody by incorporating their style and personality into an LLM.

The architecture of the LLM twin is split into 4 Python microservices:

  1. the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
  2. the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
  3. the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
  4. the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard (deployed on Qwak)
LLM twin system architecture [Image by the Author]

Alongside the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML as your ML platform, Qdrant as your vector DB, and Qwak as your ML infrastructure.

Who is this for?

Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using sound LLMOps principles.
Level: intermediate
Prerequisites: basic knowledge of Python, ML, and the cloud

How will you learn?

The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.

Also, it includes 2 bonus lessons on how to improve the RAG system.

You can read everything at your own pace.

→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

Costs?

The articles and code are completely free. They will always remain free.

But if you plan to run the code while reading it, you must know that we use several cloud tools that might generate additional costs.

The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.

For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.

Meet your teachers!

The course is created under the Decoding ML umbrella by:

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Lesson 8: Best practices when evaluating fine-tuned LLM models.

This lesson will focus on evaluating our fine-tuned LLM Twin model.

Before doing that, let’s walk through a short recap to understand how we got to the LLM evaluation stage:

→ In Lesson 6, we showcased extracting filtered data samples from Qdrant. Using knowledge distillation, we had GPT3.5-Turbo structure and generate the fine-tuning dataset, which is versioned with Comet ML.

In Lesson 7, we built the fine-tuning pipeline using the versioned datasets we’ve logged on Comet ML, composed the workflow, and deployed the pipeline on Qwak [2] to train our model.

In Lesson 8, we’ll focus on common evaluation methods for the various tasks LLMs perform. Since our use case is content generation, we’ll focus on human-in-the-loop evaluation and on using a larger model to assess the coherence of our LLM’s generations and to quantify other metrics.

It is important to differentiate between evaluating an LLM model on its own and evaluating an LLM-based system.

During LLM evaluation, we focus only on how our fine-tuned model generates content and how cohesive that generation is.

Here’s what we’re going to learn in this lesson:

  • Common LLM evaluation methods for different LLM tasks.
  • Composing evaluation prompt templates for specific use cases.
  • Prompt and chain monitoring with Comet ML LLM integration.
  • The LLM-Twin model evaluation workflow.
LLM Twin Model Evaluation

Table of Contents

  1. What is LLM evaluation?
  2. Evaluation Techniques
  3. How we evaluate our LLM-Twin Model
  4. Comet ML Prompt Monitoring
  5. Conclusion

What is LLM evaluation?

LLM evaluation is a crucial process used to assess the performance and capabilities of the models. It involves a series of tests and analyses to determine how well the model understands, interprets, and generates human-like text.

Being a fairly recent and fast-evolving AI field, LLM evaluation is not straightforward and there is no unified approach to measure their performance.

Due to the generative nature of LLMs, the evaluation processes for these models involve both quantitative and qualitative assessments.

Ensuring the effectiveness and safety of LLMs in practical applications should be a mandatory goal. Evaluating LLM models to reduce hallucinations, guarantee accuracy, and ensure ethical use is crucial as they become more integrated into diverse sectors.

Several metrics have been proposed in the literature for evaluating the performance of LLMs. It is essential to use the right metrics suitable for the problem we are attempting to solve.

LLM Evaluation vs RAG Evaluation
LLM evaluation focuses on the model’s ability to generate coherent, relevant, and contextually appropriate text based solely on its pre-trained knowledge.
This involves assessing metrics such as fluency, coherence, relevance, and adherence to the given prompts.

RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.
Metrics for RAG models often include precision and recall of the retrieval process, as well as the overall coherence and relevance of the augmented generation.
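For intuition, here is a minimal, self-contained sketch (not part of the course code) of how retrieval precision@k and recall@k could be computed, assuming you already know which document IDs are relevant for a given query:

def precision_recall_at_k(
    retrieved_ids: list[str], relevant_ids: set[str], k: int
) -> tuple[float, float]:
    # Keep only the top-k retrieved document IDs.
    top_k = retrieved_ids[:k]
    # Count how many of the top-k results are actually relevant.
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical example: 2 of the top-3 retrieved chunks are relevant.
print(precision_recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))  # -> (0.67, 0.67) approx.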

Next, let’s iterate over a few commonly used model evaluation techniques.

Evaluation Techniques

Let’s split these techniques by their intended use case.

Quantitative evaluation
Quantitative evaluation involves statistical measures to assess the accuracy, fluency, and other aspects of the generated text.
Here are some common metrics:

  • Perplexity:
    Measures how well the model predicts a sample of text; it is the exponential of the average negative log-likelihood per token. Lower perplexity indicates better performance and reflects the model’s ability to anticipate the next word in a sequence. (A minimal computation sketch for perplexity, BLEU, and ROUGE follows this list.)
  • BLEU (Bilingual Evaluation Understudy):
    Compares the n-gram overlap between the generated text and a reference text. Commonly used for machine translation, it’s also applicable to text-generation tasks. A higher BLEU score indicates better quality.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    Measures the overlap of n-grams, longest common subsequence, and word sequences between the generated text and reference texts.
    It’s widely used for evaluating summarization and translation models.
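As a quick illustration (a sketch, not part of the course code, assuming the nltk and rouge-score packages are installed), here is how these metrics could be computed for a single generated sentence against a reference:

import math

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "vector databases index numerical embeddings for fast similarity search"
candidate = "vector databases store embeddings and enable fast similarity search"

# BLEU: n-gram overlap between candidate and reference (smoothed for short texts).
bleu = sentence_bleu(
    [reference.split()], candidate.split(), smoothing_function=SmoothingFunction().method1
)

# ROUGE-1 / ROUGE-L: unigram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# Perplexity: exponential of the average negative log-likelihood per token.
# token_log_probs is assumed to come from the model under evaluation (dummy values here).
token_log_probs = [-2.1, -0.4, -1.3, -0.8]
perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))

print(bleu, rouge["rouge1"].fmeasure, rouge["rougeL"].fmeasure, perplexity)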

In case the LLM output is in a structured format, one could evaluate it against classical ML metrics, such as the following:

  • Accuracy:
    The ratio of correctly predicted instances to the total instances.
    Useful for tasks where the output is categorical or where there is a clear right or wrong answer, for example, named entity recognition (NER).
  • Precision, Recall, and F1 Score:
    Precision is the ratio of true positive predictions to all positive predictions made by the model, recall is the ratio of true positives to all actual positive instances, and the F1 score is the harmonic mean of the two, quantifying their balance. Valuable in classification or entity extraction tasks performed by LLMs (see the scikit-learn sketch below).
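For structured outputs such as extracted entity tags, a minimal sketch of these classical metrics using scikit-learn (assumed installed; the labels below are purely illustrative) could look like this:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold vs. predicted entity tags for a NER-style task.
y_true = ["PER", "ORG", "O", "LOC", "O", "ORG"]
y_pred = ["PER", "ORG", "O", "O", "O", "ORG"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(accuracy, precision, recall, f1)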

Qualitative evaluation
Qualitative evaluation involves human-in-the-loop judgment or larger models assessing aspects like relevance, coherence, creativity, and appropriateness of the content.
This type of evaluation provides insights that quantitative metrics might miss.

  • Human Review:
    Having domain experts or general users review the generated content to assess its quality based on various criteria such as coherence, fluency, relevance, and creativity.
  • Human-in-the-loop:
    Reinforcement Learning from Human Feedback, RLHF — humans can rate the quality of model outputs, and this feedback is used to fine-tune the model through reinforcement learning techniques.
  • LLM-based Evaluation:
    Involves using a larger general-knowledge model to evaluate the model’s behavior.

In our particular case, quantitative methods like BLEU & ROUGE are not applicable, as they can’t yield valuable insights. Since we’re evaluating how well our fine-tuned LLM generates written content, and its task is not summarization- or translation-oriented, we can effectively only evaluate the quality of the generated content using an LLM-based evaluation technique.

Why don’t BLEU & ROUGE work in our use case?

  1. They focus on measuring N-gram Overlaps.
    The generated content might have high variations in wording while still reflecting the user’s query.
  2. Lack of Semantic Understanding.
    They do not help evaluate the depth, coherence, or originality of the content.
  3. Weak at Capturing Creativity.
    They can’t quantify stylistic elements or the overall human-like quality of the text (see the short demo below).
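To make the first point concrete, here is a short demo (again a sketch using nltk, outside the course code) where two sentences with the same meaning but almost no shared n-grams receive a very low BLEU score:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "our team shipped the new feature ahead of schedule".split()
paraphrase = "we delivered the latest functionality earlier than planned".split()

# Same meaning, almost no shared n-grams -> BLEU is close to zero,
# even though a human would judge the paraphrase as perfectly valid.
score = sentence_bleu([reference], paraphrase, smoothing_function=SmoothingFunction().method1)
print(score)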

How we evaluate our LLM-Twin Model

We aim to verify whether our fine-tuned model can generate contextually accurate posts/articles that reflect the provided query.

Within this LLM evaluation stage, we’ll focus on this section of the LLM Twin system design:

Section from LLM Twin’s System Design. Image by the author.

Here’s the workflow overview:
1. Define the evaluation prompt template
2. Define the user query
3. Generate content based on the user query
4. Populate the evaluation template
5. Use GPT3.5-Turbo to evaluate
6. Log the evaluation prompt on Comet ML LLM

LLM Twin Model Evaluation

The Evaluation Prompt Template

from abc import ABC, abstractmethod

from langchain.prompts import PromptTemplate
from pydantic import BaseModel


class BasePromptTemplate(ABC, BaseModel):
    @abstractmethod
    def create_template(self, *args) -> PromptTemplate:
        pass


class LLMEvaluationTemplate(BasePromptTemplate):
    prompt: str = """
You are an AI assistant and your task is to evaluate the output generated by another LLM.
You need to follow these steps:
Step 1: Analyze the user query: {query}
Step 2: Analyze the response: {output}
Step 3: Evaluate the generated response based on the following criteria and provide a score from 1 to 5 along with a brief justification for each criterion:

Evaluation:
Relevance - [score]
[1 sentence justification why relevance = score]
Coherence - [score]
[1 sentence justification why coherence = score]
Conciseness - [score]
[1 sentence justification why conciseness = score]
"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["query", "output"])

Unpacking this template: given a user query and the generated response from our fine-tuned model, the evaluation model should analyze both (query, response) and rank the relationship between the query and the response on 3 criteria.
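For example, populating the template with a (query, output) pair is just a matter of formatting the PromptTemplate returned by create_template; here is a minimal sketch with illustrative values:

evaluation_template = LLMEvaluationTemplate()
prompt_template = evaluation_template.create_template()

# Fill in the two input variables declared in the template.
full_prompt = prompt_template.format(
    query="Could you please draft a LinkedIn post discussing Vector Databases?",
    output="Vector databases index numerical embeddings to enable fast similarity search...",
)
print(full_prompt)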

Relevance

Relevance measures how well the generated content aligns with the user query.

It calculates:

  • Content Match: how closely the generated content addresses the question posed by the query.
  • Topicality: The degree to which the content stays on-topic.

Example Evaluation Criteria:

  • Does the response directly answer the query?
  • Are the key points of the query adequately covered?
  • Is the information provided accurate and pertinent to the topic?

→ Coherence

How logically and smoothly the generated text flows.

It calculates:

  • Sentence Structure: how well sentences are constructed to relate to each other.
  • Clarity of Thought: overall readability and understandability of the text.

Example Evaluation Criteria:

  • Are the ideas presented in a logical order?
  • Do transitions between sentences and paragraphs make sense?
  • Is the text easy to follow and understand?

→ Conciseness

How compact the generated text is, free from unnecessary or redundant words.

It calculates:

  • Elimination of Redundancy: avoidance of repetitive information.
  • Directness: the ability to communicate ideas straightforwardly.

Example Evaluation Criteria:

  • Is the text compact and to the point?
  • Are there any redundant or repetitive phrases?

For all these criteria, we’re asking the larger LLM (GPT3.5-Turbo) to rank each of them on a 1–5 scale.
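If you later want to aggregate these scores programmatically, a simple parser over the evaluation text could look like the hypothetical sketch below (the criterion names match our template; the regex is an assumption about the output format):

import re

def parse_scores(evaluation_text: str) -> dict[str, int]:
    """Extract the 1-5 score for each criterion from the evaluation text."""
    scores = {}
    for criterion in ("Relevance", "Coherence", "Conciseness"):
        # Matches e.g. "Relevance - [4]" or "Relevance [4]".
        match = re.search(rf"{criterion}\s*-?\s*\[?\s*(\d)", evaluation_text)
        if match:
            scores[criterion] = int(match.group(1))
    return scores

print(parse_scores("Relevance - [4]\nCoherence - [5]\nConciseness - [4]"))
# {'Relevance': 4, 'Coherence': 5, 'Conciseness': 4}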

Code Walkthrough

Here’s how we define our eval method logic, where we compose, populate, and send the full prompt to GPT3.5-Turbo.

def eval(query: str, output: str) -> str:
    evaluation_template = templates.LLMEvaluationTemplate()
    prompt_template = evaluation_template.create_template()

    model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY)
    chain = GeneralChain.get_chain(
        llm=model, output_key="llm_eval", template=prompt_template
    )

    response = chain.invoke({"query": query, "output": output})

    return response["llm_eval"]
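A direct call would then look roughly like this (illustrative values; in the full workflow below, the same function is imported and invoked as evaluate_llm):

evaluation_report = eval(
    query="Could you please draft a LinkedIn post discussing Vector Databases?",
    output="Vector databases index numerical embeddings to enable fast similarity search...",
)
print(evaluation_report)  # GPT3.5-Turbo's scores and justifications for the 3 criteria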

The full eval workflow looks like this:

class LLMTwin:
    def __init__(self) -> None:
        self.qwak_client = RealTimeClient(
            model_id=settings.QWAK_DEPLOYMENT_MODEL_ID,
            model_api=settings.QWAK_DEPLOYMENT_MODEL_API,
        )
        self.template = InferenceTemplate()
        self.prompt_monitoring_manager = PromptMonitoringManager()

    def generate(
        self,
        query: str,
        enable_rag: bool = False,
        enable_evaluation: bool = False,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()

        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_evaluation is True:
            evaluation_result = evaluate_llm(query=query, output=answer)
        else:
            evaluation_result = None

        if enable_monitoring is True:
            if evaluation_result is not None:
                metadata = {"llm_evaluation_result": evaluation_result}
            else:
                metadata = None

            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )
            self.prompt_monitoring_manager.log_chain(
                query=query, response=answer, eval_output=evaluation_result
            )
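Putting it all together, invoking the twin with evaluation and monitoring enabled would look roughly like the sketch below (illustrative; the truncated snippet above omits the method’s return statement, but its signature indicates it returns a dict):

llm_twin = LLMTwin()

result = llm_twin.generate(
    query="Could you please draft a LinkedIn post discussing Vector Databases?",
    enable_rag=True,         # augment the prompt with context retrieved from Qdrant
    enable_evaluation=True,  # send the (query, answer) pair to GPT3.5-Turbo for scoring
    enable_monitoring=True,  # log the prompt, answer, and evaluation to Comet ML LLM
)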

To check the full implementation, visit our LLM Twin Repository [1].

Note that here we’re using the model deployed in the previous lesson’s training pipeline, which serves as the model under evaluation. The model we’ll deploy to production within the inference pipeline will be a separate instance, chosen after we’ve selected the best candidate based on our evaluation results.

In the next lesson, Lesson 9, we’ll discuss the inference pipeline and the production deployment in detail.

Points from this implementation:

  • We pass the query to our fine-tuned Mistral7b model deployed on Qwak [2], i.e. the model under evaluation.
  • We get the response and pass the (query, response) pair to evaluation.
  • The evaluation template is populated and sent to GPT3.5-Turbo.
  • The resulting prompt is logged to the Comet ML LLM [3] feature.

To find more about how we’ve deployed the training pipeline on Qwak, where we fine-tuned Mistral7b-Instruct on a custom dataset, check Lesson 7.

Here’s an example:

query: 
Could you please draft a LinkedIn post discussing Vector Databases?
I'm particularly interested in how do they work.

response:
Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches.
At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors.
These vectors are derived from the data itself, typically through techniques like hashing or embedding.
From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches.
By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets.
Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>

Next, you can see the logs from our Evaluation Chain.

> Entering new LLMChain chain...
Prompt after formatting:
You are an AI assistant and your task is to evaluate the output generated by another LLM.
You need to follow these steps:
Step 1: Analyze the user query: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.
Step 2: Analyze the response: {'content': ["<s> You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.\n Question: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.\n\nAnswer: Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches. At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors. These vectors are derived from the data itself, typically through techniques like hashing or embedding. From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches. By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets. Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"]}
Step 3: Evaluate the generated response based on the following blueprint, of [rank_score] - [description]:
- Relevance [rank_score] - [description] : where you give a score from 1 to 5 on how relevant the output is to the user query.
- Coherence [rank_score] - [description] : where you give a score from 1 to 5 on how coherent the output is.
- Conciseness [rank_score] - [description]: where you give a score from 1 to 5 on how concise the output is.

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

> Finished chain.
Step 1: Analyze the user query:
The user is requesting a LinkedIn post draft that discusses Vector Databases, with a focus on their functionality.

Step 2: Analyze the response:
The response generated by the other LLM provides an answer that explains vector databases, how they represent data, their similarity to search engines, and touches on the process of indexing and searching within these databases.

Step 3: Evaluate the generated response:
- Relevance [4] - The output is highly relevant as it directly addresses the user's interest in vector databases and how they work.
- Coherence [5] - The output is coherent as it presents a logical flow of information regarding vector databases.
- Conciseness [4] - The output is fairly concise, delivering a good amount of information in a compact format suitable for a LinkedIn post.

Comet ML Prompt Monitoring

Apart from the rich feature set for experiment tracking, Comet ML LLM [3] also offers quite useful features to monitor your LLM-based applications.

Why Monitor Prompts?
Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions. At the same time, it allows the ML engineers maintaining the project to identify bias or hallucinations early on and work on fixing them.

Why is it a best practice?

  • By logging and inspecting multiple sets of resulting prompts, you can extract insights and aggregate them into generalized quality metrics.
  • Useful for RLHF analysis.
  • Useful for inspecting a full chain, alongside its metadata, processing time, and the chain stages being executed.

Other advantages include filtering out inappropriate content and providing real-time feedback on how the model behaves, accessible from a centralized dashboard.

Apart from monitoring the actual prompt, we’ll also log the chain’s logic workflow, which lets us debug step by step and identify whether any chain stage corrupted the end response.

Below, you’ll find an example of a chain + prompt monitoring dashboard from Comet ML LLM [3]:

Comet LLM prompt + chain logging.

To log prompts to Comet LLM, we used this straightforward implementation:


def log(
    cls,
    prompt: str,
    output: str,
    prompt_template: str | None = None,
    prompt_template_variables: dict | None = None,
    metadata: dict | None = None,
) -> None:
    comet_llm.init()

    metadata = metadata or {}
    metadata = {
        "model": settings.MODEL_TYPE,
        **metadata,
    }

    comet_llm.log_prompt(
        workspace=settings.COMET_WORKSPACE,
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        prompt=prompt,
        prompt_template=prompt_template,
        prompt_template_variables=prompt_template_variables,
        output=output,
        metadata=metadata,
    )

To log chains, we have to log each chain step in order. In the example below, we start the chain with {"user_query": query} and link the next chain stage using comet_llm.Span, where the inputs must be the same as those of the previous stage.

The resulting chain is INPUT -> TWIN_RESPONSE -> GPT3.5-EVAL -> END.

For more details on structuring and logging chains on Comet ML LLM [3], check
🔗 Comet ML Chain Logging [4]


def log_chain(cls, query: str, response: str, eval_output: str):
    comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
    comet_llm.start_chain(
        inputs={"user_query": query},
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        workspace=settings.COMET_WORKSPACE,
    )
    with comet_llm.Span(
        category="twin_response",
        inputs={"user_query": query},
    ) as span:
        span.set_outputs(outputs=response)

    with comet_llm.Span(
        category="gpt3.5-eval",
        inputs={"eval_result": eval_output},
    ) as span:
        span.set_outputs(outputs=response)
    comet_llm.end_chain(outputs={"response": response, "eval_output": eval_output})

Conclusion

Here we’re wrapping up Lesson 8 of the LLM Twin free course.

We’ve described common evaluation metrics, quantitative and qualitative, and have exemplified a common evaluation approach using a larger model (GPT3.5-Turbo) to assess and rank our model’s responses based on relevance, cohesiveness, and conciseness.

Completing Lesson 8, you’ve gained a good understanding of what LLM evaluation represents, the common metrics used, how to compose an evaluation prompt template, how to populate it, and how to monitor the resulting evaluation insights using the Comet ML LLM [3] feature, where we have shown how to log single prompts and entire chains.

In Lesson 9, we’ll cover the process of building the inference RAG pipeline. We’ll connect the various components of the LLM-Twin system, such as the Qdrant vector DB and the Qwak inference pipeline, and prepare the system for a complete deployment. See you there!

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE

References

[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform landing Page

[3] Comet ML LLM, The Comet ML LLM Platform

[4] Comet ML Chain Logging, The Comet ML LLM Documentation


Alex Razvant
Decoding ML

Senior ML Engineer @ Everseen | Join the NeuralBits Newsletter, to uncover Deep Learning Systems - bit by bit, week by week! neuralbits.substack.com