LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

How to evaluate your RAG using RAGAs Framework

Learn how to evaluate your RAG, following the best industry practices using the RAGAs framework. Learn about Retrieval & Generation specific metrics and advanced RAG chain monitoring using CometML LLM.

Alex Razvant
Decoding ML

--

→ the 10th out of 12 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM.

Image generated by DALL-E.

Why is this course different?

By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Why should you care? 🫵

→ No more isolated scripts or Notebooks!
Learn production ML by building and deploying an end-to-end production-grade LLM system.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment.

You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.

The end goal? Build and deploy your own LLM twin.

What is an LLM Twin? It is an AI character that learns to write like somebody by incorporating their style and personality into an LLM.

The architecture of the LLM twin is split into 4 Python microservices:

  1. the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
  2. the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
  3. the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML’s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet’s model registry. (deployed on Qwak)
  4. the inference pipeline: load and quantize the fine-tuned LLM from Comet’s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet’s prompt monitoring dashboard. (deployed on Qwak)
LLM twin system architecture [Image by the Author]

Alongside the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML (your ML platform), Qdrant (your vector DB), and Qwak (your ML infrastructure).

Who is this for?

Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using sound LLMOps principles.
Level: intermediate
Prerequisites: basic knowledge of Python, ML, and the cloud

How will you learn?

The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.

Also, it includes 2 bonus lessons on how to improve the RAG system.

You can read everything at your own pace.

→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

Costs?

The articles and code are completely free. They will always remain free.

But if you plan to run the code while reading it, you must know that we use several cloud tools that might generate additional costs.

The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.

For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.

Meet your teachers!

The course is created under the Decoding ML umbrella by:

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Lesson 10: How to evaluate your RAG pipeline using the RAGAs Framework

Before jumping into the lesson, let’s walk through a short recap to understand how we got here:

In Lesson 8, we focused on common evaluation methods for the various tasks LLMs perform. Specifically, for our content-generation use case, we used a larger model (GPT-3.5-Turbo) via API to assess coherence and quantify other metrics for our LLM generations.

In Lesson 9, we showcased how to implement and deploy the inference pipeline of the LLM twin system on Qwak [2]. We iterated on the microservice-based design, separating the ML and business logic into two layers.

In Lesson 10 we’ll focus on the RAG-evaluation logic.

Here, we’ll showcase the evaluation steps we perform and how we structure the evaluation payload step by step. We’ll present one of the best RAG evaluation frameworks, RAGAs [5], and discuss the metrics, the implementation, and the other useful functionalities it provides.

Ultimately, we’ll learn how to monitor complex chains by designing each chain step individually, attaching metadata to it, and logging to CometML-LLM.

Here’s what we’re going to learn in this lesson:

  • Evaluation techniques for RAG applications.
  • How to use RAGAs to evaluate RAG applications.
  • How to build metadata chains and log them to CometML-LLM.
  • The LLM-Twin RAG evaluation workflow.
The LLM-Twin RAG Evaluation Workflow. Image by Author.

Table of Contents

  1. What is RAG evaluation?
  2. The RAGAs Framework
  3. How do we evaluate our RAG Application
  4. Advanced Prompt-Chain Monitoring
  5. Conclusion

What is RAG evaluation?

RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.

Building a RAG pipeline is fairly simple. You just need a vector DB as a knowledge base, an LLM to process your prompts, and additional logic for the interactions between these modules.

Reaching a satisfying performance level for a RAG pipeline poses its own challenges because of the “separate” components:

  • Retriever — which takes care of querying the Knowledge Database and retrieves additional context that matches the user’s query.
  • Generator — which encompasses the LLM module, generating an answer based on the context-augmented prompt.

When evaluating a RAG pipeline, we must evaluate both components separately and together to understand if and where the pipeline still needs improvement; this helps us identify its “quality”. Additionally, to understand whether its performance is improving over time, we need to evaluate it quantitatively.
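To make the split concrete, here is a minimal sketch of the two components as they meet at evaluation time; retrieve() and generate() are hypothetical stand-ins for the real retriever and LLM microservice shown later in this lesson:

def retrieve(query: str) -> list[str]:
    # Hypothetical stand-in for the Qdrant-backed retriever
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]


def generate(query: str, contexts: list[str]) -> str:
    # Hypothetical stand-in for the fine-tuned LLM behind the REST API
    return f"<answer to '{query}' grounded in {len(contexts)} chunks>"


def rag_answer(query: str) -> tuple[str, list[str]]:
    contexts = retrieve(query)          # Retriever: query the knowledge base
    answer = generate(query, contexts)  # Generator: context-augmented LLM call
    return answer, contexts             # keep both so each component can be evaluated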

The RAGAs Framework

Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. There are existing tools and frameworks that help you build these pipelines (e.g., LlamaIndex), but evaluating them and quantifying your pipeline’s performance can be hard.
This is where Ragas (RAG Assessment) comes in.

The RAGAs [5] framework (5.3k ⭐️) is open-source, part of the explodinggradients group, and comes with an accompanying paper: RAGAs Paper [6].

One of the core concepts of RAGAs is Metric-Driven-Development (MDD) which is a product development approach that relies on data to make well-informed decisions. The focus is to leverage powerful LLMs under the hood to conduct targeted evaluation processes, instead of relying on HITL (human-in-the-loop) for ground truth annotations.

RAGAs Metrics

Let’s go over the metrics that RAGAs exposes (see RAGAs Metrics [4]):

Metrics for Retrieval Stage 🔽:

  1. Context Precision
    Evaluates the precision of the context used to generate an answer, ensuring relevant information is selected from the context
  2. Context Relevancy
    Measures how relevant the selected context is to the question. Helps improve context selection to enhance answer accuracy.
  3. Context Recall
    Measures if all the relevant information required to answer the question was retrieved.
  4. Context Entities Recall
    Evaluates the recall of entities within the context, ensuring that no important entities are overlooked in context retrieval.

Metrics for Generation Stage 🔽:

  1. Faithfulness
    Measures how accurately the generated answer reflects the source content, ensuring the generated content is truthful and reliable.
  2. Answer Relevance
    Assesses how pertinent the answer is to the given question. It validates that the response directly addresses the user’s query.
  3. Answer Semantic Similarity
    Quantifies the semantic similarity between the generated answer and the expected “ideal” answer. Shows that the generated content is semantically aligned with expected responses.
  4. Answer Correctness
    Focuses on fact-checking, assessing the factual accuracy of the generated answer.

A subset or all of these metrics can be used throughout the evaluation setup. In our LLM-Twin RAG use case, we’ll use 6 metrics that target both the Retrieval and Generation modules:

  • Context Utilization (the variant of Context Precision that needs no ground truth), Context Relevancy, Context Recall, and Context Entity Recall — for Retrieval.
  • Answer Semantic Similarity and Answer Correctness — for Generation.

RAGAs Evaluation Format

To evaluate the RAG pipeline, RAGAs expects the following dataset format:

question      : The user query; this is the input to our RAG pipeline.
answer        : The answer generated by the RAG pipeline, given the query + context prompt.
contexts      : The context retrieved from the knowledge base (the vector database).
ground_truths : The ground-truth answer to the question.

[Note] : The `ground_truths` is necessary only if the ContextRecall metric is used.

📓 All the listed RAGAs metrics use the question, answer, and contexts fields. It is important to note that the only metric that requires the ground_truths field is Context Recall, as it measures whether all the relevant information required to answer the question was retrieved from the Vector DB.

Here’s a quick example of what a dataset setup for RAGAs looks like:

from datasets import Dataset

questions = ["When was the Eiffel Tower built and how tall is it?"]
answers = [
    "As of my last update in April 2023, the Eiffel Tower was built in 1889 and is 324m tall"
]
contexts = [
    # one list of context chunks per sample
    [
        "The Eiffel Tower is one of the most attractive monuments to visit when in Paris, France. It was constructed in 1889 as the entrance arch to the 1889 World's Fair. It stands at 324 meters tall."
    ]
]
ground_truths = [
    ["The Eiffel Tower was built in 1889 and it stands at 324 meters tall."]
]

sample = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}

eval_dataset = Dataset.from_dict(sample)

Here’s what the dataset looks like:

#> print(eval_dataset)
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 1
})

Once the dataset is created, RAGAs requires a set of metrics to be passed to the evaluation method:

from ragas import evaluate
from ragas.metrics import (
    answer_similarity,
    context_recall,
)

scores = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, answer_similarity],
)

# Scores will be a dictionary of this format
# scores = {
#     "context_recall": 0.95,
#     "answer_similarity": 0.98
# }
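Depending on your RAGAs version, the object returned by evaluate behaves like a dict of metric name → score and can usually be exported to a pandas DataFrame for per-sample inspection; a small usage sketch:

# Read individual metric scores (dict-like access)
print(scores["answer_similarity"])  # e.g. 0.98

# Export one row per evaluated sample for inspection
df = scores.to_pandas()
print(df.head())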

Now that we’ve gone over the prerequisites necessary to work with RAGAs, let’s see the framework applied to our LLM-Twin RAG evaluation use case.

How do we evaluate our RAG Application?

Within this evaluation stage, we’ll focus on this section of the LLM Twin system design:

Section from LLM Twin’s System Design. Image by the author.

Here’s the workflow overview:
1. Define the Evaluation Prompt Template
2. Define the user query
3. Retrieve context related to the user query from our Vector Database
4. Format the prompt and pass it to our LLM model
5. Capture the answer and use the query/context to prepare the evaluation data samples
6. Evaluate with RAGAs
7. Construct the evaluation Chain, append metadata, and log to CometML

The RAG Evaluation workflow. Image by author.

🗒 One interesting detail before diving into the implementation: we should note that we aim to make the LLM-Twin replicate our writing style.

For this particular use case, we can use the context that we retrieve from our Vector Database as the ground_truth itself when evaluating.

Why❓

Since we already store our writings (posts/articles/code) in the Vector DB, they can play a double role: they are the context we pass to the LLM for generation and, at the same time, the ground_truth we compare the RAG response against during evaluation.
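To make this concrete, a single evaluation sample then looks roughly like the sketch below; the same shape is used in the evaluate_w_ragas payload later in this lesson:

# Sketch: the retrieved context chunks double as the reference ("ground truth")
retrieved_context = ["<chunk retrieved from Qdrant: one of our previous articles>"]

eval_sample = {
    "question": ["<the user query>"],
    "answer": ["<the LLM Twin's generated answer>"],
    "contexts": [retrieved_context],      # context passed to the LLM for generation
    "ground_truth": [retrieved_context],  # the same chunks reused as the reference
}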

With that detail in mind, let’s now go through the implementation, following this blueprint:

  1. We’ll go over the Prompt Templates.
  2. We’ll prepare the query/response/context payloads for evaluation.
  3. We’ll evaluate using RAGAs.
  4. We’ll monitor everything in CometML.

The Generation Prompt Template

class InferenceTemplate(BasePromptTemplate):
    simple_prompt: str = """You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.
    Question: {question}
    """

    rag_prompt: str = """You are a specialist in technical content writing. Your task is to create technical content based on a user query given a specific context
    with additional information consisting of the user's previous writings and his knowledge.

    Here is a list of steps that you need to follow in order to solve this task:
    Step 1: You need to analyze the user provided query: {question}
    Step 2: You need to analyze the provided context and how the information in it relates to the user question: {context}
    Step 3: Generate the content keeping in mind that it needs to be as cohesive and concise as possible related to the subject presented in the query and similar to the user's writing style and knowledge presented in the context.
    """

    def create_template(self, enable_rag: bool = True) -> PromptTemplate:
        if enable_rag is True:
            return PromptTemplate(
                template=self.rag_prompt, input_variables=["question", "context"]
            )

        return PromptTemplate(template=self.simple_prompt, input_variables=["question"])

Unpacking this template, we specify in the system prompt that our LLM model should analyze the query in Step 1, analyze the retrieved context in Step 2, and comply with the generation instructions in Step 3.
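As a quick usage sketch, assuming PromptTemplate here is LangChain’s PromptTemplate class, formatting the RAG prompt looks like this:

# Usage sketch (hypothetical query/context values)
template = InferenceTemplate().create_template(enable_rag=True)

prompt = template.format(
    question="How do I evaluate a RAG pipeline?",
    context="<chunks retrieved from the Qdrant vector DB>",
)
print(prompt)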

Preparing the Evaluation Payload

Let’s go through each module sequentially. We have defined our PromptTemplate and assigned the question field with the input query. Next, we have to retrieve context samples from our Vector Database.

Here’s how the retrieval logic works:

# 1. We instantiate a VectorRetriever that communicates with Vector DB.
retriever = VectorRetriever(query=query)

# 2. Initial fetch of K entries
hits = retriever.retrieve_top_k(
    k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
)

# 3. Re-rank entries using post-retrieval augmentation techniques
context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)

# 4. Update context
prompt_template_variables["context"] = context
prompt = prompt_template.format(question=query, context=context)

To get a deeper dive into the re-ranking technique mentioned at Step 3 in the code above, make sure to check 📓 Lesson 5.
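For intuition, a re-ranking step of this kind is often implemented with a cross-encoder that re-scores every (query, passage) pair; here is a hedged sketch of that idea, not the exact Lesson 5 implementation:

from sentence_transformers import CrossEncoder


def rerank(query: str, hits: list[str], keep_top_k: int) -> list[str]:
    # Score each (query, passage) pair with a cross-encoder and keep the best ones
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, passage) for passage in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:keep_top_k]]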

After we’ve retrieved the context, it’s time to pass our prompt to the inference pipeline deployed on Qwak [2] and get the LLM generation response.

For a deeper dive into how the inference pipeline was built and deployed,
📓 Lesson 9 covers it in great detail.

Next, we have the evaluation block code:

if enable_evaluation is True:
    if enable_rag:
        st_time = time.time_ns()
        rag_eval_scores = evaluate_w_ragas(
            query=query, output=answer, context=context
        )
        en_time = time.time_ns()
        self._timings["evaluation_rag"] = (en_time - st_time) / 1e9

    st_time = time.time_ns()
    llm_eval = evaluate_llm(query=query, output=answer)
    en_time = time.time_ns()
    self._timings["evaluation_llm"] = (en_time - st_time) / 1e9

    evaluation_result = {
        "llm_evaluation": "" if not llm_eval else llm_eval,
        "rag_evaluation": {} if not rag_eval_scores else rag_eval_scores,
    }
else:
    evaluation_result = None

Key insights from this implementation:

  • We’re applying the LLM evaluation stage described in Lesson 8 to evaluate (query,response) pairs.
  • We’re applying the RAG evaluation stage to evaluate (query,response,context) pairs.
  • We use a _timings dictionary to track the execution duration for performance profiling purposes.

The core RAGAs evaluation functionality is handled within the evaluate_w_ragas method. Here’s what it looks like:

from datasets import Dataset
from langchain_openai import ChatOpenAI  # import paths may differ across LangChain versions
from langchain_community.embeddings import HuggingFaceEmbeddings
from pandas import DataFrame
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_entity_recall,
    context_recall,
    context_relevancy,
    context_utilization,
)

METRICS = [
    context_utilization,
    context_relevancy,
    context_recall,
    answer_similarity,
    context_entity_recall,
    answer_correctness,
]


def evaluate_w_ragas(query: str, context: list[str], output: str) -> DataFrame:
    """
    Evaluate the RAG (query, context, response) triplet using RAGAs.
    """
    data_sample = {
        "question": [query],        # Question as Sequence(str)
        "answer": [output],         # Answer as Sequence(str)
        "contexts": [context],      # Context as Sequence(str)
        "ground_truth": [context],  # Ground Truth as Sequence(str)
    }

    oai_model = ChatOpenAI(
        model=settings.OPENAI_MODEL_ID,
        api_key=settings.OPENAI_API_KEY,
    )
    embd_model = HuggingFaceEmbeddings(model_name=settings.EMBEDDING_MODEL_ID)

    dataset = Dataset.from_dict(data_sample)
    score = evaluate(
        llm=oai_model,
        embeddings=embd_model,
        dataset=dataset,
        metrics=METRICS,
    )

    return score

What should we note here:

  • We’re preparing the evaluation dataset using the data_sample dictionary.
  • We’re instantiating a connector to the OpenAI GPT model; this will be used as the underlying LLM that performs the evaluation logic within RAGAs. The model tag from settings is gpt-4-1106-preview.
  • We’re instantiating a connector to a HuggingFaceEmbeddings model.
    We’re using the same embedding model we used to encode our samples before storing them in our Qdrant VectorDB instance.
    The model tag from settings is sentence-transformers/all-MiniLM-L6-v2 (a hedged sketch of these settings fields follows this list).
  • We’re composing the payload and passing it to the evaluate method.
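For reference, the settings fields assumed above could live in a Pydantic settings object along these lines; this is a hypothetical sketch, not the course’s actual Settings class, with the model tags taken from the bullets above:

from pydantic_settings import BaseSettings  # Pydantic v2; on v1 use `from pydantic import BaseSettings`


class EvalSettings(BaseSettings):
    # Hypothetical sketch of the fields referenced by evaluate_w_ragas
    OPENAI_MODEL_ID: str = "gpt-4-1106-preview"
    OPENAI_API_KEY: str = ""  # read from the environment in practice
    EMBEDDING_MODEL_ID: str = "sentence-transformers/all-MiniLM-L6-v2"


settings = EvalSettings()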

Once the execution reaches this stage, we might see the following log section in the console:

RAGAs evaluation process console logs.

Once the evaluation is completed, the score variable will hold a dict of this format:

score = {
    "context_utilization": float,    # how useful the context is to the generated answer
    "context_relevancy": float,      # how relevant the context is to the given query
    "context_recall": float,         # proportion of relevant context retrieved
    "answer_similarity": float,      # semantic similarity between answer and ground truth
    "answer_correctness": float,     # factual correctness of the generated answer
    "context_entity_recall": float,  # recall of relevant entities in the context
}
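Since the monitoring chain in the next section logs these scores as plain values, here is a small sketch of how they could be flattened before logging, assuming the dict-like result shown above:

# Flatten the metric scores into rounded floats before attaching them to the chain
rag_eval_scores = {metric: round(float(value), 4) for metric, value in score.items()}
# e.g. {"context_utilization": 0.91, "context_recall": 0.95, ...}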

In the next section, let’s compose the full evaluation chain step by step and log it to Comet ML LLM [3] for monitoring.

Advanced Prompt-Chain Monitoring

Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions. At the same time, it allows the ML engineers maintaining the project to identify and mitigate bias or hallucinations and fix them early on.

📓 In Lesson 8, we’ve described Prompt Monitoring advantages in more detail.

In this section, we’ll focus solely on how to compose end-to-end Chains and log them to Comet ML LLM [3]. Let’s dive into the code and describe each component a Chain consists of.

Step 1: Defining the Chain Start
Here we specify the CometML project and workspace where we want to log this chain, and we set its inputs to mark the start.

import comet_llm

comet_llm.init(project=[project])
comet_llm.start_chain(
    inputs={"user_query": [our_query]},
    project=[comet-llm-project],
    api_key=[comet-llm-api-key],
    workspace=[comet-llm-ws],
)

Step 2: Defining Chain Stages
We’re using multiple Span (comet_llm.Span) objects to define chain stages. Inside a Span object, we have to define:

  • category — which acts as a group key.
  • name — the name of the current chain step (will appear in CometML UI)
  • inputs — as a dictionary, used to link with previous chain steps (Spans)
  • outputs — as a dictionary, where we define the outputs from this chain step.

with comet_llm.Span(
    category="RAG Evaluation",
    name="ragas_eval",
    inputs={"query": [our_query], "context": [our_context], "answers": [llm_answers]},
) as span:
    span.set_outputs(outputs={"rag-eval-scores": [ragas_scores]})

Step 3: Defining the Chain End
The last step, after starting the chain and appending chain stages, is to mark the chain’s end and attach the returned response.


comet_llm.end_chain(outputs={"response": [our-rag-response]})

Now that we’ve understood the logic behind Comet ML LLM [3] Chain monitoring, let’s see what the actual implementation looks like:

# == START CHAIN ==
comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
comet_llm.start_chain(
    inputs={"user_query": query},
    project=f"{settings.COMET_PROJECT}-monitoring",
    api_key=settings.COMET_API_KEY,
    workspace=settings.COMET_WORKSPACE,
)

# == CHAINING STEPS ==
with comet_llm.Span(
    category="Vector Retrieval",
    name="retrieval_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"retrieved_context": context})

with comet_llm.Span(
    category="LLM Generation",
    name="generation_step",
    inputs={"user_query": query},
) as span:
    span.set_outputs(outputs={"generation": llm_gen})

with comet_llm.Span(
    category="Evaluation",
    name="llm_eval_step",
    inputs={"query": llm_gen, "user_query": query},
    metadata={"model_used": settings.OPENAI_MODEL_ID},
) as span:
    span.set_outputs(outputs={"llm_eval_result": llm_eval_output})

with comet_llm.Span(
    category="Evaluation",
    name="rag_eval_step",
    inputs={
        "user_query": query,
        "retrieved_context": context,
        "llm_gen": llm_gen,
    },
    metadata={
        "model_used": settings.OPENAI_MODEL_ID,
        "embd_model": settings.EMBEDDING_MODEL_ID,
        "eval_framework": "RAGAS",
    },
) as span:
    span.set_outputs(outputs={"rag_eval_scores": rag_eval_scores})

# == END CHAIN ==
comet_llm.end_chain(outputs={"response": llm_gen})

📓 For the full chain monitoring implementation, check the PromptMonitoringManager class.

You might have noticed that Spans also have a metadata field attached. We’re using it to log additional data that is relevant only to the current chain step.

For instance, in the rag_eval_step, we’re adding the evaluation framework and the model types used. In the CometML UI, we can see the attached metadata.

Chain Step specific Metadata. Image by Author.

Once the evaluation process is completed, and the chain is logged successfully to Comet ML LLM [3], this is what we’re expecting to see:

Chain logged on CometML. Focus on the LLM Evaluation Stage only.

For a refresher on how we evaluate the LLM model only, make sure to check
📓 Lesson 8 where we’ve described it in detail.

And if we want to see the RAG evaluation scores:

Chain logged on CometML. Focus on the RAG Evaluation Stage only.

Conclusion

Here we’re wrapping up Lesson 10 of the LLM Twin free course.

We’ve described the LLM-Twin RAG evaluation workflow using a powerful framework called RAGAs. We’ve explained the metrics used, how to implement the evaluation functionality and how to compose the evaluation dataset.

Additionally, we’ve showcased and exemplified how to effectively monitor chains with multiple execution steps on Comet ML LLM [3], how to attach metadata, how to group chain steps, and more.

By completing Lesson 10, you’ve gained a good understanding of how you can build a full RAG evaluation pipeline using RAGAs. You’ve learned the Retrieval & Generation specific metrics you could use and all the details required to log large LLM chains to Comet ML LLM [3].

In Lesson 11, we’ll start our bonus series on improving the RAG feature pipeline to make the RAG system more scalable and accurate. We will also show you how to make the code cleaner and more concise.

🔗 Check out the code on GitHub [1] and support us with a ⭐

References

[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform landing Page

[3] Comet ML LLM, The Comet ML LLM Platform

[4] RAGAs Metrics, The RAGAs Framework Metrics Documentation

[5] RAGAs, The RAGAs Framework Github Repository

[6] RAGAs Paper, 2023, The RAGAs Arxiv Paper

Alex Razvant
Decoding ML

Senior ML Engineer @ Everseen AI | Weekly expert ML & MLOps Insights | Author of Neural Bits Newsletter: neuralbits.substack.com