LLM Evaluation Toolkit for RAG Pipelines

Introducing the SuperKnowa framework for evaluating LLM-based RAG pipelines and automatically generating a leaderboard

Shivam Solanki
Towards Generative AI
5 min read · Aug 21, 2023


In the rapidly evolving field of Natural Language Processing (NLP), Retrieval-Augmented Generation (RAG) pipelines have emerged as a powerful approach to question-answering tasks. We covered the building blocks of the RAG pipeline in detail in our previous blogs. The ability to evaluate these pipelines effectively is crucial for ongoing development and research. In this blog post, we’ll explore an evaluation framework that we have developed specifically for RAG pipelines, offering insights into its structure, purpose, and key components.

Evaluation pipeline

Purpose and Need for Evaluation

Evaluating RAG pipelines is essential for several reasons:

Model Comparison: Understanding how different models perform under similar conditions.
Performance Tuning: Identifying areas for improvement and optimizing model parameters.
Insights and Analysis: Gaining insights into how models are functioning, what they are getting right, and where they are going wrong.
Collaboration and Sharing: Facilitating collaboration between researchers, data scientists, and practitioners by standardizing the evaluation process.

Leaderboard

Our evaluation framework for RAG pipelines includes a comprehensive toolkit designed to make the evaluation process more streamlined and insightful. Here’s an overview of our evaluation toolkit:

1. Configurable JSON File: Allows users to specify various parameters, settings, and configurations.
2. MLFlow Integration: Offers capabilities for tracking experiments, sharing insights, and collaborating across teams.
3. Visualization Tools: Provides interactive dashboards and visualizations for in-depth analysis.
4. Customizable Evaluation Metrics: Supports a wide range of evaluation metrics like BLEU, METEOR, ROUGE, and more.
5. RAG Model Evaluation: A detailed guide for evaluating the RAG pipeline using various datasets, models, and evaluation metrics.

Datasets Used for Evaluation

Evaluating the SuperKnowa Model requires diverse and representative datasets. Here’s an overview of the datasets utilized:

1. IBM Test Data: Enterprise-related questions and answers on IBM products.
2. CoQA Data: Conversational question-answering dataset.
3. QuAC Data: Question Answering in Context, a dialogue-based question-answering dataset.
4. TidyQA Data: General knowledge questions and answers.

Models Utilized for Comparison

To gauge the RAG Model’s effectiveness, it is compared against various other models, including but not limited to:

COGA: An IBM-trained foundation model fine-tuned for conversational question answering.
FlanT5-XXL: A larger variant of the FlanT5 model.
Bloom: BigScience Large Open-science Open-access Multilingual Language Model

Evaluation Metrics and Scores

A range of evaluation metrics is employed to provide a comprehensive assessment of the SuperKnowa Model:

BLEU Score: Measures the similarity between a generated sentence and a reference sentence.
METEOR Score: Matches words between generated and reference text while accounting for synonyms, stemming, and word order.
ROUGE Score: Measures the overlap between a generated summary and a reference summary.
SentenceSim Score: Compares semantic similarity between sentences.
SimHash Score: Used for duplicate detection.
Perplexity Score: Measures how well the model predicts the sample.
BLEURT Score: Evaluates text generation quality.
F1 Score: Harmonic mean of precision and recall.
BERT Score: Considers contextual embeddings from BERT.
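To make these scores concrete, here is a minimal sketch of computing two of them for a single generated answer against a reference. It assumes the rouge-score package is installed; the token-level F1 follows the common SQuAD-style definition and is illustrative rather than the toolkit’s exact implementation.

# Minimal sketch (not the toolkit's exact code): score one generated answer
# against a reference answer using ROUGE and a SQuAD-style token-level F1.
from collections import Counter
from rouge_score import rouge_scorer  # pip install rouge-score

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "Watson Assistant is IBM's conversational AI platform."
reference = "IBM Watson Assistant is a conversational AI product from IBM."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)  # reference first, prediction second

print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))
print("Token F1:  ", round(token_f1(prediction, reference), 3))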

Experiment Parameters

The evaluation process is controlled by various parameters which can be specified in the config file. These parameters may include:

Model Name: Specifies the model being evaluated.
Temperature: Controls randomness in model output.
Top P: Influences the selection of top predictions.
Dataset: Identifies the dataset used for evaluation.
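Purely as an illustration (the exact key names live in the toolkit’s JSON config, shown in the setup guide below), a single run’s parameters might be collected like this:

# Hypothetical parameter set for one evaluation run; key names are illustrative.
experiment_params = {
    "model_name": "google/flan-t5-xxl",  # model being evaluated
    "temperature": 0.7,                  # randomness of model output
    "top_p": 0.9,                        # nucleus sampling cutoff
    "dataset": "CoQA",                   # dataset used for evaluation
}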

Introduction to the MLFlow Package

The MLFlow package serves as an essential utility within the RAG pipeline evaluation framework. It’s designed to enhance and extend MLFlow’s functionalities, providing specific methods and functions tailored to the evaluation process’s needs. Whether you’re managing experiments, logging metrics, or visualizing results, this script offers valuable capabilities.

Key Functions and Utilities

Here’s a look at some of the key functions and utilities within the MLFlow package script:

1. Experiment Management: Functions to create, list, and manage MLFlow experiments.
2. Metric Logging: Methods for logging metrics, parameters, and artifacts in a standardized manner.
3. Model Tracking: Tools to log and manage models, including versioning and comparison.
4. Result Visualization: Utilities to visualize results using various plots and charts.
5. Run Management: Functions to start, monitor, and end MLFlow runs.
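For orientation, here is a minimal sketch of the kind of calls such utilities wrap, using the standard mlflow API directly (the metric values and file path below are placeholders, not output from the toolkit):

# Minimal sketch using the standard mlflow API; the toolkit's utilities wrap
# calls like these. Metric values and the artifact path are placeholders.
import mlflow

mlflow.set_experiment("rag-evaluation")  # create or select an experiment

with mlflow.start_run(run_name="flan-t5-xxl_coqa"):
    # Parameters describing this run
    mlflow.log_param("model_name", "google/flan-t5-xxl")
    mlflow.log_param("dataset", "CoQA")

    # Evaluation metrics (placeholder values)
    mlflow.log_metric("bleu", 0.31)
    mlflow.log_metric("rougeL", 0.42)

    # Attach the raw results file as an artifact
    mlflow.log_artifact("results/eval_results.csv")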

Benefits and Usage of the MLFlow Package

The MLFlow package offers several benefits that align with the evaluation framework’s goals:

Standardization: Provides a consistent approach to experiment management, metric logging, and result visualization.
Extensibility: Can be customized and extended to fit specific evaluation needs and scenarios.
Integration: Seamlessly integrates with the existing RAG pipeline evaluation framework, enhancing MLFlow’s capabilities.
Efficiency: Simplifies repetitive tasks, saving time and effort.

Evaluation package step-by-step guide

Step 1: Set up the evaluation package for the RAG pipeline

Follow the steps below to install the evaluation package for the LLM-based RAG pipeline.

1. Get the code base: Clone the repo and cd into the directory 5. LLM Model Evaluations/I. LLM Eval Toolkit/Eval_Package.
2. Install dependencies:
pip3 install -r setup.txt

3. Set up the JSON configuration file for evaluating models. The first part is the model config, containing your API key, the model name, and so on.

{
  "model": {
    "watsonxai_token": "Bearer Your-API-Key",
    "model_name": "google/flan-t5-xxl"
  },
  ...
}

The second part is the data config, containing parameters related to the input data and evaluation.

{
  ...
  "data": {
    "data_path": "path/to/dataset",
    "question": "instruction",
    "context": "input",
    "idea_answer": "output",
    "q_num": 5
  },
  ...
}

Finally, configure the result config part to define the file where the evaluation results will be saved.

4. Run the evaluation script

python3 eval_script.py
Setting up the evaluation package
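Under the hood, a script like eval_script.py reads the config sections above and loops over the evaluation data. The sketch below is illustrative only; it assumes a CSV dataset and uses a placeholder query_rag_pipeline() function, so the actual implementation in the repo will differ.

# Illustrative sketch, not the repo's eval_script.py: read the config and
# iterate over the first q_num questions of the dataset.
import json
import pandas as pd

with open("config.json") as f:           # config file name is illustrative
    config = json.load(f)

data_cfg = config["data"]
df = pd.read_csv(data_cfg["data_path"])  # assumes a CSV with the configured columns

for _, row in df.head(data_cfg["q_num"]).iterrows():
    question = row[data_cfg["question"]]
    context = row[data_cfg["context"]]
    reference = row[data_cfg["idea_answer"]]
    # generated = query_rag_pipeline(question, context)  # placeholder model call
    # scores = compute_metrics(generated, reference)     # BLEU, ROUGE, etc.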

Step 2: Infuse the MLFlow package into the RAG pipeline

These steps will quickly guide you through infusing the MLFlow package so you can easily manage and track evaluation experiments for your RAG pipeline.

1. Install dependencies:

python3 setup.py install

2. Import the library

import mlflow_utils

3. Set the path of your evaluation directory: Point to the directory that contains the evaluation results from the RAG pipeline.

path="set the path of your evaluation result DIR"
mlflow_utils.run_mlflow(path)

4. Start the MLFlow dashboard locally

mlflow ui

You can view the leaderboard at http://127.0.0.1:5000.

Evaluation Leaderboard

Conclusion and Recommendations

The SuperKnowa Model evaluation package offers a robust and insightful process for understanding the model’s performance, strengths, and areas for improvement. By employing various datasets, models, evaluation metrics, and experiment parameters, it serves as a comprehensive tool for researchers, data scientists, and practitioners.

Key takeaways include:

Versatility: The ability to evaluate the RAG pipeline across different domains and contexts.
Comparative Analysis: Understanding how the model stacks up against other state-of-the-art models.
In-Depth Insights: Gaining granular insights into model behavior and performance.
Scalability: Adapting the evaluation process to different models and tasks.

The SuperKnowa Model evaluation within the RAG pipeline framework exemplifies a structured and methodical approach to model assessment. Explore the provided notebook, datasets, and guides to delve deeper into the evaluation process and harness the full potential of the SuperKnowa Model.

Thanks to Sahil Desai for developing the evaluation package for our RAG pipeline. The full code and implementation details can be found in the GitHub Repo.

Follow Towards Generative AI for more content on the latest AI advancements.
