Efficient RAG Model Assessment Using RAGAS

Stan · 9 min read · May 20, 2024

Motivation

To deploy a Retrieval-Augmented Generation (RAG) model to production, we need clearly defined metrics so that we can monitor its performance over time and decide when to retrain or rebuild it. However, unlike traditional regression or classification machine learning models, language models are hard to evaluate with clear metrics, for two main reasons:

  • It is difficult to obtain a labeled ground-truth dataset for language models. Human labels are accurate but time-consuming and hard to scale.
  • There are no well-established evaluation metrics for RAG.

In this article, we address both issues using recent progress in LLM-based ground-truth labeling and the RAGAS library.
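For orientation, a RAGAS evaluation run looks roughly like the sketch below (a minimal example, assuming a recent `ragas` release and an OpenAI API key in the environment; the exact column names and metric set vary slightly across versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Toy evaluation set; in practice the rows come from your RAG pipeline
# plus the LLM-generated ground truth described below.
data = {
    "question": ["What does RAGAS evaluate?"],
    "answer": ["RAGAS scores RAG pipelines using LLM-based metrics."],
    "contexts": [["RAGAS is a library for evaluating Retrieval-Augmented Generation pipelines."]],
    "ground_truth": ["RAGAS evaluates RAG pipelines."],
}
dataset = Dataset.from_dict(data)

# Each metric is judged by an LLM under the hood.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # dict-like scores, one per metric
```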

Intuition and Steps to Create Ground Truth Dataset

Ground Truth Labeler: Since natural language is difficult and slow for humans to label at scale, we can use the latest large language models, such as GPT-4, as "language experts" to understand and label our dataset. The dataset is curated from question and context pairs created for the RAG pipeline:

  1. Select data chunks (contexts) from the RAG knowledge base
  2. Generate questions based on those chunks using GPT-3.5 (see the sketch below)
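A minimal sketch of step 2, assuming the v1 OpenAI Python client; the model name, prompt, and `chunks` list are illustrative placeholders, not taken from the article:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_question(chunk: str) -> str:
    """Ask GPT-3.5 for one question answerable solely from the given chunk."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Write one concise question that can be answered "
                           "using only the provided context.",
            },
            {"role": "user", "content": f"Context:\n{chunk}\n\nQuestion:"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


# Hypothetical chunks selected in step 1
chunks = ["RAGAS scores RAG pipelines on faithfulness and context quality."]
question_context_pairs = [(generate_question(c), c) for c in chunks]
```

These question and context pairs form the curated dataset that the GPT-4 "language expert" labels and that the RAGAS metrics are later scored against.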
