Evaluating RAG with LlamaIndex

Csakash
Feb 18, 2024


Building a RAG pipeline and evaluating it with a book on engineering jigs and fixtures

Play with the accompanying Colab notebook and dataset.

What are RAG and LlamaIndex?

Welcome to the fascinating realm of RAG (Retrieval-Augmented Generation) in conjunction with LlamaIndex, the framework we will use to build and evaluate it. RAG, a cutting-edge approach in natural language processing, integrates retrieval and generation techniques to enhance the quality of text generation systems. By seamlessly blending information retrieval with language generation, RAG has opened doors to new possibilities in tasks such as question answering, summarization, and more. In this introduction, we explore the RAG methodology and its synergistic relationship with LlamaIndex, a framework for building LLM applications that also ships with evaluation modules for measuring the efficacy and performance of such systems. Together, let’s uncover the nuances of RAG and its dynamic interaction with LlamaIndex, shaping the landscape of modern language understanding and generation.

In this article, we’ll delve into crafting a RAG pipeline and assessing it with LlamaIndex. This exploration unfolds in three segments.

  1. Understanding Retrieval Augmented Generation (RAG).
  2. Building RAG with LlamaIndex.
  3. Evaluating RAG with LlamaIndex.

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) represents a pivotal advancement for large language models (LLMs), which are traditionally trained on extensive datasets devoid of personalized or domain-specific information. RAG addresses this limitation by dynamically integrating user-specific data into the generation process, without necessitating modifications to the underlying training data of LLMs. Instead, it empowers the model to access and incorporate user data in real-time, thereby furnishing more contextually relevant and personalized responses.

Imagine chatting with a virtual assistant to plan a vacation. Traditional models offer generic suggestions. However, with Retrieval-Augmented Generation (RAG), when you mention preferences like budget or dietary restrictions, it instantly incorporates that data to provide personalized recommendations, ensuring your trip is tailored to your needs in real-time.

Within the framework of RAG, user data undergoes a process of loading and preparation, commonly referred to as indexing. This indexed data then becomes accessible for user queries, facilitating the filtration of relevant contextual information. Subsequently, the user query, along with the contextualized data, is presented to the LLM alongside a prompt, soliciting a response tailored to the specific context.
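To make this flow concrete before we touch LlamaIndex, here is a minimal, framework-free sketch of the loop described above. The retriever is a deliberately naive keyword-overlap stand-in and the chunks are invented examples; it only illustrates the shape of query, retrieved context, and prompt (real pipelines use vector embeddings, as shown later).

# A minimal, framework-free sketch of the RAG flow described above.
# The "retriever" here is a naive keyword-overlap stand-in, not a vector search.

def retrieve(query: str, indexed_chunks: list[str], top_k: int = 2) -> list[str]:
    query_terms = set(query.lower().split())
    # Score each chunk by how many query words it shares, keep the best top_k.
    ranked = sorted(indexed_chunks, key=lambda c: -len(query_terms & set(c.lower().split())))
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    # The retrieved context is stuffed into the prompt alongside the user query.
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context) +
        f"\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "A jig guides the cutting tool during machining.",
    "A fixture holds and locates the workpiece.",
    "Press tools are used to cut and form sheet metal.",
]
question = "What does a fixture do?"
print(build_prompt(question, retrieve(question, chunks)))  # this prompt would go to the LLM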

Whether you’re developing a chatbot or an agent, understanding RAG techniques is essential for seamlessly integrating data into your application and enhancing its contextual relevance.

RAG’s Five Fundamental Phases

RAG encompasses five pivotal stages essential for any comprehensive application (a minimal code sketch of these stages follows the list):

1. Acquisition: This phase involves retrieving your data from various sources, such as text files, PDFs, websites, databases, or APIs, and integrating it into your pipeline. LlamaHub offers a wide array of connectors for this purpose.

2. Indexing: Here, a data structure is created to facilitate data querying. For LLMs, this typically involves generating vector embeddings and employing various metadata strategies to ensure accurate retrieval of contextually relevant information.

3. Storage: Once data is indexed, it’s imperative to store the index and associated metadata to prevent the need for repeated indexing.

4. Querying: Utilizing LLMs and LlamaIndex data structures, this phase involves employing diverse querying strategies, including sub-queries, multi-step queries, and hybrid approaches, based on the chosen indexing strategy.

5. Assessment: Evaluation is crucial for gauging the efficacy of the pipeline relative to alternative strategies and for monitoring performance following any modifications. It furnishes objective metrics regarding the accuracy, fidelity, and efficiency of query responses.
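Below is a minimal sketch of how the first four stages map onto LlamaIndex calls, including the storage stage, which is not demonstrated later in the article. It assumes llama-index is installed and an OpenAI API key is set (as in the setup further down); the data path, persist directory, and query are illustrative.

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# 1. Acquisition: load the source documents.
documents = SimpleDirectoryReader("/content").load_data()

# 2. Indexing: build vector embeddings over the documents.
index = VectorStoreIndex.from_documents(documents)

# 3. Storage: persist the index so it does not have to be rebuilt every run.
index.storage_context.persist(persist_dir="./storage")

# Later (or in another session), reload instead of re-indexing.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# 4. Querying: ask a question against the indexed data.
print(index.as_query_engine().query("What is a drill jig?"))

# 5. Assessment: covered in the evaluation sections later in this article.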

Building a RAG system using the jigs and fixtures data

!pip install llama-index

import nest_asyncio

nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.llms.openai import OpenAI

import os
import pandas as pd

Set OpenAI API key

os.environ['OPENAI_API_KEY'] = 'sk'
# adding jig and fixtures data to memory and passing the path
documents = SimpleDirectoryReader("/content").load_data()


# Define an LLM
llm = OpenAI(model="gpt-4")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

You can even use VectorStoreIndex.from_documents(documents) to build the index in a single step, as sketched below.
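A hedged sketch of that one-step alternative, assuming the same documents variable as above; passing a SentenceSplitter via the transformations argument keeps the 512-token chunking (otherwise the default splitter is used):

# Build the index directly from documents; chunking is applied internally.
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

vector_index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=512)],
)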

Building a query engine

query_engine = vector_index.as_query_engine()

# By default it retrieves two similar nodes/ chunks. You can modify that in
# vector_index.as_query_engine(similarity_top_k=k).


response_vector = query_engine.query("What is the percentage of the tight part tolerance must be applied to the tool")
response_vector_1 = query_engine.query("What is duplicate locating jigs and fixtures design")

Checking the response:

response_vector.response

Let’s check the text in each of these retrieved nodes.

response_vector.source_nodes[0].get_text()

# Second retrieved node
response_vector.source_nodes[1].get_text()

Having constructed a RAG pipeline, the next step is evaluating its performance. We can gauge the effectiveness of our RAG system and query engine by employing LlamaIndex’s foundational evaluation modules. Let’s explore the utilization of these resources to measure the efficacy of our retrieval-augmented generation system.

Evaluation process

Evaluation stands as the key gauge for appraising your RAG application’s effectiveness. It scrutinizes the pipeline’s ability to generate precise responses across diverse data sources and query types.

Initially, analyzing individual queries and responses is helpful, but as the volume of edge cases and failures grows, this method may become unwieldy. Alternatively, establishing a set of summary metrics or automated evaluation methods can offer a more efficient approach. These tools offer insights into overall system performance and pinpoint specific areas warranting closer examination.

In a RAG system, evaluation focuses on two critical aspects:

  • Retrieval Evaluation: This assesses the accuracy and relevance of the information retrieved by the system.
  • Response Evaluation: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.

Generating Question-Context Pairs:

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

To effectively evaluate a RAG system, it’s crucial to generate queries capable of retrieving accurate context and generating suitable responses. LlamaIndex provides a specialized module called generate_question_context_pairs designed precisely for crafting question and context pairs. These pairs serve as valuable assets in assessing the performance of the RAG system, aiding both retrieval and response evaluation. For further insights on question generation, please consult the LlamaIndex documentation.
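Because generating these pairs calls the LLM for every chunk, it can be slow and costly to repeat. Here is a sketch of persisting the dataset to disk and reloading it; to the best of my knowledge the returned object is an EmbeddingQAFinetuneDataset exposing save_json and from_json, but treat the import path and file name as assumptions.

# Persist the generated question/context pairs so repeated evaluation runs
# don't have to re-query the LLM. The file name is arbitrary.
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

qa_dataset.save_json("jig_fixture_qa_dataset.json")
qa_dataset = EmbeddingQAFinetuneDataset.from_json("jig_fixture_qa_dataset.json")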

Retrieval Evaluation:

Our next step involves initiating retrieval evaluations. We’ll employ the RetrieverEvaluator with the evaluation dataset we’ve prepared.

retriever = vector_index.as_retriever(similarity_top_k=2)

Initially, we’ll instantiate the retriever, run it over the evaluation dataset, and define a helper function, display_results, which summarizes the evaluation outcomes.

Define RetrieverEvaluator. We use Hit Rate and MRR metrics to evaluate our Retriever.

Hit Rate:

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

Mean Reciprocal Rank (MRR):

MRR serves as a metric for evaluating the accuracy of a system by examining the rank of the highest-placed relevant document for each query. It calculates the average of the reciprocals of these ranks across all queries. For instance, if the first relevant document is ranked highest, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so forth.
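To make both metrics concrete, here is a tiny hand-worked example on made-up rankings (plain Python, not LlamaIndex): for each query we record the 1-based rank of the first relevant document within the top-k results, or None if it was missed.

# Hypothetical results for four queries with k = 2:
# query 1 -> relevant doc ranked 1st, query 2 -> 2nd, query 3 -> missed, query 4 -> 1st.
ranks = [1, 2, None, 1]

hit_rate = sum(r is not None for r in ranks) / len(ranks)        # 3/4 = 0.75
mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)  # (1 + 0.5 + 1)/4 = 0.625

print(hit_rate, mrr)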

Observation:

When using the Retriever with OpenAI Embedding, we find that it successfully retrieves relevant documents 75.86% of the time, which is quite good. However, when we look at the Mean Reciprocal Rank (MRR), which measures how accurately the most relevant documents are ranked, we see a score of 0.6206. This means there’s room for improvement because the top-ranked results aren’t always the most relevant. To enhance accuracy, we could consider using rerankers to reorder the retrieved documents for better relevance.

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Run the evaluator over every (query, expected context) pair in the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

display_results("OpenAI Embedding Retriever", eval_results)
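The observation above suggested rerankers as a way to push the most relevant chunk to the top. As one hedged option (not something the article itself covers), a cross-encoder reranker can be attached to the query engine as a node postprocessor; SentenceTransformerRerank requires the sentence-transformers package, and the model name and top_n below are illustrative.

# Over-retrieve candidates, then let a cross-encoder re-order them and keep the best 2.
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # illustrative model choice
    top_n=2,
)
query_engine_rerank = vector_index.as_query_engine(
    similarity_top_k=10,                # retrieve a wider candidate pool first
    node_postprocessors=[rerank],
)

Re-running the same hit rate and MRR evaluation on a reranked setup would show whether the top-ranked results actually improve.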

Response Evaluation:

  1. FaithfulnessEvaluator: Measures whether the response from a query engine matches any source nodes, which is useful for checking whether the response is hallucinated.
  2. RelevancyEvaluator: Measures whether the response plus the source nodes match the query.

# Get the list of queries from the dataset created above

queries = list(qa_dataset.queries.values())
print(queries)

Faithfulness Evaluator

Let’s start with FaithfulnessEvaluator.

# gpt-3.5-turbo for generating responses
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4 for evaluation
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

We will use gpt-3.5-turbo to generate responses for a given query and gpt-4 for evaluation.

Let’s create service_context separately for gpt-3.5-turbo and gpt-4.

# Create a QueryEngine with the gpt-3.5-turbo service_context to generate responses for the query.
vector_index = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine = vector_index.as_query_engine()

# Create a FaithfulnessEvaluator
from llama_index.core.evaluation import FaithfulnessEvaluator
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

eval_query = queries[10]

print(eval_query)
# Generate the response first, then use the faithfulness evaluator.
response_vector = query_engine.query(eval_query)

# Compute faithfulness evaluation
eval_result = faithfulness_gpt4.evaluate_response(response=response_vector)

# Check the `passing` attribute of eval_result to see whether it passed the evaluation.
eval_result.passing
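Beyond the boolean passing flag, the returned EvaluationResult also carries a numeric score and the evaluator’s textual feedback (as exposed by LlamaIndex’s EvaluationResult, to the best of my knowledge), which help when debugging failed cases:

# Inspect the full evaluation result, not just the pass/fail flag.
print(eval_result.passing)   # True / False
print(eval_result.score)     # numeric score behind the decision
print(eval_result.feedback)  # the evaluator LLM's verdict / reasoning text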

Relevancy Evaluator:

RelevancyEvaluator measures whether the response and the source nodes (the retrieved context) match the query; it is useful for checking whether the response actually answers the query.

from llama_index.core.evaluation import RelevancyEvaluator
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

Instantiate RelevancyEvaluator for relevancy evaluation with gpt-4

# Pick a query
query = queries[10]

print(query)

# Generate response.
# response_vector has response and source nodes (retrieved context)
response_vector = query_engine.query(query)

# Relevancy evaluation
eval_result = relevancy_gpt4.evaluate_response(
    query=query, response=response_vector
)

# The query printed above:
'''
"Describe the key features of the book 'Design of Jigs, Fixtures and Press Tools' by K. Venkataraman and explain how it is beneficial for undergraduate students in mechanical engineering and production/manufacturing engineering?"
'''
# Check the `passing` attribute of eval_result to see whether it passed the evaluation.
result = eval_result.passing
print(result)

# You can get the feedback for the evaluation.
feedback = eval_result.feedback
print(feedback)

Batch Evaluator:

Now that we have run the faithfulness and relevancy evaluations independently, LlamaIndex’s BatchEvalRunner can compute multiple evaluations in a batch.

from llama_index.core.evaluation import BatchEvalRunner

# Let's pick top 10 queries to do evaluation
batch_eval_queries = queries[:10]

# Initiate BatchEvalRunner to compute Faithfulness and Relevancy evaluation.
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

# Compute evaluation
eval_results = await runner.aevaluate_queries(
    query_engine, queries=batch_eval_queries
)
# Let's get faithfulness score

faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])

print(faithfulness_score)

# Let's get relevancy score

relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])

print(relevancy_score)

Observation:

A faithfulness score of 0.9 means our generated answers are almost always grounded in the retrieved context, with little sign of hallucination.

Similarly, a relevancy score of 0.8 means the answers are usually on target, matching both the retrieved context and the queries posed.

Conclusion:

In this notebook, we walked through building a RAG pipeline and evaluating it with LlamaIndex, focusing on how well the retrieval system performs and on the quality of the responses the pipeline generates.

Reference:

LlamaIndex offers a variety of other evaluation modules as well, which you can explore further in the LlamaIndex documentation.
