The aRt of RAG Part 4: Retrieval evaluation

Ross Ashman (PhD)
7 min read · May 30, 2024


Photo by David Travis on Unsplash

Introduction

What is the job of the retriever? Its job is to find relevant information, or, put another way, context, from which our LLM can form an answer. Without any context the LLM is useless; without the right context it is not much better. The retriever therefore plays a vital part in utilising LLMs on unseen or private data. In short, RAG is (Information Retrieval + LLM).

In parts 1 to 3 we looked at extracting information from documents via OCR, creating a MongoDB repository for our OCR’ed data, and adding Atlas search using both lexical and semantic indexes. We also looked at the use of cross-encoders for reranking our results. The question we had at the end of this was: how well does our system perform?

Understanding Metrics in Retrieval Evaluation

To gauge the efficacy of our retrieval system, we need to test it against some appropriate metric. This is where things get complicated. What do we mean by appropriate? The goal is to retrieve relevant information. Does this mean a single exact match, a ranked list of similar matches, or a list of something vaguely related to a query? You can think of these scenarios as retrieving specific or broad context for our LLM to deal with. Which one applies will be a function of the use case.

Thankfully, evaluating retrieval systems is a well-trodden problem in the field of Information Retrieval, and there is a range of established metrics we can use. We can break them down by use-case scenario.

Binary retrieval

In this scenario we want to retrieve a document with an exact or very close word-for-word match. The metrics for this scenario are those we would use for classification, e.g. Precision, Recall and F1.

Paraphrase retrieval

Very similar to binary retrieval, except now the documents are similar in meaning but worded very differently. Again we can use Precision, Recall and F1.

Ranked list retrieval

In this scenario we want a ranked list of relevant documents. The commonly used metrics are listed below; a short sketch showing how some of them are computed follows the list.

  • Precision@K: Precision measures the proportion of retrieved documents that are relevant. Precision@K evaluates precision at a specific rank K in the retrieved list. It’s computed as the number of relevant documents among the top K retrieved documents divided by K.
  • Recall@K: Recall measures the proportion of relevant documents that are retrieved. Recall@K evaluates recall at a specific rank K in the retrieved list. It’s computed as the number of relevant documents among the top K retrieved documents divided by the total number of relevant documents.
  • Mean Average Precision (MAP): MAP is the mean, over all queries, of each query’s Average Precision, which takes into account both the precision and the rank position of every relevant document in the retrieved list.
  • Mean Reciprocal Rank (MRR): MRR calculates the average reciprocal of the rank of the first relevant document retrieved. For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on. It’s particularly useful when only the first relevant document matters.
  • Normalised Discounted Cumulative Gain (NDCG): NDCG measures the quality of the ranked list by considering the graded relevance of retrieved documents. It’s computed based on the positions of relevant documents in the ranked list.
  • Hit Rate: It measures the proportion of queries for which at least one relevant document is retrieved among the top K results. The hit rate is particularly useful when evaluating the system’s ability to provide relevant information to users within a limited number of retrieved documents.
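To make these definitions concrete, here is a minimal sketch (toy data and hypothetical helper names, not any library’s API) of how Precision@K, Recall@K and the reciprocal rank behind MRR can be computed for a single query:

def precision_at_k(retrieved, relevant, k):
    # fraction of the top-k retrieved docs that are relevant
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # fraction of all relevant docs that appear in the top-k
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1/rank of the first relevant doc, 0 if none is retrieved
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# toy example: ranked list returned for one query
retrieved = ["doc3", "doc7", "doc1", "doc9"]
relevant = {"doc1", "doc3"}
print(precision_at_k(retrieved, relevant, 3))  # 2/3
print(recall_at_k(retrieved, relevant, 3))     # 1.0
print(reciprocal_rank(retrieved, relevant))    # 1.0 (doc3 is ranked first)

Averaging these per-query numbers over a whole query set gives the reported Precision@K, Recall@K and MRR figures.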

Where do we start?

In order to perform our evaluation we need the following:

  1. Data: A dataset consisting of queries and their corresponding relevant documents or passages. Optionally, you can also have a separate set of queries with their corresponding irrelevant documents, to evaluate the retriever’s ability to filter out irrelevant information.
  2. Baseline Model: For each query in the evaluation dataset, use the model in the retriever to retrieve a set of documents or passages from the corpus. I use the word Model here to cover both sparse and dense retrievers. Typically, the retriever returns a ranked list of documents, with the most relevant ones ranked higher. The baseline model is the one chosen as a reference.
  3. Annotation: Annotations of the relevance of the retrieved documents or passages for each query. Relevance annotations can be binary (relevant or not relevant) or graded (e.g., highly relevant, somewhat relevant, not relevant).

Annotated Data

Table from “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models” showing datasets for retrieval benchmarking

As we are evaluating pre-trained embedding models, we are going to use BEIR (Benchmark for zero-shot Evaluation of Information Retrieval models). This benchmark includes a number of annotated datasets covering various scenarios. For completeness, we are going to transform a dataset into the BEIR format to get an idea of how we could do this with our own data. The data we will use is the annotated SICK (Sentences Involving Compositional Knowledge) dataset.
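A BEIR-formatted dataset is simply a directory containing a corpus file, a queries file and a folder of relevance judgements (qrels). For the SICK conversion below, the target layout looks roughly like this (the field names follow what our conversion code writes and what the BEIR loader expects):

trec-sick/
    corpus.jsonl        # one JSON object per document: {"_id": ..., "title": ..., "text": ...}
    queries.jsonl       # one JSON object per query: {"_id": ..., "text": ...}
    qrels/
        test.tsv        # tab-separated relevance judgements: query id, document id, score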

Which model is best?

In the BEIR paper they use BM25 as their baseline model.

Unfortunately, various papers have reported conflicting results for BM25 on the same BEIR benchmark datasets. BM25 effectiveness can vary with the choice of hyperparameters and with the linguistic preprocessing used in different implementations, such as stop-word removal, stemming and tokenisation.
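As a reminder of where those hyperparameters enter, here is a rough sketch of the classic Okapi BM25 scoring function (illustrative only, not the exact code used by any particular library):

import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenised documents; doc_terms: the tokenised document being scored
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)           # number of docs containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # Lucene-style IDF
        tf = doc_terms.count(term)                          # term frequency in this document
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

# toy usage
docs = [["a", "man", "plays", "guitar"], ["a", "woman", "sings"]]
print(bm25_score(["man", "guitar"], docs[0], docs, k1=1.2, b=0.75))

Different toolkits make different choices for k1, b, the IDF variant and the tokenisation applied beforehand, which is one reason published BM25 numbers on the same dataset rarely agree.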

Be aware that your results may not directly match those published for the same dataset, so we can only really make relative comparisons; hence the use of our own baseline model. Whatever baseline model you choose, stay consistent with it across your evaluation process.

Building BEIR dataset

OK, let’s get started. Once we have the SICK dataset we convert it to the BEIR format. There are four functions we will use:

import pandas as pd
import csv
from beir.datasets.data_loader import GenericDataLoader

def to_corpus(doc_id, text, title):
    if len(title) == 0:
        title = [""] * len(text)
    h = list(zip(doc_id, title, text))
    corpus = [{'_id': d, 'title': k, 'text': v} for d, k, v in h]
    return corpus

def to_queries(qid, text):
    queries = [{'_id': k[0], 'text': k[1]} for k in list(zip(qid, text))]
    return queries

def to_qrels(dups):
    # map each query to its documents and relatedness scores
    dups['dicts'] = dups.apply(lambda x: dict(zip(x['doc_id'], x['relatedness_score'])), axis=1)
    dups2 = dups.explode(['doc_id', 'relatedness_score'])
    qrels = dups2[['qid', 'doc_id', 'relatedness_score']]
    qrels.rename(columns={'relatedness_score': 'score'}, inplace=True)
    return qrels

def to_jsonl(data, filename):
    import json
    with open(filename, 'w') as outfile:
        for entry in data:
            json.dump(entry, outfile)
            outfile.write('\n')

if __name__ == "__main__":

    filen = '/retriever_evaluation/SICK/SICK.txt'
    df = pd.read_csv(filen, quoting=csv.QUOTE_NONE, doublequote=True, sep='\t')
    df['doc_id'] = df['pair_ID'].apply(lambda x: 'doc' + str(x))
    df['relatedness_score'] = df['relatedness_score'].astype(float)
    corpus = to_corpus(df['doc_id'].to_list(), df['sentence_B'].to_list(), title=[])

    # group duplicate sentence_A values so each unique sentence becomes one query
    dups = df.groupby('sentence_A', as_index=False).agg(list)
    dups["myindex"] = dups.index
    dups['qid'] = dups["myindex"].apply(lambda x: 'q' + str(x))
    queries = to_queries(dups['qid'].to_list(), dups['sentence_A'].to_list())

    qrels = to_qrels(dups)
    qrels['score'] = qrels['score'].apply(lambda x: round(x))  # graded relevance as integers

    corpus_path = "/retriever_evaluation/BEIR/trec-sick/corpus.jsonl"
    to_jsonl(corpus, corpus_path)
    query_path = "/retriever_evaluation/BEIR/trec-sick/queries.jsonl"
    to_jsonl(queries, query_path)
    qrels_path = "/retriever_evaluation/BEIR/trec-sick/qrels/test.tsv"
    qrels.to_csv(qrels_path, sep='\t', index=False)

    corpus, queries, qrels = GenericDataLoader(
        corpus_file=corpus_path,
        query_file=query_path,
        qrels_file=qrels_path).load_custom()

The code above will generate three files: a corpus file, a query file, and a qrels file. We then use the GenericDataLoader to read in those files to generate our variables corpus, queries and qrels.
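For reference, the three variables returned by load_custom() are plain Python dictionaries, roughly of the following shape (toy values for illustration):

corpus = {"doc1": {"title": "", "text": "A man is playing a guitar"}}   # doc_id -> document fields
queries = {"q0": "A man is strumming a guitar"}                         # query_id -> query text
qrels = {"q0": {"doc1": 4}}                                             # query_id -> {doc_id: relevance grade}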

Evaluation

Now that we have our dataset we can perform our evaluation. We will use three functions for this. The first creates the retriever, the second performs the queries, and the third performs the evaluation.

Creating a retriever

For this we will use LlamaIndex. We will also use ChromaDB for the vector store, so before running this code we will need to install ChromaDB and the relevant LlamaIndex integrations.
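Something along these lines should install the pieces used below (exact package names can vary between LlamaIndex versions, so treat this as a guide rather than gospel):

pip install beir chromadb llama-index llama-index-vector-stores-chroma llama-index-embeddings-huggingface llama-index-retrievers-bm25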

def create_retriever(documents, model_name):
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.retrievers.bm25 import BM25Retriever
    from llama_index.core import VectorStoreIndex, StorageContext, ServiceContext
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.vector_stores.chroma import ChromaVectorStore

    if model_name == 'bm25':
        nodes = SentenceSplitter().get_nodes_from_documents(documents)

        # We can pass in the index, docstore, or list of nodes to create the retriever
        retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)
        return retriever

    else:
        if '/' in model_name:
            name = model_name.split('/')[1]
        else:
            name = model_name
        import chromadb
        chroma_client = chromadb.EphemeralClient()

        # make sure we start from an empty collection for this model
        chroma_collection = chroma_client.get_or_create_collection(name=name)
        chroma_client.delete_collection(name=name)
        chroma_collection = chroma_client.create_collection(name=name, metadata={"hnsw:space": "cosine"})

        embed_model = HuggingFaceEmbedding(model_name)
        service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
        vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        nodes = SentenceSplitter().get_nodes_from_documents(documents)
        index = VectorStoreIndex(
            nodes=nodes, service_context=service_context, storage_context=storage_context, show_progress=True
        )
        retriever = index.as_retriever(similarity_top_k=10)

        return retriever

Query engine

Once we have our retriever, we can perform our queries and capture results. The code for the query engine is:

import tqdm
from typing import List, Optional
from llama_index.core.schema import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor

def query(queries, retriever, node_postprocessors: Optional[List[BaseNodePostprocessor]] = None):
    results = {}
    for key, query_str in tqdm.tqdm(queries.items()):
        nodes_with_score = retriever.retrieve(query_str)
        node_postprocessors = node_postprocessors or []
        for node_postprocessor in node_postprocessors:
            nodes_with_score = node_postprocessor.postprocess_nodes(
                nodes_with_score, query_bundle=QueryBundle(query_str=query_str)
            )
        # map each retrieved document id to its retrieval score for this query
        results[key] = {
            node.node.metadata["doc_id"]: node.score
            for node in nodes_with_score
        }
    return results
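The returned results dictionary matches the format BEIR’s evaluator expects: a mapping from query id to the retrieved document ids and their scores. A toy example:

# query_id -> {doc_id: retrieval score}
results = {"q0": {"doc12": 0.83, "doc7": 0.61}}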

Evaluation function

Finally, once we have our query results we can evaluate how well our retriever performs. The function below evaluates the results from the retriever and returns NDCG@k, MAP@k, Recall@k, Precision@k and F1@k.

def evaluate(model_name, qrels, results, metrics_k_values):
    """
    A bunch of useful reference links
    https://github.com/beir-cellar/beir/blob/main/beir/retrieval/evaluation.py
    https://github.com/run-llama/llama_index/blob/ccc0b85a19eba681657f2615540e5ddfd6152505/llama-index-core/llama_index/core/evaluation/retrieval/evaluator.py
    https://github.com/run-llama/llama_index/blob/ccc0b85a19eba681657f2615540e5ddfd6152505/llama-index-core/llama_index/core/evaluation/benchmarks/beir.py#L100
    https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/benchmarks/beir.py
    https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/
    https://docs.llamaindex.ai/en/stable/examples/evaluation/BeirEvaluation/
    https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/
    """
    from beir.retrieval.evaluation import EvaluateRetrieval

    print("Evaluating retriever on questions against qrels")

    ndcg, map_, recall, precision = EvaluateRetrieval.evaluate(
        qrels, results, metrics_k_values
    )
    print("Results for:", model_name)
    for k in metrics_k_values:
        print(
            {
                f"NDCG@{k}": ndcg[f"NDCG@{k}"],
                f"MAP@{k}": map_[f"MAP@{k}"],
                f"Recall@{k}": recall[f"Recall@{k}"],
                f"Precision@{k}": precision[f"P@{k}"],
                f"F1@{k}": 2 * precision[f"P@{k}"] * recall[f"Recall@{k}"] / (precision[f"P@{k}"] + recall[f"Recall@{k}"]),
            }
        )
    print("-------------------------------------")

    file_output_data = {}
    for k in metrics_k_values:
        file_output_data[f"NDCG@{k}"] = ndcg[f"NDCG@{k}"]
        file_output_data[f"MAP@{k}"] = map_[f"MAP@{k}"]
        file_output_data[f"Recall@{k}"] = recall[f"Recall@{k}"]
        file_output_data[f"Precision@{k}"] = precision[f"P@{k}"]
        file_output_data[f"F1@{k}"] = 2 * precision[f"P@{k}"] * recall[f"Recall@{k}"] / (precision[f"P@{k}"] + recall[f"Recall@{k}"])

    return file_output_data

We can use these functions to evaluate any Hugging Face embedding model we like against the LlamaIndex implementation of BM25. Bringing them together looks like this:

from llama_index.core import Document

# path to the BEIR-formatted dataset we built above;
# the helper functions above are assumed to live in a module imported here as ev
data_path = "/retriever_evaluation/BEIR/trec-sick"
corpus, queries, qrels = ev.GenericDataLoader(data_path).load(split="test")

documents = []
for _id, val in corpus.items():
    doc = Document(
        text=val["text"], metadata={"title": val["title"], "doc_id": _id}
    )
    documents.append(doc)

model_names = ['distilbert-base-nli-stsb-mean-tokens', 'BAAI/bge-base-en', 'msmarco-distilbert-base-v3', 'bm25']
model_name = model_names[2]
retriever = ev.create_retriever(documents, model_name)
search_results = ev.query(queries, retriever)
metrics_k_values = [1, 3, 5, 10]
eval_results = ev.evaluate(model_name, qrels, search_results, metrics_k_values)
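To compare several models rather than just one, the same calls can be wrapped in a loop. A sketch, assuming the functions above live in the module imported as ev and that pandas is installed:

import pandas as pd

all_results = {}
for name in model_names:
    retriever = ev.create_retriever(documents, name)
    search_results = ev.query(queries, retriever)
    all_results[name] = ev.evaluate(name, qrels, search_results, metrics_k_values)

# rows = models, columns = the metric@k values returned by evaluate()
comparison = pd.DataFrame(all_results).T
print(comparison[["NDCG@10", "MAP@10", "Recall@10", "Precision@10", "F1@10"]])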

Summary and final thoughts

Evaluating retrieval systems is not a trivial task and there are plenty of gotchas. It also doesn’t seem to get the attention it deserves, which was the motivation for this article. There has been a proliferation of evaluation frameworks. However, I think understanding the hidden complexities means we can ask better-informed questions about their advantages and disadvantages.

