RAG — Advanced Methods and Evaluation Framework

Introduction

Pradeep Goel
5 min read · Jan 27, 2024

Retrieval Augmented Generation (RAG) is becoming a popular way to augment an LLM of your choice with your proprietary unstructured data. A large language model is pre-trained on a corpus of text sourced from publicly available information on the internet and has no context of enterprise-specific knowledge. RAG enables adding information via PDF, CSV, doc, markdown, and so on. In a simple RAG pipeline, information retrieval from your data source is limited by the query. Newer techniques are emerging to make this retrieval step better and thus improve the quality of responses. Here we will discuss the Sentence Window and Auto Merging augmentation techniques for retrieving better information from the retrieval store, and TruLens, which adds evaluation capabilities such as answer relevance, context relevance, and groundedness.

Enhancing retrieval to send better context

Sentence Window

Here a document is split into single sentences and each sentence is embedded using an embedding model. When a sentence is selected for retrieval, the surrounding sentences are returned along with it. The sentence window can be configured to increase the number of surrounding sentences, improving the context that is sent to the LLM. However, an ever-increasing sentence window may stop improving the response while still increasing cost. TruLens can help evaluate the right size of the sentence window.

Sentence Window Retrieval — image by author
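
For concreteness, here is a minimal sketch of sentence-window retrieval using LlamaIndex's SentenceWindowNodeParser, the parser used in the course referenced below. The file name is a placeholder, and an OpenAI API key is assumed for the default LLM and embedding model.

# Sentence-window parsing with LlamaIndex (v0.9-era imports,
# compatible with trulens-eval 0.18.0)
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split into single sentences; each node also keeps a window of
# three sentences on either side in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

service_context = ServiceContext.from_defaults(node_parser=node_parser)
documents = [Document(text=open("my_doc.txt").read())]  # placeholder file
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, swap each matched sentence for its full window
# before the context is sent to the LLM.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)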

Auto Merging

Fragmentation is one of the challenges when retrieving small chunks matching the query. Auto-merging addresses this issue: documents are broken into smaller chunks, called nodes, and these nodes are stored hierarchically. For example, say we set up a two-level hierarchical structure where the first level is made of parent nodes of 2048 tokens and the second level, of children or leaf nodes, of 512 tokens. This means that a document is broken into chunks of 2048 tokens, and each chunk is broken further into four chunks of 512 tokens. These leaf nodes are stored in the vector store, and the query is matched against them. The hierarchical information, i.e., the parent nodes and their relationships to their leaf nodes, is stored in a document store. When a defined threshold of leaf nodes under a parent matches the query, the parent node is returned to provide better context to the LLM.
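
As a concrete sketch, the same two-level hierarchy can be built with LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever, the components the referenced course uses. The file name is a placeholder, and default models are assumed.

# Auto-merging retrieval with LlamaIndex (v0.9-era imports)
from llama_index import Document, StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

documents = [Document(text=open("my_doc.txt").read())]  # placeholder file

# Two-level hierarchy: 2048-token parents, 512-token leaves.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# All nodes (with parent/child links) go into the docstore;
# only the leaf nodes are embedded in the vector index.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# When enough leaves under one parent match the query, the
# retriever replaces them with the parent node.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
)
query_engine = RetrieverQueryEngine.from_args(retriever)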
In addition to the above, a few more techniques, such as query expansion and enhancing a query with a hypothetical answer, are also emerging; details about them can be found here.

Evaluation metrics

We have three components: query, context, and response. Let us construct a set of metrics to measure changes in the quality of the response and context for any given query. We will measure the following three metrics:

Relevance Metrics — image by author
  • Answer Relevance

How relevant is the answer to the given query?

  • Groundedness

Is the answer based on the retrieved context?

  • Context Relevance

How relevant is the retrieved context to the given query?

A person could evaluate all three components and score the relevance, but that is not scalable, so we will set up an LLM to evaluate and score.

TruLens

TruLens is an open-source library that provides feedback functions for LLM evaluation. We will use these feedback functions to measure the metrics defined above.

Setting up TruLens

I ran pip install trulens-eval to install TruLens. It installed version 0.20.3, which threw an error (“AttributeError: module ‘openai’ has no attribute OpenAI”). To fix the issue, install TruLens version 0.18.0 by executing pip install --upgrade trulens-eval==0.18.0

Build a TruLens object

from trulens_eval import Tru
tru = Tru()
tru.reset_database()

TruLens provides feedback functions, and these functions can be defined with chain-of-thought (COT) reasoning. When set up with COT reasoning, the LLM provides the reason behind its scoring, which can be very useful for understanding the score and improving it further.

Answer Relevance

To evaluate answer relevance, we first create a feedback provider (the LLM that does the scoring) and then set up the relevance feedback with COT reasons, instructing it to evaluate on the input (query) and output (answer). f_qa_relevance will return a score along with the reasoning.

#Answer Relevance
from trulens_eval import Feedback, OpenAI as fOpenAI

# The feedback provider is the LLM that does the scoring (OpenAI here);
# the original post omits this setup, so it is added for completeness.
provider = fOpenAI()

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

Context Relevance

The retrieved context will consist of multiple nodes or chunks returned from the vector store. Store them in context_selection.

# Build an object having all selected source nodes

from trulens_eval import TruLlama

context_selection = TruLlama.select_source_nodes().node.text

Set up a feedback function to evaluate with COT reasons and instruct it to evaluate on the input (query) and on the context (the selected nodes). Use numpy to average the per-chunk context relevance scores.

# Context Relevance

import numpy as np

f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

Groundedness

Groundedness checks whether the generated answer is based on the retrieved context. The more the response draws on the context, the better the score. Start by setting up a grounded object.

from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

Set up a feedback function to evaluate with COT reasons and instruct it to evaluate the context (selected nodes) and the output (response). A higher score means greater use of the context and a better-grounded answer.

# Groundedness
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons,
             name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
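
With all three feedback functions defined, they are wired into a recorder that wraps the query engine, so every query gets scored. Here is a minimal sketch, assuming a LlamaIndex query engine named query_engine (built with either technique above) and the tru object created earlier; the app_id and query string are placeholders.

from trulens_eval import TruLlama

# Record every call to the engine and score it with the three
# feedback functions defined above.
tru_recorder = TruLlama(
    query_engine,                     # engine from either technique above
    app_id="Sentence Window Engine",  # placeholder label for the leaderboard
    feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness],
)

with tru_recorder as recording:
    query_engine.query("What does the document say about X?")  # placeholder query

# Compare apps on aggregate scores, or open the dashboard UI.
tru.get_leaderboard(app_ids=[])
tru.run_dashboard()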

Experimenting with sentence window retrieval and auto merging

In this short course, the initial sentence window is set to one, i.e., just adding one previous and one next sentence to the selected sentence. TruLens records all three feedback functions, and this becomes the baseline. Next, the sentence window is increased to three, and context relevance improves significantly, though cost also rises. Upon increasing the sentence window to five, context relevance improves only marginally while cost increases rapidly, suggesting that a window of three previous and next sentences is optimal for this document.
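
A hypothetical sweep over window sizes might look like the sketch below; build_sentence_window_engine and eval_questions are assumed helpers, not part of the course code.

# Hypothetical sweep: record each window size under its own app_id
# so the leaderboard shows relevance, groundedness, and cost side by side.
for window_size in [1, 3, 5]:
    engine = build_sentence_window_engine(window_size)  # assumed helper
    recorder = TruLlama(
        engine,
        app_id=f"sentence-window-{window_size}",
        feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness],
    )
    with recorder as recording:
        for question in eval_questions:  # assumed evaluation question set
            engine.query(question)

tru.get_leaderboard(app_ids=[])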
For the auto-merging experiment, a node structure of two levels (parent nodes of 2048 tokens and leaf nodes of 512 tokens) is set up and the feedback from TruLens is recorded. Upon moving to a three-level structure (parent nodes of 2048 tokens, intermediate nodes of 512 tokens, and leaf nodes of 128 tokens), context relevance goes up and the total cost comes down.

Conclusion

Experimenting with sentence window and auto-merging techniques with the help of TruLens can help find the sweet spot where LLMs generate better responses at optimal cost. Beyond the functions discussed here, TruLens also provides feedback functions to measure harmlessness, such as PII detection, toxicity, and stereotyping, and helpfulness, such as conciseness and prompt sentiment.

References:

  1. LangChain Chat with Your Data
  2. Building and Evaluating Advanced RAG
  3. Advanced Retrieval for AI with Chroma
