Frameworks in Focus: ‘Building and Evaluating Advanced RAG’ with TruLens and LlamaIndex Insights

Lakshmi narayana .U
10 min read · Jan 25, 2024
Rembrandt’s Vision of a Futuristic AI Lab: image created by author with DALL.E-3

Introduction

Retrieval Augmented Generation (RAG) is a useful technique for making Large Language Models (LLMs) much smarter, helping them use specific user data to give better answers. RAG is important right now because fitting whole documents into the context window poses several challenges: a limit on the length of model inputs, increased computational cost, and problems like “lost in the middle”, where a model struggles to use information located in the middle of a long input context.

Source: Lost in the Middle: How Language Models Use Long Contexts Research Paper

… as the context window keeps growing larger, there could be a way to sidestep RAG completely and build on a content dump and custom instructions.

However, as long as RAG remains relevant, developing an effective RAG system requires appropriate methods to extract the most useful information for the LLM to utilize. Additionally, it is imperative to implement a robust evaluation framework to assess the RAG system’s performance, both during initial development and in subsequent use. This is where RAG evaluation frameworks become critical.

Several RAG evaluation frameworks are currently available, including DeepEval, MLFlow, RAGAs, Deepchecks, Arize AI, and TruLens. It’s crucial to note that selecting the right evaluation metrics and procuring high-quality validation data is an area of active research. Given the rapid evolution of this field, we are witnessing a variety of approaches to RAG evaluation, including the RAG Triad of metrics, ROUGE, ARES, BLEU, and RAGAs. This article will predominantly concentrate on evaluating a RAG pipeline using the RAG Triad of metrics and TruLens.

Evaluations for Retrieval Augmented Generation using TruLens and LlamaIndex

We will delve into the comprehensive course from Deeplearning.ai — Building and Evaluating Advanced RAG — to gain a profound understanding of constructing an Advanced RAG Pipeline and subsequently evaluating its performance utilizing TruLens.

This course explores high-level strategies for refining Retrieval Augmented Generation (RAG) systems, which are key to helping Large Language Models (LLMs) answer questions over user data. It is taught by Jerry Liu, co-founder and CEO of LlamaIndex, and Anupam Datta, co-founder and Chief Scientist of TruEra.

Andrew Ng, Jerry Liu and Anupam Datta

Key Components of the Course

1. Deep dive into two Advanced Retrieval Techniques:

- Sentence Window Retrieval: This method enhances the context provided to the LLM by retrieving a window of sentences around the most relevant sentence, rather than just the sentence itself. This approach improves the LLM’s understanding of the context.

- Auto-Merging Retrieval: Organizes documents into a hierarchical, tree-like structure. If multiple child nodes (smaller text chunks) are relevant to a query, the entire parent node (larger text chunk) is retrieved. This method dynamically generates more coherent text chunks than simpler retrieval methods.

2. Evaluation Framework — The RAG Triad:

- Context Relevance: Measures the relevance of retrieved text chunks to the user’s query, aiding in debugging and refining the retrieval process.

- Groundedness: Assesses how well the LLM’s response is supported by the retrieved context.

- Answer Relevance: Evaluates the relevance of the LLM’s response to the original query.

3. Systematic Iteration and Improvement:

- The course emphasizes a systematic approach to building and refining QA systems, akin to error analysis in machine learning. This methodology enhances efficiency in developing reliable QA systems.

- Hands-on practice is provided for iterating these retrieval methods and evaluation metrics.

- Systematic experiment tracking is taught to establish baselines and facilitate rapid improvement.

4. Practical Application and Tuning Advice:

- Insights and tuning suggestions are shared based on experiences from partners who have been building RAG applications.

Let’s now look into the salient features of the course, along with an analysis of my exercise workbook.

Tips for running the course notebooks

One challenge I encountered while running the notebooks on my own computer was frequent attribute errors. I tried Google Colab as well and got the same errors, as shown below.

Source: Author’s Google Colab

The only way I could test the course code was to run it in the course’s own environment.

For all the tests, I used my book ‘Directing Business’ as a source.

Source: Author’s exercise workbook

Evaluation Framework — The RAG Triad of Metrics

Source: Course from Deeplearning.ai, and TruLens website

Constructing the RAG Triad

1. Context Relevance: Assesses the quality of retrieved context in relation to the user’s query.

2. Groundedness: Measures how well the RAG’s final response is supported by the retrieved context.

3. Answer Relevance: Evaluates the relevance of the RAG’s final response to the original user query.

Each metric is implemented as a feedback function*, using OpenAI’s GPT-3.5 as the evaluation provider. These functions not only score responses but also supply supporting evidence in the form of chain-of-thought reasoning behind the scores; a sketch of how they are wired up appears below the figure.

*A feedback function provides a score after reviewing an LLM app’s inputs, outputs, and intermediate results.

Source: Course from Deeplearning.ai
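
For reference, here is a minimal sketch of how the three feedback functions can be defined with trulens_eval (based on the API from the course era; the variable names are my own, and qs_relevance is the context relevance measure):

import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

provider = fOpenAI()  # GPT-3.5 as the evaluation provider

# Answer Relevance: scores the response against the original question
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

# Context Relevance: scores each retrieved chunk against the question, averaged
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

# Groundedness: checks that the response is supported by the retrieved context
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)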

Advanced RAG Technique — Sentence Window Retrieval Method

This method is designed to improve the matching of relevant context during retrieval and subsequently enhance the synthesis of answers.

Fundamentals of Sentence Window Retrieval

The key innovation in this method lies in its approach to handling text chunks for embedding and synthesis. Unlike the standard RAG pipeline, which uses the same chunk size for both, the sentence window retrieval decouples them:

  • Smaller Chunks for Embedding: For the embedding process, smaller chunks or sentences are used. These are stored in a vector database with added context from adjacent sentences.
  • Expanded Context for Synthesis: During retrieval, the most relevant sentences are fetched based on similarity search. Then, instead of using just these sentences, an expanded context window surrounding these sentences is provided to the LLM for synthesizing the answer.

Key steps in setting up the Sentence Window Retriever

  1. Implementing the Node Parser: A sentence window node parser divides a document into individual sentences and augments each sentence chunk with surrounding context.
from llama_index.node_parser import SentenceWindowNodeParser

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
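
To see what the parser produces, it can be run over a toy snippet (my own illustrative text, not from the course):

from llama_index import Document

toy_nodes = node_parser.get_nodes_from_documents(
    [Document(text="hello. how are you? I am fine! foo. bar.")]
)
# each node stores its sentence plus a window of neighboring sentences
print(toy_nodes[0].metadata["original_text"])
print(toy_nodes[0].metadata["window"])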

2. Building the Index: The index is built using OpenAI’s GPT-3.5 Turbo and a service context object containing the LLM, embedding model, and node parser.

from llama_index import ServiceContext
from llama_index.llms import OpenAI

# GPT-3.5 Turbo as the synthesis LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    # embed_model="local:BAAI/bge-large-en-v1.5"
    node_parser=node_parser,
)

from llama_index import VectorStoreIndex

sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)
sentence_index.storage_context.persist(persist_dir="./sentence_index")
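
On later runs, the persisted index can be reloaded rather than rebuilt (a sketch assuming the same persist directory):

import os
from llama_index import StorageContext, load_index_from_storage

if os.path.exists("./sentence_index"):
    sentence_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./sentence_index"),
        service_context=sentence_context,
    )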

3. Query Engine and Post-Processing: The query engine is set up with a metadata replacement post-processor that replaces node text with the full window context. Additionally, a sentence transformer re-rank model is applied to re-order nodes by relevance (the assembled engine is shown after the snippets below).

from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

from llama_index.schema import NodeWithScore
from copy import deepcopy

# `nodes` are the sentence-window nodes produced earlier by the node parser
scored_nodes = [NodeWithScore(node=x, score=1.0) for x in nodes]
nodes_old = [deepcopy(n) for n in nodes]
replaced_nodes = postproc.postprocess_nodes(scored_nodes)

# Adding a re-ranker
from llama_index.indices.postprocessor import SentenceTransformerRerank

# BAAI/bge-reranker-base
# link: https://huggingface.co/BAAI/bge-reranker-base
rerank = SentenceTransformerRerank(
    top_n=2, model="BAAI/bge-reranker-base"
)

from llama_index import QueryBundle
from llama_index.schema import TextNode, NodeWithScore

# toy example: the re-ranker promotes the node that actually matches the query
query = QueryBundle("I want a dog.")

scored_nodes = [
    NodeWithScore(node=TextNode(text="This is a cat"), score=0.6),
    NodeWithScore(node=TextNode(text="This is a dog"), score=0.4),
]
reranked_nodes = rerank.postprocess_nodes(
    scored_nodes, query_bundle=query
)
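
With both post-processors defined, the sentence window query engine can be assembled along the lines the course follows (the question here is an illustrative one against my book):

sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=6, node_postprocessors=[postproc, rerank]
)

window_response = sentence_window_engine.query(
    "What are the keys to directing a business?"
)
print(str(window_response))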

4. Evaluation with TruLens: The final part of the setup involves evaluating the retriever using TruLens and the RAG Triad on a set of evaluation questions, focusing on experimenting with parameters and assessing their impact on performance.

from trulens_eval import Tru
from utils import get_prebuilt_trulens_recorder

# start each experiment from a clean evaluation database
Tru().reset_database()

def run_evals(eval_questions, tru_recorder, query_engine):
    for question in eval_questions:
        with tru_recorder as recording:
            response = query_engine.query(question)
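
A typical evaluation run then looks something like this (a sketch; get_prebuilt_trulens_recorder is the course helper from utils.py, and the app_id is arbitrary):

tru_recorder = get_prebuilt_trulens_recorder(
    sentence_window_engine, app_id="sentence window engine"
)
run_evals(eval_questions, tru_recorder, sentence_window_engine)

# launch the local dashboard with the leaderboard and per-record scores
Tru().run_dashboard()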

Evaluating Sentence Window Size and Trade-offs

- Gradually increasing the sentence window size and observing its effects on context relevance, groundedness, and answer relevance.

from llama_index.llms import OpenAI
from utils import build_sentence_window_index  # course helper

sentence_index_1 = build_sentence_window_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",
    sentence_window_size=1,
    save_dir="sentence_index_1",
)
  • Balancing the trade-offs between the quality of the app (evaluation metrics) and the cost (token usage).
TruLens App Dashboard - Source: Author’s exercise workbook

Note: TruLens also provides detailed JSON output for the app and its records.
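
The records can also be pulled into a pandas DataFrame for closer inspection (a sketch using trulens_eval’s API):

records_df, feedback_names = Tru().get_records_and_feedback(app_ids=[])
print(records_df[["input", "output"] + feedback_names].head())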

Advanced RAG Technique — Auto-Merging Retrieval

The auto-merging retrieval method, an advanced technique in Retrieval Augmented Generation (RAG) systems, is designed to address the issue of fragmented context chunks.

Understanding Auto-Merging Retrieval

The primary challenge in standard RAG pipelines is dealing with fragmented context chunks, especially when the chunk size is small. This fragmentation can hinder the Large Language Model’s (LLM) ability to effectively synthesize information from the retrieved context. The auto-merging technique aims to address this by:

  • Defining a Hierarchy: Creating a hierarchy of smaller chunks linked to larger parent chunks.
  • Merging During Retrieval: If a significant number of smaller chunks related to a parent chunk are retrieved, they are merged into a larger parent chunk. This approach ensures more coherent context is provided to the LLM.
Source: Course on Deeplearning.ai

Key Steps in Setting Up Auto-Merging Retrieval

  1. Hierarchical Node Parsing: Nodes are parsed hierarchically to establish relationships between smaller chunks and their parent chunks. Demonstrations show how to create toy parsers and extract nodes from documents.
from llama_index.node_parser import HierarchicalNodeParser

# create the hierarchical node parser w/ default settings
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = node_parser.get_nodes_from_documents([document])

from llama_index.node_parser import get_leaf_nodes

leaf_nodes = get_leaf_nodes(nodes)

# walk from a leaf chunk (128 tokens) up to its parent chunk (512 tokens)
nodes_by_id = {node.node_id: node for node in nodes}
parent_node = nodes_by_id[leaf_nodes[30].parent_node.node_id]

2. Index Construction: An index is built using OpenAI’s GPT-3.5 Turbo, focusing on embedding the leaf nodes (smallest chunks) while maintaining a relationship with the parent nodes.

from llama_index import ServiceContext

auto_merging_context = ServiceContext.from_defaults(
    llm=llm,  # the same GPT-3.5 Turbo instance as before
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)

from llama_index import VectorStoreIndex, StorageContext

# the docstore holds every node, but only leaf nodes are embedded
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, service_context=auto_merging_context
)

automerging_index.storage_context.persist(persist_dir="./merging_index")

3. Query Engine and Retrieval Logic: The query engine is set up with an auto-merging retriever controlling the merging logic. A re-rank module is also integrated to refine the retrieval process further.

from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# the base retriever fetches a generous number of leaf nodes...
automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=12
)

# ...and the AutoMergingRetriever merges them into parents when enough siblings match
retriever = AutoMergingRetriever(
    automerging_retriever,
    automerging_index.storage_context,
    verbose=True
)

rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")

# the merging retriever (not the base one) is wired into the query engine
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)
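
Querying the engine then works as with any other query engine (the question is illustrative):

auto_merging_response = auto_merging_engine.query(
    "What is the role of a director in a business?"
)
print(str(auto_merging_response))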

4. Evaluation with TruLens: The auto-merging retriever is evaluated using the RAG triad metrics with TruLens, emphasizing experimentation with various parameters.

Iterating on Auto-Merging Parameters

The course encourages experimentation with different hierarchical structures and chunk sizes to optimize the RAG system (a configuration sketch follows the list below):

- Adjust the number of layers and chunk sizes in the hierarchy.

- Evaluate different configurations using the RAG triad metrics.

- Observe the trade-offs between quality metrics and operational costs.
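
For example, a two-layer hierarchy can be built and compared against the three-layer default (a sketch assuming the course’s build_automerging_index helper from utils.py):

auto_merging_index_0 = build_automerging_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index_0",
    chunk_sizes=[2048, 512],  # two layers instead of [2048, 512, 128]
)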

Source: Author’s exercise workbook
TruLens App Dashboard - Source: Author’s exercise workbook

Advanced RAG Techniques: Evaluate and Iterate

Once the basic understanding is in place, evaluating and iterating is the key.

1. Start with LlamaIndex Basic RAG

2. Evaluate with TruLens RAG Triad

3. Iterate with LlamaIndex Sentence Window / Auto-Merging

4. Re-evaluate with TruLens RAG Triad

5. Experiment with related hyperparameters (window size, hierarchy levels, etc.)

Observations

TruLens App Dashboard- Source: Author’s exercise workbook

Curious why a question that is out of context receives a zero answer relevance score, even when the response correctly indicates that it is out of context?

  • It’s beneficial to combine both in-context and out-of-context questions to observe the overall behavior of the RAG Triad metrics (see the sketch after this list).
  • The presenters recommend a minimum of 20 questions for personal projects and 100 for enterprise projects. Please note that high-performance computing resources may be required for these experiments.
  • Exploring how other vector stores integrate into this context could provide interesting insights.
  • It’s worth checking out the DeepEval framework, which offers modular components, unit testing, synthetic data generation, and a hosted platform.
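
As an illustration, a hypothetical mix of questions for my ‘Directing Business’ source might look like:

eval_questions = [
    "What does the book say about aligning teams?",  # in context (hypothetical)
    "Who won the 2022 FIFA World Cup?",              # out of context
]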

Overall, the course — Building and Evaluating Advanced RAG — offers a comprehensive and practical approach to mastering advanced RAG techniques and evaluation methods, significantly contributing to the development of effective and efficient question-answering systems. It provides the right balance of theory and hands-on activities to help us explore more complex experiments.

Related Links:

  1. Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/pdf/2307.03172.pdf
  2. DeepEval: https://github.com/confident-ai/deepeval
  3. Building and Evaluating Advanced RAG-A short course from deeplearning.ai: https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/
