Basic to Advanced RAG using LlamaIndex, Part 2: Estimating the Optimal Chunk Size
Welcome back to the Basic to Advanced RAG series. We are moving forward with the second installment, where we will find the optimal chunk_size for our Retrieval Augmented Generation (RAG) application.
First, let's discuss why chunk_size matters in the whole process, and then proceed with the implementation.
The Importance of Chunk Size
The optimal chunk_size is a critical parameter that affects both the efficiency and the accuracy of our RAG application.
Imagine you are building a RAG application over an app guide (user manual) containing information about a particular app's features. Since the information per topic or per feature will be small, because user manuals are meant to be short and precise, using a larger chunk size is counterproductive: we would end up feeding irrelevant information to the LLM. Such an application should also respond to user queries quickly, and as the chunk_size increases, so does the volume of information fed to the LLM to generate an answer, which can slow down the system.
But let's say you are building a RAG application on top of research papers. For scientific papers, a short chunk_size would fail because important information might be missing, and with incomplete context the LLM won't be able to respond accurately. In this application, response time matters less; the correctness and relevance of the answer matter most.
So the chunk_size parameter isn't fixed but application-dependent: a balance needs to be found such that response time is maintained while vital information is still captured.
If you are new to the concept of RAG, please consider going through this theoretical blog to understand the logic behind using RAG.
You can explore the other part of this series below:
Now let’s proceed to practical implementation for the evaluation of optimal chunk size.
But before that, here is a quick roadmap of how we will evaluate the optimal chunk_size:
1. Construct a vanilla RAG pipeline and keep every parameter (embedding model, LLM, vector database) constant, with only chunk_size as the variable.
2. Use question-answer pairs from the source document to evaluate the RAG pipeline against a set of metrics, which we will cover shortly. If question-answer pairs are not available, don't worry: they can be generated with an LLM (indeed a lifesaver sometimes).
3. Experiment with different chunk_size values, systematically varying them to observe the impact on performance, and measure the effectiveness of each configuration using the defined evaluation metrics.
Great! Now that we have a map ready, let's dive into the code implementation.
Set up the libraries
Start by installing the necessary libraries to handle LLM-based evaluation tasks, such as llama-index, spacy, and others. These libraries facilitate the loading, processing, and analysis of large text datasets.
!pip install llama-index
!pip install llama-index-llms-ollama
!pip install llama-index-embeddings-huggingface
!pip install spacy
Import Libraries
After installing the required packages, import essential libraries and modules like SimpleDirectoryReader, VectorStoreIndex, DatasetGenerator, and the evaluation tools. This step sets up your environment for handling and processing large text documents effectively.
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

import os
import time

import nest_asyncio
nest_asyncio.apply()
In this blog, we've leveraged the power of Llama 3.1 with 8B parameters, served locally through the Ollama platform, as our LLM. For more in-depth information about Llama 3.1, be sure to check out the detailed blog linked below.
For embeddings, the bge-small-en-v1.5 model is used throughout, but feel free to experiment with other options from the list below.
llm = Ollama(model="llama3.1")
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Using Settings, you make the LLM and the embedding model global defaults for the pipeline.
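Settings can also hold chunking defaults. Here is a minimal sketch (the values below are only illustrative; we will override chunk_size per experiment later on):
# Optional: chunking defaults can also be set globally via Settings.
# These values are illustrative; chunk_size is varied per experiment below.
Settings.chunk_size = 512      # target tokens per chunk when building the index
Settings.chunk_overlap = 20    # overlap between consecutive chunks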
Load Data
Load your dataset using SimpleDirectoryReader, targeting specific files for evaluation. In this case, the code focuses on a PDF of the Gemma research paper. This data forms the basis for generating evaluation questions and performing assessments.
documents = SimpleDirectoryReader(input_files=["blogs/basic_to_advanced_rag/gemma.pdf"]).load_data()
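A quick, optional sanity check confirms how many page-level documents were loaded and what metadata came along with them:
# Optional sanity check: with the default PDF reader, each page becomes one Document.
print(len(documents))
print(documents[0].metadata)  # e.g. file name and page label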
Generate Questions
Generate a set of evaluation questions based on the loaded data. This step focuses on a subset of the document (the first three pages, but you can extend the same process to the whole PDF), which helps in creating targeted questions that will be used to assess model responses.
eval_documents = documents[:3]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes()
Now that the questions have been generated, let's have a look at them:
eval_questions
['What is the significance of the Gemma models in relation to Gemini models?',
'Describe the two sizes of Gemma models and their intended applications.',
'What are the key features of the training process used for Gemma models?',
'How does Gemma advance the state-of-the-art performance in text-based tasks?',
'Explain the importance of releasing both pre-trained and fine-tuned Gemma checkpoints.',
'What are the domains in which Gemma models have demonstrated strong performance?',
'How does Gemma build upon previous work in sequence models and transformers?',
"Discuss the role of Google's open models and ecosystems in the development of Gemma.",
'Why is the responsible release of LLMs considered critical by the authors?',
'What are the potential benefits of enabling rigorous evaluation and analysis of current LLM techniques?',
'What is the significance of the "page_label" field in the context information?',
'What is the file type and size of the document?',
'When was the document created and last modified?',
'What is the main topic of the document?',
'What is the difference between the "Question Answering" and "Reasoning" capabilities of the Gemma model?',
'What is the "context length" used during training of the Gemma model?',
'What improvements were made to the transformer decoder architecture in the Gemma model?',
'What is the difference between "multi-head attention" and "multi-query attention"?',
'What is the purpose of using "RoPE Embeddings" in the Gemma model?',
'What is the significance of the large vocabulary size in the Gemma models?',
'Explain the role of RMSNorm in the Gemma model architecture.',
'Describe the training infrastructure used for the Gemma models, including the number of TPUv5e and the data replication strategy.',
"How does the 'single controller' programming paradigm simplify the training process for Gemma?",
'What is the estimated carbon footprint of pretraining the Gemma models?',
'How is the pre-training dataset for Gemma filtered to mitigate risks?',
'What is the purpose of the SentencePiece tokenizer used in Gemma?',
'How does the Gemma tokenizer handle unknown tokens?',
'What is the rationale behind filtering evaluation sets from the pre-training data mixture?',
'Explain the process of determining the final data mixture for Gemma.']
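Running every generated question through the evaluators can take a while with a local model. As an optional shortcut (the cutoff of 20 is arbitrary), you can evaluate on a subset:
# Optional: keep only the first 20 questions to speed up experimentation.
# Use the full list for a more robust estimate of the optimal chunk size.
eval_questions = eval_questions[:20]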
Initializing Evaluators
Set up evaluation tools like FaithfulnessEvaluator and RelevancyEvaluator to measure the accuracy and relevance of responses generated with different chunk sizes. These metrics will help determine the optimal chunk size for processing the dataset.
Faithfulness: Evaluate whether the response from a query engine aligns with any source nodes. This helps determine if the response is accurate or if it was fabricated.
Relevancy: Assess whether the response and source nodes align with the query. This is useful for determining if the response effectively answers the query.
faithfulness = FaithfulnessEvaluator()
relevancy = RelevancyEvaluator()
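Each evaluator returns an EvaluationResult whose passing flag we aggregate later. As a quick illustrative check on a single question (the index 0 and the default index settings are arbitrary choices):
# Illustrative single-question check using an index built with default settings.
sample_index = VectorStoreIndex.from_documents(eval_documents)
sample_engine = sample_index.as_query_engine()
sample_response = sample_engine.query(eval_questions[0])

result = faithfulness.evaluate_response(response=sample_response)
print(result.passing)   # True if the answer is supported by the retrieved context
print(result.feedback)  # the evaluator LLM's explanation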
Evaluating Responses by Chunk Size
Define a function to evaluate response time, faithfulness, and relevancy for a given chunk size. The function applies the chunk size, rebuilds the index, queries the model with each evaluation question, and computes the average metrics to assess performance.
def evaluate(chunk_size, eval_questions):
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # Apply the chunk size under test; the index is rebuilt for every run.
    Settings.chunk_size = chunk_size
    vector_index = VectorStoreIndex.from_documents(eval_documents)
    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    for question in eval_questions:
        # Time the query to capture response latency.
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness.evaluate_response(
            response=response_vector
        ).passing
        relevancy_result = relevancy.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
Chunk Size-Based Testing
Run tests with different chunk sizes to identify the optimal configuration for response time, faithfulness, and relevancy. The code iterates over multiple chunk sizes, providing insights into how chunk size impacts the model’s performance and accuracy.
chunk_sizes = [128, 256, 512, 1024, 2048]
for chunk_size in chunk_sizes:
avg_response_time, avg_faithfulness, avg_relevancy = evaluate(chunk_size, eval_questions)
print(f"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")
Conclusion
The analysis of various chunk sizes shows that chunk size 1024 strikes the best balance in model performance. Larger chunk sizes also enhance faithfulness, reducing the likelihood of generating inaccurate or hallucinated responses, with the highest faithfulness scores at 1024 and 2048. Relevancy remains consistently high across different chunk sizes, peaking at 128 and 1024, though it slightly decreases at 256, 512, and 2048. Overall, chunk size 1024 offers the optimal combination of efficiency, accuracy, and relevance, making it the ideal choice for processing queries.
While chunk size 1024 seems to offer a great balance of speed, accuracy, and relevance, there’s no “one size fits all” solution. The optimal chunk size can vary depending on your specific use case. So, buckle up and run your own evaluation to discover the chunk size that best suits your needs. Finding that sweet spot could make all the difference in maximizing your model’s performance!
If you want to have a look at the latest Meta Llama 3.1 models and their features, consider going through the blog below.
References:
https://docs.llamaindex.ai/en/stable/examples/evaluation/relevancy_eval/
https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval/