Basic to Advanced RAG using LlamaIndex, Part 2: Estimating the Optimal Chunk Size
Welcome back to the Basic to Advanced RAG series. We are moving forward with the second installment, where we will find the optimal chunk_size for our Retrieval Augmented Generation (RAG) application.
First, let's discuss why chunk_size matters in the whole process, and then proceed with the implementation.
The Importance of Chunk Size
The optimal chunk_size is a critical parameter that affects both the efficiency and the accuracy of our RAG application.
Imagine you are building a RAG application over an app guide (user manual) containing information about a particular app's features. Since the information per topic or per feature will be small, because user manuals are meant to be short and precise, using a larger chunk size is counterproductive: we would end up feeding irrelevant information to the LLM. Such an application should also respond to user queries quickly, and as the chunk_size increases, so does the volume of information fed to the LLM to generate an answer, which can slow down the system.
But let's say you are building a RAG application on top of research papers. For scientific papers, a short chunk_size would fail because important information might be missing, and with incomplete context the LLM won't be able to respond accurately. In this application, response time matters less; the correctness and relevance of the answer matter most.
So the chunk_size parameter isn't fixed but application-dependent: a balance needs to be found such that response time is maintained while vital information is still captured.
If you are new to the concept of RAG, please consider going through this theoretical blog to understand the logic behind using RAG.
You can explore the other part of this series below:
Now let’s proceed to practical implementation for the evaluation of optimal chunk size.
But before that, here is a quick roadmap of how we will evaluate the optimal chunk_size:
1. Construct a vanilla RAG pipeline and keep every parameter (embedding model, LLM, vector database) constant, with only chunk_size as the variable.
2. Use question-answer pairs from the source document to evaluate the RAG pipeline against a set of metrics, which we will cover shortly. If question-answer pairs are not available, don't worry: they can be generated with an LLM (indeed a lifesaver sometimes).
3. Experiment with different chunk_size values, systematically varying them to observe the impact on performance, and measure the effectiveness of each configuration using the defined evaluation metrics.
Great! Now that we have a map ready, let's dive into the code implementation.
Set up the libraries
Start by installing the necessary libraries to handle LLM-based evaluation tasks, such as llama-index, spacy, and others. These libraries facilitate the loading, processing, and analysis of large text datasets.
!pip install llama-index
!pip install llama-index-llms-ollama
!pip install llama-index-embeddings-huggingface
!pip install spacy
Import Libraries
After installing the required packages, import essential libraries and modules like SimpleDirectoryReader, VectorStoreIndex, DatasetGenerator, and the evaluation tools. This step sets up your environment for handling and processing large text documents effectively.
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings,
)
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

import os
import time

import nest_asyncio
nest_asyncio.apply()
In this blog, we've leveraged the power of Llama 3.1 with 8B parameters, served locally through the Ollama platform, as our LLM. For more in-depth information about Llama 3.1, be sure to check out the detailed blog linked below.
For embeddings, the bge-small-en-v1.5 model is used throughout, but feel free to experiment with other options from the list below.
llm = Ollama(model="llama3.1")
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Using Settings, you make the LLM and the embedding model global defaults for the pipeline.
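Settings can also hold chunking defaults. Here is a minimal sketch (the values below are only illustrative; we will override chunk_size per experiment later on):
# Optional: chunking defaults can also be set globally via Settings.
# These values are illustrative; chunk_size is varied per experiment below.
Settings.chunk_size = 512      # target tokens per chunk when building the index
Settings.chunk_overlap = 20    # overlap between consecutive chunks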
Load Data
Load your dataset using SimpleDirectoryReader, targeting specific files for evaluation. In this case, the code focuses on a PDF of the Gemma research paper. This data forms the basis for generating evaluation questions and performing assessments.
documents = SimpleDirectoryReader(input_files=["blogs/basic_to_advanced_rag/gemma.pdf"]).load_data()
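A quick, optional sanity check confirms how many page-level documents were loaded and what metadata came along with them:
# Optional sanity check: with the default PDF reader, each page becomes one Document.
print(len(documents))
print(documents[0].metadata)  # e.g. file name and page label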
Generate Questions
Generate a set of evaluation questions based on the loaded data. This step focuses on a subset of the document (the first three pages, but you can extend the same process to the whole PDF), which helps in creating targeted questions that will be used to assess model responses.
eval_documents = documents[:3]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes()
Now that the questions have been generated, let's have a look at them:
eval_questions
['What is the significance of the Gemma models in relation to Gemini models?',
'Describe the two sizes of Gemma models and their intended applications.',
'What are the key features of the training process used for Gemma models?',
'How does Gemma advance the state-of-the-art performance in text-based tasks?',
'Explain the importance of releasing both pre-trained and fine-tuned Gemma checkpoints.',
'What are the domains in which Gemma models have demonstrated strong performance?',
'How does Gemma build upon previous work in sequence models and transformers?',
"Discuss the role of Google's open models and ecosystems in the development of Gemma.",
'Why is the responsible release of LLMs considered critical by the authors?',
'What are the potential benefits of enabling rigorous evaluation and analysis of current LLM techniques?',
'What is the significance of the "page_label" field in the context information?',
'What is the file type and size of the document?',
'When was the document created and last modified?',
'What is the main topic of the document?',
'What is the difference between the "Question Answering" and "Reasoning" capabilities of the Gemma model?',
'What is the "context length" used during training of the Gemma model?',
'What improvements were made to the transformer decoder architecture in the Gemma model?',
'What is the difference between "multi-head attention" and "multi-query attention"?',
'What is the purpose of using "RoPE Embeddings" in the Gemma model?',
'What is the significance of the large vocabulary size in the Gemma models?',
'Explain the role of RMSNorm in the Gemma model architecture.',
'Describe the training infrastructure used for the Gemma models, including the number of TPUv5e and the data replication strategy.',
"How does the 'single controller' programming paradigm simplify the training process for Gemma?",
'What is the estimated carbon footprint of pretraining the Gemma models?',
'How is the pre-training dataset for Gemma filtered to mitigate risks?',
'What is the purpose of the SentencePiece tokenizer used in Gemma?',
'How does the Gemma tokenizer handle unknown tokens?',
'What is the rationale behind filtering evaluation sets from the pre-training data mixture?',
'Explain the process of determining the final data mixture for Gemma.']
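Running every generated question through the evaluators can take a while with a local model. As an optional shortcut (the cutoff of 20 is arbitrary), you can evaluate on a subset:
# Optional: keep only the first 20 questions to speed up experimentation.
# Use the full list for a more robust estimate of the optimal chunk size.
eval_questions = eval_questions[:20]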
Initializing Evaluators
Set up evaluation tools like FaithfulnessEvaluator and RelevancyEvaluator to measure the accuracy and relevance of responses generated with different chunk sizes. These metrics will help determine the optimal chunk size for processing the dataset.
Faithfulness: Evaluate whether the response from a query engine aligns with any source nodes. This helps determine if the response is accurate or if it was fabricated.
Relevancy: Assess whether the response and source nodes align with the query. This is useful for determining if the response effectively answers the query.
faithfulness = FaithfulnessEvaluator()
relevancy = RelevancyEvaluator()
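Each evaluator returns an EvaluationResult whose passing flag we aggregate later. As a quick illustrative check on a single question (the index 0 and the default index settings are arbitrary choices):
# Illustrative single-question check using an index built with default settings.
sample_index = VectorStoreIndex.from_documents(eval_documents)
sample_engine = sample_index.as_query_engine()
sample_response = sample_engine.query(eval_questions[0])

result = faithfulness.evaluate_response(response=sample_response)
print(result.passing)   # True if the answer is supported by the retrieved context
print(result.feedback)  # the evaluator LLM's explanation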
Evaluating Responses by Chunk Size
Define a function to evaluate response time, faithfulness, and relevancy for a given chunk size. The function applies the chunk size, rebuilds the index, queries the model with each evaluation question, and computes the average metrics to assess performance.
def evaluate(chunk_size, eval_questions):
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # Apply the chunk size under test; the index is rebuilt for every run.
    Settings.chunk_size = chunk_size
    vector_index = VectorStoreIndex.from_documents(eval_documents)
    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    for question in eval_questions:
        # Time the query to capture response latency.
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness.evaluate_response(
            response=response_vector
        ).passing
        relevancy_result = relevancy.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
Chunk Size-Based Testing
Run tests with different chunk sizes to identify the optimal configuration for response time, faithfulness, and relevancy. The code iterates over multiple chunk sizes, providing insights into how chunk size impacts the model’s performance and accuracy.
chunk_sizes = [128, 256, 512, 1024, 2048]
for chunk_size in chunk_sizes:
avg_response_time, avg_faithfulness, avg_relevancy = evaluate(chunk_size, eval_questions)
print(f"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")
Conclusion
The analysis of various chunk sizes shows that chunk size 1024 strikes the best balance in model performance. Larger chunk sizes also enhance faithfulness, reducing the likelihood of generating inaccurate or hallucinated responses, with the highest faithfulness scores at 1024 and 2048. Relevancy remains consistently high across different chunk sizes, peaking at 128 and 1024, though it slightly decreases at 256, 512, and 2048. Overall, chunk size 1024 offers the optimal combination of efficiency, accuracy, and relevance, making it the ideal choice for processing queries.
While chunk size 1024 seems to offer a great balance of speed, accuracy, and relevance, there’s no “one size fits all” solution. The optimal chunk size can vary depending on your specific use case. So, buckle up and run your own evaluation to discover the chunk size that best suits your needs. Finding that sweet spot could make all the difference in maximizing your model’s performance!
If you want to have a look at the latest Meta Llama 3.1 models and their features, consider going through the blog below.
References:
https://docs.llamaindex.ai/en/stable/examples/evaluation/relevancy_eval/
https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval/