Basic to Advanced RAG using LlamaIndex ~1

Abhishek Selokar
7 min read · Jun 11, 2024


Welcome to “Basic to Advanced RAG using LlamaIndex ~1”, the first installment in a comprehensive blog series dedicated to exploring Retrieval-Augmented Generation (RAG) with LlamaIndex. In this series, we will start with the fundamentals of vanilla RAG, setting a solid foundation for understanding how this approach combines retrieval and generation to enhance question-answering tasks. As the series progresses, we will delve into more advanced topics, uncovering sophisticated techniques and best practices to get the most out of RAG with LlamaIndex. Whether you’re a beginner or looking to deepen your knowledge, this series will provide practical guidance to help you master RAG. Let’s get started with the basics and build our way up to advanced applications!

If you want to know the reasoning behind how RAG works, you can follow the blogs below.

Now let’s get started and get our hands dirty with the foundational concepts of vanilla RAG. In this first post, we’ll explore how to set up and implement basic RAG using LlamaIndex, preparing you for the more advanced techniques to come.

Installing required libraries

pip install llama-index
pip install llama-index-embeddings-huggingface
pip install llama-index-llms-gemini
pip install -q llama-index google-generativeai

Set up an LLM and embedding model

You first need to select and define which LLM and embedding model will be used in our RAG pipeline.

LLMs are used in different stages:

Indexing: Evaluate data relevance or summarize raw data for efficient indexing.
Querying: During retrieval, LLMs choose the best sources or tools to find information. In response synthesis, they merge sub-query answers into a coherent response or convert text into formats like JSON.

The Settings object bundles commonly used resources (transformations, LLMs, embedding models) and serves as a global default configuration for the whole pipeline.

Here we will use the Gemini model as the LLM and the “BAAI/bge-small-en-v1.5” embedding model from Hugging Face.

Follow the blog below to get the API key for using the Gemini model.

import os

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import Settings



GOOGLE_API_KEY = "<YOUR_API_KEY>"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

# Setting global parameter
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") # set the embedding model
Settings.llm = Gemini(model_name="models/gemini-pro")

Ingestion

Before applying the chosen LLM to the data, you first need to load and process it. We use SimpleDirectoryReader, the most commonly used data connector; it picks the best file reader based on each file's extension, so you don't need to specify a particular reader explicitly.

Here I'm using the Llama 2 research paper. You can choose any file or document you want to chat with.

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=['llama2.pdf']).load_data()
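
If your sources live in a folder rather than a single file, the same reader can scan an entire directory. Here is a minimal sketch, assuming a hypothetical data/ folder containing PDFs:

from llama_index.core import SimpleDirectoryReader

# "data/" is a placeholder; point it at your own folder of documents
documents = SimpleDirectoryReader(
    input_dir="data/",
    required_exts=[".pdf"],  # only pick up PDF files
    recursive=True,          # also scan sub-directories
).load_data()

print(f"Loaded {len(documents)} document objects")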

Transformations

Transformations turn loaded documents into nodes and consist of various components, such as text splitters, node parsers, metadata extractors, and embedding models.

Here I'm using only a text splitter, but moving forward in this series we will explore other components too.

The SentenceSplitter attempts to split text while respecting the boundaries of sentences. — Source

from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# global
from llama_index.core import Settings

Settings.text_splitter = text_splitter
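
To sanity-check the chunking before building an index, you can run the splitter directly over the loaded documents. This is just an optional inspection step, and the exact counts will depend on your file:

# inspect the chunks (nodes) produced by the splitter
nodes = text_splitter.get_nodes_from_documents(documents)

print(f"Number of chunks: {len(nodes)}")
print(nodes[0].get_content()[:200])  # peek at the first chunk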

Indexing

An Index is a data structure that allows us to quickly retrieve relevant context for a user query. — Source

There are several index types to choose from, and each can behave differently when queried. In this blog we will use the VectorStoreIndex.

VectorStoreIndex: It stores the nodes (basically chunks of text from the document) and their corresponding embeddings in a vector store. The nodes most similar to the query are retrieved to generate a response. The k most similar chunks are returned, and this number can be controlled via the similarity_top_k parameter; for this reason, this kind of search is often referred to as “top-k semantic retrieval”.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, transformations=[text_splitter])
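
To see the “top-k semantic retrieval” behaviour in isolation, you can pull a retriever straight off the index and inspect what comes back. This is a minimal sketch; the sample query and the k value are purely illustrative.

# fetch the k most similar chunks for a sample query
preview_retriever = index.as_retriever(similarity_top_k=3)
results = preview_retriever.retrieve("What is Llama 2?")

for node_with_score in results:
    print(node_with_score.score, node_with_score.node.get_content()[:120])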

Storing

Creating embeddings for the chunks of a large file, or of many files, can be expensive and time-consuming. It is better to create those embeddings once and save them somewhere so they can be reloaded later. The simplest way to save the indexed data is the built-in .persist() method, which writes all the data to disk at the specified location so it can be loaded back quickly later on.

index.storage_context.persist(persist_dir="/blogs")

This saves the indexed data into the specified directory. Later, instead of re-indexing from scratch, you can reload the persisted index like this:

from llama_index.core import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="/blogs")

# load index
index = load_index_from_storage(storage_context)
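
In practice you usually want to build the index only when no persisted copy exists, and load it otherwise. A common pattern, sketched here with the same /blogs directory:

import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "/blogs"

if os.path.exists(PERSIST_DIR):
    # reuse the saved embeddings instead of re-indexing
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # first run: build the index and persist it for next time
    documents = SimpleDirectoryReader(input_files=["llama2.pdf"]).load_data()
    index = VectorStoreIndex.from_documents(documents, transformations=[text_splitter])
    index.storage_context.persist(persist_dir=PERSIST_DIR)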

Querying

Once the data is loaded, the index created, and everything stored, then comes the most significant part: querying.

Let’s define the prompt for our RAG application to guide the LLM effectively. Prompting is crucial when working with LLMs; the structure and clarity of the prompt significantly impact the relevance of the response. Since we have taken the Llama 2 research paper as the base source for our RAG application, here’s how to craft a well-defined prompt to ensure we get accurate and useful answers from the LLM:


template = """
You are a knowledgeable and precise assistant specialized in question-answering tasks,
particularly from academic and research-based sources.
Your goal is to provide accurate, concise, and contextually relevant answers based on the given information.

Instructions:

Comprehension and Accuracy: Carefully read and comprehend the provided context from the research paper to ensure accuracy in your response.
Conciseness: Deliver the answer in no more than three sentences, ensuring it is concise and directly addresses the question.
Truthfulness: If the context does not provide enough information to answer the question, clearly state, "I don't know."
Contextual Relevance: Ensure your answer is well-supported by the retrieved context and does not include any information beyond what is provided.

Remember if no context is provided please say you don't know the answer
Here is the question and context for you to work with:

\nQuestion: {question} \nContext: {context} \nAnswer:"""


from llama_index.core.prompts import PromptTemplate

prompt_tmpl = PromptTemplate(
    template=template,
    template_var_mappings={"query_str": "question", "context_str": "context"},
)
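
The template_var_mappings argument maps the query_str and context_str variables that LlamaIndex fills internally onto our {question} and {context} placeholders. As a quick sanity check, you can render the template with dummy values:

# render the template with placeholder values to verify the mapping
print(
    prompt_tmpl.format(
        query_str="What is Llama 2?",
        context_str="Llama 2 is a family of pretrained and fine-tuned LLMs.",
    )
)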

You can tailor the prompt to the topic of the file you are chatting with to get a better, more relevant response from the LLM.

from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine



# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer()

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# use our custom prompt for answer generation
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": prompt_tmpl}
)

The ResponseSynthesizer generates a response from the LLM, using the user query and a given set of text chunks.

VectorIndexRetriever retrieves the most similar text chunks from the indexed data based on the user query.

RetrieverQueryEngine ties these together: it takes the retrieved text chunks and the pre-defined prompt and generates the response.
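
For reference, the same engine can also be assembled more compactly by letting the index wire up the retriever and response synthesizer itself; this sketch should behave the same as the explicit setup above.

# compact alternative: defaults for retriever + synthesizer, then swap in our prompt
compact_query_engine = index.as_query_engine(similarity_top_k=10)
compact_query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": prompt_tmpl}
)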

## Input
response = query_engine.query("What are differet variants of LLama?")
print(response)

## Output
# According to Table A.2 in the provided context, LLaMA comes in a range of parameter sizes—7B, 13B, and 70B—as well as pretrained and fine-tuned variations.

## Input
response = query_engine.query("What are the hyperparameters used for training the model?")
print(response)


## Output

# The hyperparameters used for training the model are:

#* AdamW optimizer with β1=0.9, β2=0.95, and eps=10^(-5)
#* Cosine learning rate schedule with warmup of 2000 steps and decay to 10% of the peak learning rate
#* Weight decay of 0.1 and gradient clipping of 1.0

## Input
response = query_engine.query("Can you please comment on the Carbon Footprint of Pretraining?")
print(response)


## Output

# The pretraining process for Llama 2 utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). This resulted in estimated total emissions of 539 tCO2eq, which were directly offset by Meta's sustainability program.

Putting it all together

Tadahhhh…!!!! We have now completed vanilla RAG, and moving forward we will touch upon advancements to the current pipeline in subsequent posts of this series.

Please find the second installment of this series, “Basic to Advanced RAG using LlamaIndex (Estimating optimal chunk size) ~ 2”, which focuses on estimating the right chunk size for your use case:

I just wanted to share a closing remark,

A wise man once said, “To earn more you should learn more.”

So keep learning and keep exploring.
