Understanding key stages of RAG with ‘SODHPUCHH (सोधपुछ): A query engine built with LlamaIndex’ (Part II)
This article is a continuation of the previous one, where we discussed LlamaIndex and RAG in brief. Today, we will describe each stage of RAG in detail, with code examples.
Stages of RAG
Stage 1: Loading (Ingestion)
In RAG, we want the LLM to act on our data. To do so, we need to process and load the data. ‘Loading data’ simply means obtaining the data from different sources and formatting it into appropriate objects.
How does loading work in LlamaIndex?
- Data connectors, often called ‘Readers’, ingest data from various sources.
- The ingested data is then converted into ‘Document’ objects.
Let’s load some of our own data.
import os

from llama_index.core import SimpleDirectoryReader

# Load law documents
law_docs = {}
law_directory = "data"
for filename in os.listdir(law_directory):
    if filename.endswith(".pdf"):
        # process_name() is a helper (defined elsewhere) that cleans up the file name
        file_title = process_name(os.path.splitext(filename)[0])
        # This path matches the directory where the document's index is persisted in Stage 2
        loaded_data_path = os.path.join("data", file_title)
        f = os.path.join(law_directory, filename)
        law_docs[file_title] = SimpleDirectoryReader(input_files=[f]).load_data()
We took the easy route here, using a data connector called ‘SimpleDirectoryReader’ to create ‘Document’ objects out of every PDF file in the “data” directory.
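To see what a loaded ‘Document’ actually holds, we can inspect its text and metadata. A minimal sketch, assuming law_docs contains a key named "labor_act" (the key name is hypothetical; any key in law_docs works):

# Pick one loaded document and peek inside it
docs = law_docs["labor_act"]      # hypothetical key
print(len(docs))                  # typically one Document per PDF page
print(docs[0].text[:200])         # the raw text of the first page
print(docs[0].metadata)           # e.g. file_name and page_label added by the reader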
Stage 2: Indexing
After the data is loaded, we now have a list of document objects. To ensure that the data can be retrieved and used optimally by the LLM, we need to transform the data. This includes chunking, extracting metadata, and embedding each chunk. These transformed data objects are usually called ‘Nodes’.
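As a minimal sketch of this transformation (the chunk sizes below are illustrative assumptions, not the values used in SODHPUCHH), a node parser turns Documents into Nodes:

from llama_index.core.node_parser import SentenceSplitter

# Split documents into sentence-aware chunks (sizes are illustrative)
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(law_docs["labor_act"])  # hypothetical key
print(nodes[0].text)      # the chunk's text
print(nodes[0].metadata)  # metadata inherited from the source Document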
To query these ‘Document’ or ‘Node’ objects, we need to build an Index. An Index is a data structure built from these objects that enables querying by an LLM, so the application always works with the most relevant data.
There are several types of indexes; the ‘Vector Store Index’ is the most frequently used.
How does a Vector Store Index work?
- The Vector Store Index takes our Documents and splits them up into Nodes.
- It creates vector embeddings of the text of every node, ready to be queried by an LLM.
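The embedding model the index uses can be configured globally. A minimal sketch, assuming an OpenAI API key in the environment and the llama-index-embeddings-openai package installed (SODHPUCHH’s actual embedding model is not shown here):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Every index built after this line embeds node text with this model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(law_docs["labor_act"])  # hypothetical key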
Now let’s index and embed our data.
import os

from llama_index.core import (
    StorageContext,
    SummaryIndex,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.node_parser import SentenceSplitter

# Initialize the node parser
node_parser = SentenceSplitter()

# Initialize a list to collect the nodes from every document
all_nodes = []

# Loop through the law documents
for file_title, docs in law_docs.items():
    # Directory where this document's index is persisted
    persist_dir = os.path.join("data", file_title)
    # Split the documents into sentence-level Nodes
    nodes = node_parser.get_nodes_from_documents(docs)
    all_nodes.extend(nodes)
    # Build and persist the vector index, or load it if it already exists
    if not os.path.exists(persist_dir):
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(persist_dir=persist_dir)
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=persist_dir)
        )
    # Build a summary index over the same nodes
    summary_index = SummaryIndex(nodes)
Stage 3: Storing
By default, the indexed data lives only in memory, which would force us to load and re-index the data on every run. To avoid this, we persist the indexed data to disk.
As seen in the code snippet above, vector_index.storage_context.persist(persist_dir=persist_dir) is a built-in method of the Index that writes its data to disk at persist_dir, which in our case is a subdirectory of the “data” directory. It is the simplest way to store our indexed data.
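Flat files on disk are not the only option: LlamaIndex also integrates with dedicated vector databases. A minimal sketch using Chroma as the vector store, assuming the chromadb and llama-index-vector-stores-chroma packages are installed (the path and collection name are hypothetical):

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Chroma persists the embeddings itself, under ./chroma_db
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("law_docs")  # hypothetical name
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(all_nodes, storage_context=storage_context)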
Once we have loaded our data, indexed it, and saved it for later, we are ready to query. But what does it mean to query?
We will discuss querying and wrap up this series on RAG in the next article.