How to Build a Chatbot with an Advanced RAG System Using LlamaIndex and an Open-Source LLM with Flask: Part 2

Devvrat Rana
9 min read · Apr 27, 2024


Image Credit: https://www.freepik.com/

Welcome to our blog, where we delve into the world of natural language processing (NLP) and explore innovative ways to optimize its applications.

In this blog, we focus on enhancing the RAG (Retrieval-Augmented Generation) system, a powerful framework that combines retrieval-based and generation-based approaches to answer complex questions. We'll explore how integrating an auto-merging retriever and a reranker into the RAG system can significantly improve its performance and efficiency.

This is my third blog on RAG applications. In my previous blogs I covered the following topics:

1. In my first blog, I covered the naive RAG architecture for beginners: how to develop a basic chatbot using an open-source LLM and a naive RAG system with the LangChain and Flask frameworks.

2. In my second blog, I used the Sentence Window Retriever to address the limitations of a naive RAG system while developing a chatbot with an advanced RAG system featuring pre-retrieval and post-retrieval optimization.

Further in this blog, I'll be delving into strategies for boosting RAG system performance through both pre-retrieval and post-retrieval optimization techniques. We'll explore the integration of auto-merging retrieval and sentence reranking to elevate the effectiveness of our RAG system.

Pre-retrieval Process:
The first stage prioritizes refining the indexing structure and fine-tuning the initial query. The goal is to enhance the quality of indexed content through various strategies, including refining data granularity, optimizing index structures, incorporating metadata, alignment optimization, and mixed retrieval. Query optimization aims to clarify and customize the user's original query to enhance retrieval quality. In this context, two techniques are employed for pre-retrieval optimization: the Sentence Window Retriever and Auto-merging Retrieval. We already covered the Sentence Window Retriever in our previous blog.
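
As a quick refresher, below is a minimal sketch of the sentence-window parsing covered in the previous blog, shown here for contrast with the auto-merging approach (it assumes the same llama_index version and import paths used throughout this post):

from llama_index.node_parser import SentenceWindowNodeParser

# each sentence becomes a node that carries a 3-sentence window of
# surrounding context in its metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)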

Auto-merging Retrieval:
The AutoMergingRetriever examines a collection of leaf nodes and systematically combines subsets of leaf nodes that reference a parent node beyond a specified threshold. This consolidation process enables the aggregation of smaller, potentially disparate contexts into larger contexts, facilitating synthesis.

How does it function?

1. It divides the document into several text chunks (as seen in the below image, with a chunk size of 512).

2. Additionally, it segments the “parent” chunks into smaller “child” chunks (each with a size of 128).

3. During querying, it initially retrieves smaller chunks based on embedding similarity.

4. If the majority of a parent's child chunks are retrieved based on embedding similarity, the parent chunk is returned instead; otherwise, only the selected child chunks are returned. The merge threshold is configurable, as shown in the sketch after the parser snippet below.

from llama_index.node_parser import HierarchicalNodeParser

# create the hierarchical node parser w/ default settings
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
Image Credit to Author
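
The merge threshold mentioned in step 4 is exposed through the simple_ratio_thresh argument of AutoMergingRetriever. A minimal sketch, assuming automerging_index is the index we build later in this post and showing the default value explicitly:

from llama_index.retrievers import AutoMergingRetriever

# merge child chunks into their parent once more than half of them are retrieved
retriever = AutoMergingRetriever(
    automerging_index.as_retriever(similarity_top_k=12),
    automerging_index.storage_context,
    simple_ratio_thresh=0.5,  # default; raise it to merge less aggressively
    verbose=True,
)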

Post-retrieval Process:
Once relevant context is retrieved, seamless integration with the query becomes crucial. Important techniques at this stage involve reranking chunks and compressing context. Reranking the retrieved information prioritizes the most pertinent content, a process frequently facilitated by frameworks such as LlamaIndex and LangChain.

Reranker model:
A reranking model is a model type designed to output a similarity score when given a query and document pair. This score is utilized to reorder documents based on their relevance to the query.

How it works:

1. A first-stage model, typically an embedding model (the retriever), retrieves a subset of relevant documents from a larger dataset.

2. Subsequently, a second-stage model, the reranker, reorders the documents retrieved by the first-stage model.

This two-stage approach is adopted because retrieving a small subset of documents from a large dataset is considerably faster than reranking a large set of documents.

Re-ranking is essential because the initial retriever in the first stage may have limitations. It could potentially assign high rankings to irrelevant documents while assigning lower scores to some relevant ones. Consequently, not all top-k documents are necessarily relevant, and some relevant documents may not be included in the top-k. The re-ranker serves to refine these outcomes, ensuring that the most relevant answers are elevated.
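
To make the query-document similarity score concrete, here is a small illustrative sketch that calls the same BAAI/bge-reranker-base model directly through the CrossEncoder class from the sentence-transformers package (this standalone usage is shown only for illustration; the LlamaIndex pipeline below wraps the same model for us):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "What is the importance of networking in AI?"
docs = [
    "Networking helps you find mentors, referrals, and job opportunities.",
    "Gradient descent iteratively minimizes a loss function.",
]
# one relevance score per (query, document) pair; higher means more relevant
scores = reranker.predict([(query, doc) for doc in docs])
print(scores)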

from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=6
)

retriever = AutoMergingRetriever(
    automerging_retriever,
    automerging_index.storage_context,
    verbose=True
)

rerank = SentenceTransformerRerank(top_n=3, model="BAAI/bge-reranker-base")

# note: pass the AutoMergingRetriever (not the plain vector retriever)
# so that child chunks are actually merged into parents before reranking
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)
Image Credit to Author

Now, let's build a RAG-based chatbot with the Auto-Merging Retriever + Reranker Transformer.

1. Load Document:
import utils

import os
import openai
openai.api_key = utils.get_openai_api_key()

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

2. Auto-merging retrieval setup:

from llama_index.node_parser import HierarchicalNodeParser

# create the hierarchical node parser w/ default settings
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = node_parser.get_nodes_from_documents([document])

The get_leaf_nodes helper returns only the leaf-level nodes; the full code is below:

from llama_index.node_parser import get_leaf_nodes

leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[30].text)

Output for leaf Node:
But this became less important as numerical linear algebra libraries matured. Deep learning is still an emerging technology, so when you train a neural network and the optimization algorithm struggles to converge, understanding the math behind gradient descent, momentum, and the Adam optimization algorithm will help you make better decisions. Similarly, if your neural network does something funny — say, it makes bad predictions on images of a certain resolution, but not others —understanding the math behind neural network architectures puts you in a better position to figure out what to do. Of course, I also encourage learning driven by curiosity.

The expression leaf_nodes[30].parent_node.node_id gives the ID of the parent node, whose text also contains the text of leaf node 30; the full code is below:

nodes_by_id = {node.node_id: node for node in nodes}

parent_node = nodes_by_id[leaf_nodes[30].parent_node.node_id]
print(parent_node.text)

Output for Parent Node:
PAGE 12Should You
Learn Math to
Get a Job in AI? CHAPTER 3
LEARNING

PAGE 13 Should you Learn Math to Get a Job in AI? CHAPTER 3
Is math a foundational skill for AI? It’s always nice to know more math! But there’s so much to learn that, realistically, it’s necessary to prioritize. Here’s how you might go about strengthening your math background. To figure out what’s important to know, I find it useful to ask what you need to know to make the decisions required for the work you want to do. At DeepLearning.AI, we frequently ask, “What does someone need to know to accomplish their goals?” The goal might be building a machine learning model, architecting a system, or passing a job interview. Understanding the math behind algorithms you use is often helpful, since it enables you to
debug them. But the depth of knowledge that’s useful changes over time. As machine learning techniques mature and become more reliable and turnkey, they require less debugging, and a shallower understanding of the math involved may be sufficient to make them work. For instance, in an earlier era of machine learning, linear algebra libraries for solving linear
systems of equations (for linear regression) were immature. I had to understand how these libraries worked so I could choose among different libraries and avoid numerical roundoff pitfalls.

But this became less important as numerical linear algebra libraries matured. Deep learning is still an emerging technology, so when you train a neural network and the optimization algorithm struggles to converge, understanding the math behind gradient descent, momentum, and the Adam optimization algorithm will help you make better decisions. Similarly, if your neural network does something funny — say, it makes bad predictions on images of a certain resolution, but not others — understanding the math behind neural network architectures puts you in a better position to figure out what to do. Of course, I also encourage learning driven by curiosity.

If something interests you, go ahead and learn it regardless of how useful it might turn out to be! Maybe this will lead to a creative spark or technical breakthrough. How much math do you need to know to be a machine learning engineer?

3. Building the index:

from llama_index.llms import OpenAI
from llama_index import ServiceContext

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

auto_merging_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)

4. Storage Context:

from llama_index import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, service_context=auto_merging_context
)

automerging_index.storage_context.persist(persist_dir="./merging_index")
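
Because the index is persisted, a later run can reload it from ./merging_index instead of re-embedding everything. A minimal sketch, assuming the same auto_merging_context defined above:

from llama_index import StorageContext, load_index_from_storage

automerging_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./merging_index"),
    service_context=auto_merging_context,
)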

5. Defining the Auto Merging + Reranking retriever and running the query engine:

from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=12
)

retriever = AutoMergingRetriever(
    automerging_retriever,
    automerging_index.storage_context,
    verbose=True
)

rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")

# as before, pass the AutoMergingRetriever so merging happens before reranking
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)
auto_merging_response = auto_merging_engine.query(
    "What is the importance of networking in AI?"
)

Final Response: Networking in AI is crucial as it allows individuals to build a strong professional community that can provide valuable information, support, and opportunities. By connecting with others in the field, individuals can receive guidance, referrals to potential employers, and access to mentors or peers who can help them advance in their careers. Additionally, networking helps individuals stay updated on the latest trends and developments in AI, fostering continuous learning and growth within the community.
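
To verify that merging actually happened, it helps to inspect the nodes behind the answer. A small sketch using the response object's source_nodes attribute:

# each source node carries its reranker score and its (possibly merged) text
for node in auto_merging_response.source_nodes:
    print(round(node.score, 3), node.node.text[:120].replace("\n", " "))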

Putting it all together to build a chatbot using the RAG auto-merging and reranking retriever:

Image Credit: Author
from flask import Flask, request, render_template, jsonify


app = Flask(__name__)

import os
from llama_index.core import ServiceContext, VectorStoreIndex, StorageContext
from llama_index.core.indices.postprocessor import SentenceTransformerRerank
from llama_index.core import load_index_from_storage
from llama_index.core.node_parser import HierarchicalNodeParser
#from gpt4all import GPT4All
from langchain.llms import GPT4All


def build_auto_merging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="auto_merging_index",
):
    # create the hierarchical node parser w/ default settings
    node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])

    auto_merging_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )
    # build the index on the first run, then reload it from disk on later runs
    if not os.path.exists(save_dir):
        auto_merging_index = VectorStoreIndex.from_documents(
            documents, service_context=auto_merging_context
        )
        auto_merging_index.storage_context.persist(persist_dir=save_dir)
    else:
        auto_merging_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=auto_merging_context,
        )

    return auto_merging_index


def get_auto_merging_reranker_engine(auto_merging_index, similarity_top_k=6, rerank_top_n=2):

    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )

    auto_merging_engine = auto_merging_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[rerank]
    )
    return auto_merging_engine

from llama_index.llms.openai import OpenAI
from data.dataprovider import key
from llama_index.core import SimpleDirectoryReader
#OpenAI.api_key = key

documents = SimpleDirectoryReader(
    input_files=[r"data/eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

from llama_index.core import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

index = build_auto_merging_index(
    [document],
    #llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1, api_key=key),
    #llm=GPT4All("mistral-7b-openorca.gguf2.Q4_0.gguf"),
    llm=GPT4All(model=r'C:\Users\91941\.cache\gpt4all\orca-mini-3b-gguf2-q4_0.gguf'),  # replace this path with your model path
    save_dir="./auto_merge_index",
)

query_engine = get_auto_merging_reranker_engine(index, similarity_top_k=6)


def chat_bot_rag(query):
    window_response = query_engine.query(query)
    return window_response



# Define your Flask routes
@app.route('/')
def home():
    return render_template('bot_1.html')


@app.route('/chat', methods=['POST'])
def chat():
    user_message = request.form['user_input']
    bot_message = chat_bot_rag(user_message)
    return jsonify({'response': str(bot_message)})


if __name__ == '__main__':
    #app.run()
    app.run(debug=True)
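
Once the app is running, the /chat endpoint can be exercised without the HTML page. A minimal test sketch, assuming the requests package and Flask's default port 5000:

import requests

resp = requests.post(
    "http://127.0.0.1:5000/chat",
    data={"user_input": "What is the importance of networking in AI?"},
)
print(resp.json()["response"])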

GitHub code:

Conclusion:

In this blog, we've undertaken an in-depth exploration of advanced techniques geared towards augmenting the efficacy of RAG models. By integrating the Auto-Merging Retrieval + Sentence Reranker technique, we've taken significant strides in enhancing their capabilities.

However, our exploration doesn’t conclude here. Be sure to stay tuned for our forthcoming blog, where we’ll delve further into the innovative RAG Graph, RAG Agents, and Memorization, providing additional insights into refining and optimizing RAG models for superior performance.

Join us as we persist in uncovering the potential of these cutting-edge methodologies in reshaping the landscape of natural language processing.

References:

https://learn.deeplearning.ai/courses/building-evaluating-advanced-rag/lesson/5/auto-merging-retrieval

https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/

https://medium.com/@ranadevrat/build-a-chatbot-with-advance-rag-system-with-llamaindex-opensource-llm-flask-and-langchain-1bf875be3ec6
