Creating a Chatbot Using an Open-Source LLM and RAG Technology with LangChain and Flask

Devvrat Rana
12 min read · Apr 5, 2024


Image Credit to https://www.junia.ai/tools/blog-images

Introduction to Generative AI

In the realm of artificial intelligence (AI), generative AI stands out as a revolutionary concept with the potential to reshape numerous industries and transform how we interact with technology. At its core, generative AI refers to algorithms and models capable of creating new content, whether text, images, music, or even entire virtual environments, with minimal direct human input. This introduction delves into the fundamental principles underlying this groundbreaking technology.

Generative AI algorithms, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Transformers, operate by learning patterns and structures from vast datasets, enabling them to generate novel outputs that mimic the characteristics of the training data. These algorithms have demonstrated remarkable capabilities in a wide range of applications, from generating lifelike images and synthesizing realistic speech to composing music and crafting immersive narratives.

It is also worth considering the ethical implications and societal impacts associated with the proliferation of generative AI. As these systems become increasingly proficient at emulating human creativity, questions arise regarding intellectual property rights, authenticity, and the potential for misuse, underscoring the importance of responsible development and deployment.

Elevating App Development: Leveraging LLM and RAG Systems

In the dynamic landscape of app development, leveraging advanced natural language processing (NLP) techniques can greatly enhance user experiences and functionality. Two prominent approaches that have garnered attention in recent years are fine-tuning large language models (LLMs) and implementing Retrieval-Augmented Generation (RAG) systems.

Fine-Tuning Large Language Models:
Fine-tuning an existing language model, such as GPT (Generative Pre-trained Transformer), offers app developers a powerful tool for tailoring AI capabilities to specific tasks or domains. By fine-tuning an LLM on task-specific data, developers can imbue their apps with advanced text generation, summarization, and language understanding capabilities. Whether it’s generating personalized responses in chat applications, summarizing articles for news aggregation platforms, or aiding in language translation services, fine-tuned LLMs can elevate the user experience by delivering more relevant and contextually accurate outputs.

Retrieval-Augmented Generation Systems:
Incorporating Retrieval-Augmented Generation (RAG) systems into app development introduces a hybrid approach that combines generative and retrieval-based mechanisms. RAG systems augment generative models with access to external knowledge sources, enabling apps to provide more informed, factual, and contextually relevant responses. For instance, in question answering applications, RAG systems can retrieve information from vast knowledge bases to ensure accurate and comprehensive answers. Similarly, in content generation apps, RAG enhances the diversity and accuracy of generated content by integrating retrieved knowledge into the generative process, resulting in richer and more engaging outputs.

Practical Applications:
Both LLM fine-tuning and RAG systems offer a myriad of practical applications across various domains of app development. From intelligent virtual assistants and chatbots that provide personalized responses to educational apps that offer interactive learning experiences, the integration of advanced NLP techniques can significantly enhance user engagement and satisfaction. Moreover, by leveraging these techniques, developers can stay ahead of the curve in delivering innovative solutions that adapt to users’ evolving needs and preferences.

Here are the differences between fine-tuning an LLM and implementing a RAG system:

Fine-Tuning a Large Language Model (LLM):
- Fine-tuning involves training a pre-existing language model on a specific dataset or task to adapt its parameters and optimize its performance for that task.
- LLMs, such as GPT (Generative Pre-trained Transformer) models, are typically trained on large, diverse corpora and exhibit strong generative capabilities across a wide range of tasks.
- Fine-tuning an LLM requires access to task-specific data and involves updating the model’s parameters through gradient descent-based optimization techniques.
- The fine-tuning process focuses on adjusting the model’s weights to minimize a task-specific loss function, thereby tailoring the model’s behavior to the target task.
- Fine-tuning LLMs is effective for tasks where generative capabilities are essential, such as text generation, summarization, and language understanding.

RAG System:
- A Retrieval-Augmented Generation (RAG) system integrates retrieval-based mechanisms with generative models to enhance context understanding and content generation.
- RAG systems leverage both generative and retrieval components, combining the strengths of each approach to produce more accurate, informative, and contextually relevant outputs.
- Unlike fine-tuning, which primarily involves updating model parameters, implementing a RAG system requires designing and integrating retrieval mechanisms, such as document retrieval and passage ranking algorithms.
- RAG systems access external knowledge sources, such as knowledge graphs or large text corpora, to retrieve relevant information and incorporate it into the generative process.
- The key focus of RAG systems is on leveraging retrieved knowledge to enrich generatively produced content, resulting in outputs that are more diverse, factual, and contextually coherent.
- RAG systems are particularly effective for tasks where access to external knowledge is crucial, such as question answering, content generation with factual accuracy, and dialogue systems requiring contextual understanding.

Build a RAG Application with LangChain

In the RAG process, external data is fetched and then forwarded to the LLM during the generation phase.

LangChain offers a comprehensive suite of components for RAG applications, catering to a spectrum of complexity.

RAG components: Document Loader, Chunking & Transforming, Vector Embedding, Vector Database, and Retriever

Image Credit to Author
1. Document Loaders:

Document loaders are tools that fetch documents from various sources. LangChain offers a number of document loaders. These loaders can fetch different types of documents, such as HTML, PDFs, and code, from various locations, including private S3 buckets and public websites.

Image Credit to Author

Document loaders provide a "load" method for loading data as documents from a configured source. They optionally implement a "lazy load" as well for lazily loading data into memory.
Types of Document Loaders:
1. Text Loader
2. CSV Loader
3. HTML Loader
4. PDF Loader
5. Directory Loader
6. JSON Loader
7. Markdown Loader
8. AzureAIDocumentIntelligenceLoader

Text Loader code example:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./text_file.md")
docs = loader.load()  # returns a list of Document objects
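
The list above covers several loader types beyond plain text. As a further illustration, here is a rough sketch of two of them, the CSV and directory loaders; the file paths below are hypothetical placeholders:

from langchain_community.document_loaders import CSVLoader, DirectoryLoader, TextLoader

# Hypothetical paths for illustration only
csv_loader = CSVLoader(file_path="./data/records.csv")
csv_docs = csv_loader.load()  # one Document per CSV row

# Load every markdown file under ./data, using TextLoader for each file
dir_loader = DirectoryLoader("./data", glob="**/*.md", loader_cls=TextLoader)
dir_docs = dir_loader.load()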

2. Text Splitting (Chunking):

In retrieval, it’s important to fetch only the relevant parts of documents. This involves several steps to prepare the documents. One crucial step is splitting large documents into smaller parts, or chunks. LangChain offers different algorithms to do this and optimizes the process for specific document types like code and markdown.

Image Credit to Author

Text splitter workflow:
1. Break the text into small, semantically meaningful units (typically sentences).
2. Merge these small units into larger chunks until a specified size is reached, as measured by a length function.
3. When a chunk reaches this size, treat it as a separate piece of text, then start a new chunk with some overlap to maintain context between chunks.

We have two options for customizing a text splitter:
a. The method used to split the text.
b. The criteria used to measure the size of each chunk.

Chunking Types:
1. Semantic Chunking
2. Split by Tokens
3. Split by Characters
4. Split Code
5. Recursively split by character
6. Recursively split JSON
7. MarkdownHeaderTextSplitter
8. Split by HTML header
9. Split by HTML section


from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Split the Document objects loaded earlier into smaller chunks
texts = text_splitter.split_documents(docs)
print(texts[0])
print(texts[1])

The parameters of the `RecursiveCharacterTextSplitter` are as follows (a token-based variant of the splitter is sketched after the list):

- chunk_size: This parameter sets the size of each chunk of text that the splitter will create. In this case, it’s set to 100 characters, meaning that the text will be divided into chunks of 100 characters each.

- chunk_overlap: This parameter determines the amount of overlap between adjacent chunks. In other words, it specifies how many characters from the end of one chunk will be included in the beginning of the next chunk. Here, it’s set to 20, meaning that each chunk will overlap with the previous one by 20 characters.

- length_function: This parameter specifies the function used to determine the length of the text. By default, it’s set to the `len` function, which returns the number of characters in a string.

- is_separator_regex: This parameter determines whether the separator characters should be interpreted as regular expression patterns. If set to `False`, the splitter treats the separators as literal strings rather than regular expressions.
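
The chunking types listed earlier also include "Split by Tokens". As a rough sketch (assuming the tiktoken package is installed), the same splitter class can measure chunk sizes in tokens rather than characters via its from_tiktoken_encoder constructor:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes below are counted in tokens, not characters
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100,
    chunk_overlap=20,
)
token_chunks = token_splitter.split_text("Some long text to be split ...")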

3. Text Embedding Models:

Another important aspect of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of text, making it easier to find similar pieces of text quickly.
In LangChain, the base Embeddings class serves as an interface for interacting with text embedding models. As there are various providers of embedding models (such as OpenAI, Cohere, Hugging Face, etc.), this class offers a standardized interface compatible with all of them.

Image Credit to Author

Embeddings generate a vector representation of text, enabling us to conceptualize text within a vector space. This capability facilitates tasks like semantic search, where we seek text pieces with similar characteristics in the vector space.

The base Embeddings class provides two main methods:
one for embedding documents and another for embedding queries. The former accepts multiple texts as input, while the latter handles a single text. The reason for separating these into two methods is that certain embedding providers may employ distinct embedding techniques for documents (to be searched) compared to queries (the search input).

pip install langchain-openai

export OPENAI_API_KEY="..."


from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key="...")

embeddings = embeddings_model.embed_documents(
    [
        "My friends call me World",
        "Hello World!",
    ]
)
len(embeddings), len(embeddings[0])
# (2, 1536)
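
The snippet above uses embed_documents; the companion embed_query method embeds a single search query. A minimal sketch, reusing the embeddings_model defined above:

embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
len(embedded_query)
# a single vector with the same dimensionality as the document embeddings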

4. Vector Stores:

Vector Store databases play a pivotal role in the realm of Retrieval-Augmented Generation (RAG), offering efficient storage and retrieval of embeddings essential for enhancing the contextual relevance and accuracy of generated content. These databases serve as repositories for embeddings, which encode the semantic meaning of text, enabling RAG systems to retrieve and incorporate relevant information into generated outputs.

Image Credit to Author

LangChain integrates with over 50 vector stores, ranging from open-source local databases to cloud-hosted proprietary ones. This allows you to select the most suitable option for your requirements and provides a standard interface that makes switching easy. A brief retrieval sketch follows the feature list below.

Key Features:
- Storage Efficiency: Vector Store databases are designed to efficiently store large volumes of embeddings, optimizing storage space and retrieval speed.
- Fast Retrieval: Leveraging advanced indexing and retrieval algorithms, these databases facilitate quick and efficient retrieval of embeddings, supporting real-time generation of contextually relevant content.
- Scalability: Vector Store databases are scalable, allowing them to accommodate growing datasets and support high-throughput retrieval requests, making them suitable for large-scale RAG applications.
- Integration Flexibility: These databases seamlessly integrate with RAG systems, providing a standardized interface for storing, querying, and retrieving embeddings, thereby simplifying system integration and interoperability.
- Customization Options: Vector Store databases offer customization options, allowing users to configure storage parameters, indexing strategies, and retrieval algorithms to suit specific application requirements and performance goals.
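
As a minimal sketch of how a vector store is used in practice, the FAISS store below (assuming the texts and embeddings objects from the earlier snippets, and the faiss-cpu package installed) indexes the chunks and returns the most similar ones for a hypothetical query:

from langchain_community.vectorstores import FAISS

db = FAISS.from_texts(texts, embeddings)
docs = db.similarity_search("What is RAG?", k=3)  # top-3 most similar chunks
print(docs[0].page_content)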

5. Retrievers:

Retrievers constitute a foundational component within Retrieval-Augmented Generation (RAG) systems, facilitating the efficient retrieval of relevant information from large knowledge bases to augment the generative process.

In the RAG framework, Retrievers serve as the bridge between the generative model and external knowledge sources, enabling the system to access and incorporate diverse information for contextually enriched content generation.

Image Credit to Author

Key Functions:
- Information Retrieval: Retrievers are responsible for querying and retrieving relevant documents or passages from knowledge bases based on user prompts or input queries. This process involves matching the query with indexed documents and selecting the most pertinent information for generation.
- Contextual Enrichment: By retrieving contextual information from external sources, Retrievers enhance the generative capabilities of RAG systems, ensuring that generated content is informed by a diverse range of sources and remains contextually relevant.
- Optimization and Performance: Retrievers employ advanced algorithms and indexing techniques to optimize retrieval efficiency and performance, enabling rapid access to large volumes of data while maintaining high accuracy and relevance.
- Adaptability and Customization: RAG systems often incorporate diverse retrieval algorithms and strategies to accommodate different use cases and domain-specific requirements. Retrievers can be customized and fine-tuned to prioritize certain types of information or sources based on application needs.
- Integration with Generative Models: Retrievers seamlessly integrate with generative models within the RAG architecture, providing a continuous feedback loop where retrieved information influences the generation of subsequent content, thus enhancing the overall quality and relevance of outputs.

Image Credit to Author

Build a Chatbot App with an Open-Source LLM and RAG:

Step 1. Document Loader and Transform:
In order to create a retriever, we first need to load our data. Below is the code for loading the document. We use a basic TextLoader here and then split the document content into chunks of 1,000 characters each, with an overlap of 100 characters.

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader(r"C:\chatbot_llms\data\doc_rag.txt")
full_text = loader.load()[0].page_content  # extract the raw text from the loaded Document
#full_text = open(r"C:\chatbot_llms\data\doc_rag.txt", "r").read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(full_text)

Step 2. Embedding and Vector DataBase:
Now, we proceed to convert the text data into embedding format. For this task, we utilize a Hugging Face embedding model to transform the text data into vector format. Subsequently, we store the embeddings in the FAISS vector database. Below is the code snippet for this process:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings()
#db = Chroma.from_texts(texts, embeddings)
db = FAISS.from_texts(texts, embeddings)
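
Optionally, the FAISS index can be persisted so it does not have to be rebuilt on every run. This is a rough sketch; the index folder name is arbitrary, and recent LangChain versions also require allow_dangerous_deserialization=True when reloading:

# Persist the index to a local folder
db.save_local("faiss_index")

# Reload it later with the same embedding model
db = FAISS.load_local("faiss_index", embeddings)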

Step 3. Retriever:
Below is the code snippet demonstrating the utilization of the FAISS vector database as a retriever. With this implementation, our retriever is now fully functional and ready for use.

retriever = db.as_retriever()
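
As a quick sanity check (the query string here is purely illustrative), the retriever can be queried directly to inspect which chunks it would hand to the LLM:

relevant_docs = retriever.get_relevant_documents("What is RAG?")
for d in relevant_docs:
    print(d.page_content[:200])  # preview the first 200 characters of each chunk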

Step 4. LLM and Prompt template:
Now it's time to use the LLM. We employ the Zephyr-7b-beta model for this purpose, and the Hugging Face Hub is used to serve the model for inference.

Note: You need to insert your Hugging Face API token in the code snippet below; please refer to the Hugging Face website to create a new API token:

from langchain_community.llms import HuggingFaceHub
from data.dataprovider import hg_key  # module holding the API token (see the full app below)

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
    huggingfacehub_api_token=hg_key,  # Replace with your actual Hugging Face token
)

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(template)
model = llm
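
To see what the LLM will actually receive, the prompt template can be filled with sample values; the context and question below are hypothetical placeholders:

# Illustration only: inspect the rendered prompt message
messages = prompt.format_messages(
    context="RAG combines retrieval with text generation.",
    question="What does RAG stand for?",
)
print(messages[0].content)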

Step 5: RAG Pipeline with LangChain:
In the preceding sections, we examined the end-to-end workflow of the RAG system from steps 1 to 4. Now, it's time to develop a pipeline for the RAG system using LangChain Expression Language (LCEL).

from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):  # joins retrieved Documents into one context string
    return "\n\n".join([d.page_content for d in docs])

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
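
The chain can then be invoked with the raw question string; RunnablePassthrough forwards it as {question} while the retriever supplies {context}. The question below is just an example:

answer = chain.invoke("What is Retrieval-Augmented Generation?")
print(answer)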

Chatbot powered by LLM and RAG in Flask Framework:
Below is the complete code for the Chatbot application we've developed, leveraging an LLM and RAG with LangChain and the Flask framework.

LLM and RAG based Chatbot Workflow
from flask import Flask, request, render_template
import openai
from data.dataprovider import key, hg_key
from langchain.embeddings import HuggingFaceEmbeddings
#from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.vectorstores import FAISS
from langchain_community.llms import HuggingFaceHub

app = Flask(__name__)

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
    huggingfacehub_api_token=hg_key,  # Replace with your actual Hugging Face token
)

# Define your RAG chatbot function
def chat_with_rag(message):
    # Load the source document and split it into overlapping chunks
    full_text = open(r"C:\chatbot_llms\data\doc_rag.txt", "r").read()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    texts = text_splitter.split_text(full_text)

    # Embed the chunks and index them in a FAISS vector store
    embeddings = HuggingFaceEmbeddings()
    #db = Chroma.from_texts(texts, embeddings)
    db = FAISS.from_texts(texts, embeddings)
    retriever = db.as_retriever()

    template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
    prompt = ChatPromptTemplate.from_template(template)
    model = llm

    # Join the retrieved Documents into a single context string
    def format_docs(docs):
        return "\n\n".join([d.page_content for d in docs])

    # LCEL pipeline: retrieve -> format prompt -> LLM -> parse to string
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )

    return chain.invoke(message)

# Define your Flask routes
@app.route('/')
def home():
    return render_template('bot_1.html')

@app.route('/chat', methods=['POST'])
def chat():
    user_message = request.form['user_input']
    bot_message = chat_with_rag(user_message)
    return {'response': bot_message}

if __name__ == '__main__':
    app.run()
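
Once the app is running (Flask serves on http://127.0.0.1:5000 by default), the /chat endpoint can be exercised with a small client script. This is a sketch assuming the app is running locally with the default settings:

import requests

# Post a sample question to the /chat route using the 'user_input' form field
resp = requests.post(
    "http://127.0.0.1:5000/chat",
    data={"user_input": "What is RAG?"},
)
print(resp.json()["response"])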

Please refer to the GitHub link for the complete code:

Conclusion:
In this blog, we explored the implementation of naive RAG techniques to develop our chatbot. However, we may encounter several limitations inherent to these techniques, particularly when aiming to create an industry-standard chatbot. To address these shortcomings, the utilization of advanced RAG techniques becomes imperative.

In our next blog, "Build a Chatbot with Advance RAG System: with LlamaIndex, OpenSource LLM, Flask and LangChain", we cover these advanced techniques, including pre-retrieval, retrieval, and post-retrieval optimizations, to overcome the limitations associated with naive RAG approaches.

My other blogs:

1. Build a Chatbot with Advance RAG System: with LlamaIndex, OpenSource LLM, Flask and LangChain

2. How to build Chatbot with advance RAG system with by using LlamaIndex and OpenSource LLM with Flask…

3. How to develop a chatbot using the open-source LLM Mistral-7B, Lang Chain Memory, ConversationChain, and Flask.

4. XAI:LIME to Interpret Text Classifier NLP model

References:

1. Retrieval Augmented Generation (RAG) by using Lang Chain Framework

2. Flask Framework for Chatbot development

