Q & A ChatBot — A Generative AI Application Using Large Language Models
Pankaja Ambalgi, M.S., Ernest Bonat, Ph.D.
1. Overview
Imagine a Q&A application capable of answering your questions in a specific domain or from a knowledge base built from a company’s data in the form of documents, websites, databases, and more. These documents could include business records, patient-doctor conversations, or a company’s proprietary data. To achieve such a fine-tuned Q&A application, we use the Retrieval-Augmented Generation (RAG) technique. With RAG, we can extend the capabilities of Large Language Models (LLMs) so they draw on a knowledge base for context and required data without the need for retraining them. Fine-tuning LLMs with the help of the RAG process is the core concept on which this application is built. In this paper, we show how to fine-tune LLMs on a set of personal documents using the RAG technique.
2. Main Tools Definition
LLMs: Large Language Models (LLMs) are foundational models in Natural Language Processing (NLP) designed to understand, generate, and manipulate human language. These models, such as GPT-4, BERT, T5, and Llama, are trained on vast amounts of text data, enabling them to perform a wide range of tasks including text summarization, translation, sentiment analysis, and question answering. Their ability to capture context, semantics, and nuances in language makes them highly versatile and powerful tools for various NLP applications.
RAG: Retrieval-Augmented Generation (RAG) is a technique used to fine-tune LLMs, especially with proprietary business documents. RAG combines the generative capabilities of LLMs with a retrieval mechanism that fetches relevant information from a database or document corpus. When a question is posed, the model retrieves pertinent documents or snippets and then generates a coherent and accurate answer based on both the retrieved context and its pre-trained knowledge. This approach enhances the model’s accuracy and relevance, making it particularly effective for applications involving specialized or proprietary data. The figure below shows the conceptual flow of using RAG with an LLM.
3. Technical Stack
AWS Bedrock: AWS Bedrock is a managed service that simplifies the use of foundational models in machine learning and artificial intelligence. It provides access to a variety of powerful, pre-trained models from leading AI companies, enabling users to build and scale applications quickly and efficiently. With AWS Bedrock, developers can integrate these models into their workflows, leveraging AWS’s robust infrastructure for tasks such as text generation, summarization, and Q&A.
As a prerequisite, we need access to the LLM models in AWS Bedrock and must have AWS configured locally so that the AWS SDKs can invoke those models. I have given the steps here for setting up the project on your local system.
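Assuming the AWS CLI has already been configured locally with credentials and a region where Bedrock model access has been granted, a Bedrock runtime client can be created with boto3 roughly as follows (the region shown is only an example):

import boto3

# Credentials and region are picked up from the local AWS configuration
# (e.g., from `aws configure` or environment variables); us-east-1 is just an example.
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

This client is reused by the LangChain wrappers shown in the implementation steps below.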
LangChain: LangChain is a framework designed to facilitate the development of applications powered by LLMs. It provides tools for efficiently loading and processing text data, creating embeddings, managing vector stores, and constructing prompt chains. LangChain simplifies the integration of LLMs into workflows, enabling developers to build sophisticated NLP applications, such as Q&A systems, chatbots, and content summarizers, with ease.
Retrievers: A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
Retrieval chain: This chain takes in a user inquiry, which is then passed to the retriever to fetch relevant documents. Those documents (and original inputs) are then passed to an LLM to generate a response.
Faiss: Faiss (Facebook AI Similarity Search) is an efficient library for similarity search and clustering of dense vectors. In NLP applications, Faiss enables fast and scalable searches within large vector databases, making it ideal for tasks like document retrieval, nearest neighbor search, and question-answering systems. By leveraging advanced indexing and search algorithms, Faiss provides high performance and accuracy, facilitating the development of applications that require quick access to relevant information from large datasets.
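For intuition about what Faiss does under the hood, here is a minimal standalone sketch, independent of LangChain and of this application, that indexes a few random vectors and retrieves the nearest neighbors of a query vector (the dimension and data are arbitrary):

import numpy as np
import faiss

dimension = 8  # length of each embedding vector (arbitrary for this sketch)
vectors = np.random.random((100, dimension)).astype("float32")

index = faiss.IndexFlatL2(dimension)  # exact index using L2 (Euclidean) distance
index.add(vectors)                    # add the "document" vectors to the index

query = np.random.random((1, dimension)).astype("float32")
distances, indices = index.search(query, k=3)  # ids and distances of the 3 closest vectors
print(indices, distances)

In the application itself, LangChain’s FAISS vector store wraps this indexing and search logic and pairs it with the Bedrock embedding model.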
Streamlit: Streamlit is an open-source Python library used to create web applications for data science and machine learning projects. It allows developers to quickly turn data scripts into interactive and visually appealing web apps with minimal effort. Key features of Streamlit include simplicity, widgets, real-time updates, integration with Python libraries, and easy deployment. Overall, Streamlit is designed to make it easy for data scientists and machine learning practitioners to create interactive applications and share their work with others.
4. Step-by-Step Guide for Implementation
In this section, we provide a step-by-step guide to building the application using the technical stack mentioned earlier. The following steps outline the process to achieve fine-tuning using the RAG technique. This implementation is versatile and can be scaled to multiple use cases with minimal code adjustments.
1. Load the PDF documents
The first step is to collect the source documents from which the answers will be drawn. In this application, I keep sample PDF documents in the “data” folder and use the “PyPDFDirectoryLoader” class from LangChain to load them.
2. Split the texts
These documents are then split into smaller text chunks using the RecursiveCharacterTextSplitter. This splitting mechanism ensures that the large documents are broken down into manageable pieces for further processing. By dividing the documents into smaller chunks, the application sets the stage for efficient analysis and retrieval of information.
Below is the function that reads the PDF documents from a specified directory (here, the “data” folder) and splits them into manageable chunks using the given chunk_size and chunk_overlap. We typically use chunk_size=10000 and chunk_overlap=1000, but these values can be adjusted depending on the number and size of chunks required for processing.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 10000
chunk_overlap = 1000

def data_ingestion():
    # Load every PDF found in the "data" folder
    loader = PyPDFDirectoryLoader("data")
    documents = loader.load()
    # Character split works better with this PDF data set
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    docs = text_splitter.split_documents(documents)
    return docs
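A quick way to sanity-check the ingestion step is to call the function and inspect the resulting chunks, for example:

docs = data_ingestion()
print(f"Number of chunks: {len(docs)}")
print(docs[0].page_content[:200])  # preview the first 200 characters of the first chunk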
3. Create Embeddings
Once the documents are split, the next step involves creating embeddings for these text chunks. Creating embeddings is crucial in the RAG technique because embeddings transform text data into dense vector representations that capture semantic meaning. These vectors enable efficient and accurate retrieval of relevant information from large datasets. By leveraging embeddings, the RAG model can enhance the generated responses with precise, contextually appropriate information, significantly improving the quality and relevance of the output.
In our application, we leverage the Titan embedding model accessed through the Bedrock service; the code uses the BedrockEmbeddings class to generate embeddings for each chunk. These embeddings capture semantic information about the text, enabling downstream tasks such as similarity search and question answering. The embeddings are then stored in a vector store using the FAISS library. This vector store acts as a repository for the embeddings, facilitating fast and accurate retrieval of relevant text chunks based on user queries.
from langchain_community.vectorstores import FAISS

def get_vector_store(docs):
    # Embed the chunks with the Titan embedding model and index them with FAISS
    vectorstore_faiss = FAISS.from_documents(
        docs,
        bedrock_embeddings
    )
    # Persist the index locally so it can be reloaded later without re-embedding
    vectorstore_faiss.save_local("faiss_index")
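The bedrock_embeddings object referenced above is assumed to be created once at module level with LangChain’s BedrockEmbeddings class; a minimal sketch, using the Titan text embedding model ID amazon.titan-embed-text-v1, could look like this:

import boto3
from langchain_community.embeddings import BedrockEmbeddings

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")  # example region
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1",  # Titan text embedding model on Bedrock
    client=bedrock_client
)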
4. Fine-tuning LLMs
The application offers users the option to fine-tune LLMs for question-answering tasks. It provides several models, such as Mistral, Llama 2, and Titan Lite, which are accessed through the AWS Bedrock service. The code initializes these models using the Bedrock class with specific model IDs and parameters. These models are capable of generating responses to user queries based on the provided context and question.
By fine-tuning these models, the application enhances its ability to provide accurate and informative answers to user queries, thereby improving the user experience.
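The helper get_llama2_llm() used in the Streamlit code later is not shown in the listings; a minimal sketch, assuming LangChain’s Bedrock LLM wrapper and the meta.llama2-70b-chat-v1 model ID, could look like the following (the max_gen_len value is just an example):

from langchain_community.llms import Bedrock

def get_llama2_llm():
    # bedrock_client is the boto3 "bedrock-runtime" client created during setup
    llm = Bedrock(
        model_id="meta.llama2-70b-chat-v1",
        client=bedrock_client,
        model_kwargs={"max_gen_len": 512}  # cap the length of generated responses
    )
    return llm

Helpers for the other models (Mistral, Titan Lite, and so on) follow the same pattern with their respective Bedrock model IDs and parameters.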
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

def get_response_llm(llm, vectorstore_faiss, query):
    # Retrieve the 3 chunks most similar to the query from the FAISS index
    retriever = vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    )
    # Stuff the retrieved documents into the prompt and ask the LLM
    question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
    qa = create_retrieval_chain(retriever, question_answer_chain)
    answer = qa.invoke({"input": query, "question": query})
    return answer['answer']
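The PROMPT object passed to create_stuff_documents_chain is assumed to be a LangChain prompt template that exposes the retrieved documents as {context} and the user question as {question}; a sketch of such a template:

from langchain_core.prompts import PromptTemplate

prompt_template = """
Human: Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know; do not make one up.

{context}

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

The Human/Assistant framing matches the conversational format expected by several Bedrock chat models; the exact wording of the instructions can be tuned to the use case.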
5. User interaction
The core functionality of the application is wrapped within a Streamlit interface, making it intuitive and accessible to users. Users can interact with the application by asking questions related to the PDF documents. They have the flexibility to choose between the LLMs for generating responses to their queries. Additionally, the application provides a convenient option to update the vector store, ensuring that it remains up-to-date with any changes or additions to the document collection.
import streamlit as st
from langchain_community.vectorstores import FAISS

def main():
    st.set_page_config("Chat pdf")
    st.header("Chat with pdf using AWS bedrock about Pankaja")
    user_question = st.text_input("Ask a question from the pdf files")

    # Sidebar: rebuild the FAISS index from the PDF documents on demand
    with st.sidebar:
        st.title("Update or Create Vector Store:")
        if st.button("Vectors Update"):
            with st.spinner("Processing...."):
                docs = data_ingestion()
                get_vector_store(docs)
                st.success("Done")

    # Main area: answer the user's question with the selected LLM
    if st.button("Llama2 Output"):
        with st.spinner("Processing...."):
            # Reload the saved index with the same embedding model used to build it
            faiss_index = FAISS.load_local("faiss_index", bedrock_embeddings, allow_dangerous_deserialization=True)
            llm = get_llama2_llm()
            st.write(get_response_llm(llm, faiss_index, user_question))
            st.success("Done")

if __name__ == "__main__":
    main()
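Assuming the code above lives in a single file such as app.py (the file name is just an example), the interface can be launched locally with the command streamlit run app.py, which opens the chat page in the browser.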
5. Use Cases and Results
Below is a screenshot of the user interface, where the user can input a question and click on the model to generate a response from the selected LLM.
In the screenshot below, we provided the question and selected the Jurassic model to respond.
We tested a few examples by asking specific questions related to the PDF documents used to fine-tune the LLMs. The responses from each LLM are displayed in a tabular format for improved readability.
Example 1
Topic: What is Pankaja’s current location?
Example 2
Topic: What is Pankaja’s highest education?
Example 3
Topic: Where is Pankaja working currently?
From the above examples, it’s evident that different LLMs generate varied responses to the same question. Although the core answers are similar, the presentation style differs depending on the LLM. One notable difference is that Llama3 tends to provide more detailed and elaborative answers compared to Llama2.
Notable Errors
- The Llama3 model assumed the wrong gender of the subject in all of its responses to the above examples.
- In example 3, Titan Lite returned an incorrect spelling of the subject's name.
6. Conclusion
The fusion of RAG with advanced LLMs marks a significant leap in the development of intelligent Q&A systems tailored to specialized data sets. This article has detailed the practical steps and tools — such as AWS Bedrock, LangChain, Faiss, and Streamlit — required to implement a sophisticated and scalable solution capable of delivering precise and contextually accurate responses.
The fine-tuning process showcased here not only enhances the utility of LLMs but also highlights their adaptability across various domains, from business records to medical data. This approach promises to further enhance the precision and adaptability of the models with minimal training data. Additionally, the framework outlined in this article can be easily scaled and adapted to various use cases, making it a versatile tool for a wide range of applications. Looking ahead, future work will focus on quantitatively evaluating model performance and experimenting with different prompt templates to further improve accuracy and effectiveness.