Chatting with Textbooks: The Future of Learning with RAG Chatbots and Streamlit

Tanay Falor · Published in Nybles · Jul 16, 2024

In this article, I will demonstrate how to develop a chatbot capable of conversing with textbooks using RAG (retrieval-augmented generation). The application will be designed to run both locally and via the Hugging Face API. Additionally, I will use Streamlit to build an intuitive user interface (UI) for the project.

For running the LLM on a local machine using a CPU/GPU, we will use LM Studio: https://lmstudio.ai/. Download it from the link and finish setting it up; later in the article, the chatbot will connect to LM Studio's local server.


Some important terminologies:

  1. LLM hallucination: an event in which an LLM produces output that is coherent and grammatically correct but factually incorrect or nonsensical.
  2. Knowledge base: a centralized repository of information that stores data, facts, rules, and relationships about a particular domain or subject.
  3. Vector database: a specialized type of database designed to store, manage, and query high-dimensional vectors efficiently.

What is RAG?

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process.

Because LLMs are not trained on certain data, they can sometimes produce results that are factually incorrect or nonsensical, known as LLM hallucinations. To address this, we will use RAG to construct an external knowledge base from input documents. By comparing user inputs for similarity against this knowledge base, we can retrieve relevant contextual information that may not be covered in the LLM's training.
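To make the flow concrete, here is a minimal sketch of retrieve-then-generate; the retriever and llm objects here are placeholders for the FAISS retriever and the LM Studio / Hugging Face models we build later in this article:

# A minimal sketch of the RAG flow (placeholder objects, not final code)
def answer_with_rag(user_query, retriever, llm):
    # 1. Retrieve passages similar to the query from the knowledge base
    docs = retriever.invoke(user_query)
    context = "\n".join(doc.page_content for doc in docs)
    # 2. Ground the LLM's answer in the retrieved context
    prompt = f"Use this context to answer:\n{context}\n\nQuestion: {user_query}"
    return llm.invoke(prompt)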

How are chatbots built with RAG better than traditional chatbots?

Chatbots built with RAG offer significant improvements over traditional chatbots by combining the strengths of retrieval-based and generative models. Traditional chatbots typically rely on predefined responses or purely generative models that may struggle with factual accuracy and consistency, leading to LLM hallucinations. In contrast, RAG-based chatbots can retrieve relevant information from extensive datasets and generate responses that are both accurate and contextually relevant. This hybrid approach allows RAG chatbots to handle a wider range of queries with higher precision, provide more informative and dynamic interactions, and better understand and respond to user needs. This makes RAG-based chatbots particularly effective for applications requiring detailed, factual, and contextually nuanced responses.

Now let’s look at the use cases for creating such an application, and then walk through the code base.

  • Personal Learning Assistant: Users can locally load their own documents and engage in interactive chats, facilitating better understanding and comprehension.
  • Educational Institutions: A SaaS platform can be developed for colleges and universities. This platform enables efficient reading and studying from vast collections of textbooks and documents, significantly reducing time spent on research.
  • Customer Support: Businesses can integrate the technology to provide instant and accurate responses to customer queries, improving customer satisfaction and response times.

Before delving into the code, let’s first examine the model architecture we will be constructing. This blueprint will serve as our guide throughout the development process.

Architecture of the chatbot

Here’s how it will function: text will be extracted from the input textbooks or documents and compiled into a new PDF document. This PDF will then be used to create an external knowledge base using the FAISS vector database. The database will act as a retriever, providing relevant context for user queries. By combining this context with the user’s input, the LLM (running locally via LM Studio or remotely via the Hugging Face API) will generate and present its output.

Let’s start coding!

Extracting text from textbooks/documents.

Since textbooks contain various elements such as images and tables in addition to text, our focus will be specifically on extracting text objects. This approach ensures that we accurately capture and utilize textual content for further processing and analysis.

Required libraries:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
from fpdf import FPDF

FPDF is used to create and manipulate PDF documents programmatically.

pdfminer is used to extract and analyze text and metadata from existing PDF documents.

## Create a function to extract text from PDF
def text_extraction(element):
    # Extract the text from the in-line text element
    line_text = element.get_text()

    return line_text

We use ‘get_text’ to extract the text contained in an element. The next code snippet shows where these elements come from.

# Iterate over each page in the textbook and append all the extracted text to a txt file.
# file_path points to the input textbook/document PDF.
for pagenum, page in enumerate(extract_pages(file_path)):

    page_content = []

    # Find all the elements on the page
    page_elements = [(element.y1, element) for element in page._objs]
    # Sort the elements top to bottom, as they appear on the page
    page_elements.sort(key=lambda a: a[0], reverse=True)

    # Walk through the elements that compose the page
    for i, component in enumerate(page_elements):
        # Extract the element of the page layout
        element = component[1]

        # Check if the element is a text element
        if isinstance(element, LTTextContainer):
            # Use the function to extract the text of each text element
            line_text = text_extraction(element)
            page_content.append(line_text)

    # Join each string in page_content
    result = ''.join(page_content)

    # Open the file in append mode and write the result,
    # stripping hard line breaks from the extracted text
    with open("path to txt file", 'a', encoding='utf-8') as file:
        file.write(result.replace("\n", ''))

In the snippet above, we iterate through each page of the textbook or document. For each page, we build a list of the elements present on that page, such as text, images, and tables, sorted top to bottom as they appear. We then check each element for instances of ‘LTTextContainer’, which indicates a text element, and extract its text content.

# Converting the txt file to a PDF
pdf = FPDF()
with open("path_to_txt_file", 'r', encoding="utf-8") as f:
    # Start a new page
    pdf.add_page()
    # Set the font and font size
    pdf.set_font('Arial', size=8)
    # Read and write the text in chunks
    chunk_size = 1024 * 1024  # Adjust the chunk size as needed
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # FPDF's built-in fonts support latin-1 only, so replace unsupported characters
        pdf.write(5, chunk.encode('latin-1', 'replace').decode('latin-1'))

# Save the PDF with the same name as the text file
pdf_output_path = "path_to_pdf_file.pdf"
pdf.output(pdf_output_path)

Once a text file of the textbook/document is created, we convert it into a PDF document. We need the PDF file for the next step, where LangChain’s PyPDFLoader builds the knowledge base.

Creating a vector database.

Once we have gathered all the knowledge from the textbook, our next step is to store this data in a format that allows for efficient retrieval of similar information based on user prompts. One effective method for storing this data is through vector embeddings. Vector embeddings are numerical representations of the textual data that capture the semantic meaning, making it easier to find and retrieve relevant information. To manage and utilize these embeddings, we will use a vector database.
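To make this concrete, here is a small sketch of what an embedding looks like, using the same sentence-transformers model we select later in the article:

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vector = embeddings.embed_query("Newton's second law relates force, mass, and acceleration.")
print(len(vector))  # 384: all-MiniLM-L6-v2 maps any text to a 384-dimensional vector

Texts with similar meanings map to nearby vectors, which is exactly what makes similarity search over a vector database possible.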

Among the various vector databases available, such as FAISS, Chroma, and Milvus, we will be using FAISS for the following reasons:

  1. It offers efficient similarity search capabilities.
  2. It is highly optimized for performance on CPUs, ensuring fast and reliable results.

Required libraries:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

We will be using the LangChain library to create our vector database because it offers a comprehensive suite of tools and functionalities, including various vector stores and embedding libraries. LangChain simplifies the integration process by providing all the necessary components in a unified framework, making it easier to develop and deploy our application.

def create_vectordb(file_name):
    # Load PDF files from the specified directory using DirectoryLoader and PyPDFLoader
    loader = DirectoryLoader("path_to_file's_folder", glob=file_name, loader_cls=PyPDFLoader)
    documents = loader.load()

    # Split the loaded documents into smaller chunks suitable for embedding
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1800, chunk_overlap=300)
    texts = text_splitter.split_documents(documents)

    # Select the embedding model to use for creating vector embeddings
    embeddings = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2',
        model_kwargs={'device': 'cpu'}
    )

    # Create the vector database and save it to disk
    db = FAISS.from_documents(texts, embeddings)

    db.save_local("path_of_vector_database")

RecursiveCharacterTextSplitter is designed to divide large text documents into smaller, manageable chunks. This is particularly important for tasks like embedding text into vectors, where smaller chunks can be more effectively processed and stored. By having an overlap between chunks, the splitter helps to preserve context across chunk boundaries.
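For example, here is a quick illustration of that overlap, with deliberately tiny chunk sizes so the effect is easy to see:

from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = "RAG grounds LLM answers in retrieved context. " * 20
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(long_text)

# Consecutive chunks share up to 20 characters, so text cut at a chunk
# boundary still carries some surrounding context into the next chunk.
print(chunks[0][-30:])
print(chunks[1][:30])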

Now we have successfully created and saved a vector database containing our knowledge base in the form of vectors.

Interacting with the Chatbot

In this section, I will finally build the chatbot. The previous two sections covered the prerequisites necessary for this process.

Once the knowledge extraction and database creation are completed, we can proceed to build the chatbot. This chatbot will utilize the context from the knowledge base to answer user questions. As outlined in the title, we will ensure that the chatbot can run both on a local machine and via an API.

Creating a retrieval system: same for both approaches.

embeddings = HuggingFaceEmbeddings(model_name='Embedding_model', model_kwargs={'device': 'cpu'})
db = FAISS.load_local("path_to_saved_vector_database", embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever()
docs = retriever.invoke(user_message)
context = ""
count = 0

# Iterate through the retrieved documents, accumulating their content
for i in range(len(docs)):
    context += docs[i].page_content
    count += 1
    # Consider only the top 2 retrieved contexts
    if count == 2:
        break

With the help of the as_retriever method, we can perform a similarity search against the vector database for user_message and retrieve context. In the snippet above, we keep only the top two retrieved contexts.
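As a side note, the same result can be obtained more directly by asking the retriever for only two documents up front:

# Equivalent, more direct version: limit retrieval to the top 2 matches
retriever = db.as_retriever(search_kwargs={"k": 2})
docs = retriever.invoke(user_message)
context = "".join(doc.page_content for doc in docs)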

  1. Running the chatbot on a local machine

Required libraries:

import openai
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

Connecting to LM Studio

# You will find the base URL in the LM Studio application,
# and it might differ from what is stated below.
# Note: this uses the legacy openai<1.0 interface; newer versions of the
# openai package expose the same settings via OpenAI(base_url=..., api_key=...).
openai.api_base = "http://localhost:1234/v1"
openai.api_key = "not-needed"

completion = openai.ChatCompletion.create(
    model="local-model",  # this field is currently unused
    messages=message_history,
    temperature=0.6,
    stream=True,
)

There are various ways to save message/chat history. The one I used here is:

message_history = []

# User input (retrieved context is prepended to the user's message)
message_history.append({"role": "user", "content": context + "\n" + user_message})

# LLM output (new_message is built in the next snippet)
message_history.append(new_message)

Generating output:

new_message = {"role": "assistant", "content": ""}
for chunk in completion:
    # Each streamed chunk carries an incremental piece of the answer
    delta = chunk.choices[0].get("delta", {})
    if delta.get("content"):
        new_message["content"] += delta["content"]

print(new_message["content"])
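Putting these pieces together, one complete chat turn with the local model might look like this (a sketch that assumes the LM Studio client configured above and the context string from the retrieval step):

# One complete turn: send history plus context, stream the reply, store it
message_history.append({"role": "user", "content": context + "\n" + user_message})

completion = openai.ChatCompletion.create(
    model="local-model",
    messages=message_history,
    temperature=0.6,
    stream=True,
)

new_message = {"role": "assistant", "content": ""}
for chunk in completion:
    delta = chunk.choices[0].get("delta", {})
    if delta.get("content"):
        new_message["content"] += delta["content"]

message_history.append(new_message)
print(new_message["content"])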

2. Let’s build this chatbot using the Hugging Face API.

Required libraries:

from langchain_community.llms import HuggingFaceEndpoint
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

LLMChain: This is the core class that handles interacting with the LLM itself. It takes care of sending prompts and receiving the LLM’s responses.

ConversationBufferMemory: This class helps the system remember past interactions. It stores the conversation history, allowing the LLM to access previous messages and respond more coherently.

PromptTemplate: This lets you define a question or instruction with blanks, which you can then fill with specific information for each conversation, making your prompts more dynamic and adaptable.

Setup:

import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "api_token"
llm = HuggingFaceEndpoint(
    repo_id=selected_model,
    model_kwargs={"max_length": 512},
    temperature=0.5,
    max_new_tokens=216,
)

memory = ConversationBufferMemory()

prompt = PromptTemplate(
    input_variables=["Student"],
    template='''
    You're the teacher. Don't be sarcastic,
    and answer the following question asked by the student: {Student}
    '''
)

conversation = LLMChain(
    llm=llm,
    prompt=prompt,
    memory=memory
)

Output generation:

pred = conversation.invoke(input = user_message)
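conversation.invoke returns a dictionary of the chain's inputs and outputs; by default, LLMChain stores the generated text under the "text" key:

# Pull the generated answer out of the chain's output dictionary
answer = pred["text"]
print(answer)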

Creating a Streamlit interface for our chatbot.

Running applications in the terminal can often be tedious. To make interacting with our chatbot more engaging, let’s create a user-friendly interface using the Streamlit library. Streamlit provides a variety of tools to help build interactive web applications. For more information, visit: Streamlit.

User interface for our chatbot

Required libraries:

import streamlit as st

# Custom functions created in the initial steps
from pdf2text import required_pdf # Function for converting textbooks to a PDF knowledge base
from vectordb import create_vectordb # Function for converting the PDF knowledge base to a vector database

In Streamlit, applications rerun from scratch after every user interaction. This means variables defined within the code are reset with each rerun. Session state variables solve this by providing a way to store data across these reruns, specific to each user session. So we now create a session state variable to store chat memory.

if "message_history" not in st.session_state:
st.session_state.message_history= [
{"role": "system",
"content": '''You are an intelligent assistant.
You always provide well-reasoned answers
that are both correct and helpful.'''},
]

Let’s create a sidebar for selecting documents, creating a vector database, and adding a new chat button.

with st.sidebar:

    # Select textbook/document
    selected_pdf = st.selectbox("Book name: ", [list of documents])
    click1 = st.button(":orange[Create Requirements]")
    if click1:
        required_pdf(f"{selected_pdf}.pdf")
        create_vectordb(f"required_{selected_pdf}.pdf")
    st.divider()

    # New chat
    click2 = st.button(":orange[New Chat]")
    if click2:
        st.session_state.message_history = [
            {"role": "system",
             "content": "You are an intelligent assistant. You always provide well-reasoned answers that are both correct and helpful."},
        ]

Let’s build a chat system for sending and receiving messages.

user_message = st.chat_input("You:", key="user_message")

if user_message:
    # Retrieval
    '''
    Retriever code from step 3
    '''

    st.session_state.message_history.append({"role": "user", "content": context + "\n" + user_message})

    # Generation
    '''
    Set up the LLM for generating output, from step 3
    '''

    # Saving output
    st.session_state.message_history.append(new_message)
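The snippet above handles input and storage but not display. Here is a minimal sketch (not part of the original code) for rendering the stored history with Streamlit's chat elements:

# Render every stored message except the hidden system prompt
for msg in st.session_state.message_history:
    if msg["role"] != "system":
        with st.chat_message(msg["role"]):
            st.write(msg["content"])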

Check out my GitHub for the complete source code!

Conclusion!

In conclusion, building a RAG-based chatbot to interact with textbooks offers a powerful way to enhance learning and information retrieval. By leveraging advanced technologies like vector databases and embedding models, we can create a robust system that provides accurate and contextually relevant responses. Additionally, integrating Streamlit for the user interface ensures that the chatbot is not only functional but also user-friendly and engaging. This approach has the potential to revolutionize how we interact with large volumes of text, making it easier to access and comprehend complex information. I hope this article has provided valuable insights and practical steps for creating your own intelligent chatbot. Thank you for following along, and happy coding!

About the author.

I am an undergrad student pursuing IT at the Indian Institute of Information Technology — Allahabad (IIIT-A), exploring the fascinating world of Machine Learning and Artificial Intelligence. You can connect with me on LinkedIn.
