Creating Your Chatbot: Using Your Data with Streamlit and OpenAI 🚀

Basitarif
6 min read · Dec 10, 2023


ChatGPT gives us detailed and impressive answers, but sometimes those answers only sound true, and we believe them without knowing whether they are actually correct. This is called "hallucination": the AI produces answers that seem real but may not be accurate. Something sounds right, yet it isn't entirely so. That is why it is important to verify the information we get from an AI before trusting it completely.

We can reduce hallucination by grounding the model in our own data, which yields more accurate results.

This technique helps companies extract insights from their long documents.

Let's create a chatbot using your own data without delay.

Essential Libraries for Building Your Own Chatbot

from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
import os
from dotenv import load_dotenv
import streamlit as st
import sqlite3
from langchain.document_loaders import PyPDFLoader
import time
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pinecone
from langchain.memory import ConversationBufferMemory
import shutil
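
Before any of these classes can talk to OpenAI, the API key has to be available in the environment. Since os and dotenv are already imported above, here is a minimal sketch of loading the key from a local .env file (OPENAI_API_KEY is the variable name the OpenAI client conventionally reads; adjust it to your setup):

# Read variables from a local .env file (e.g. OPENAI_API_KEY=...)
load_dotenv()

# Fail early if the key is missing instead of erroring deep inside LangChain
if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY not found in the environment")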

Architecture Understanding

Before diving into the code, we first need to understand how the architecture works for getting insights from your own data.

To gain insights from your private data, a systematic process involving five key steps is necessary:

1. Loader

from langchain.document_loaders import PyPDFLoader

# Define the file path (hard-coded here; in the app it can come from the user)
file_path = "path/to/your/file.pdf"

loader = PyPDFLoader(file_path)
documents = loader.load()
  • from langchain.document_loaders import PyPDFLoader: This line imports the PyPDFLoader class, which is designed to handle loading PDF documents for further processing.
  • file_path = "path/to/your/file.pdf": Defines a variable file_path that holds the location of the PDF file. It can be a hard-coded path or a path supplied by the user.
  • loader = PyPDFLoader(file_path): Creates an instance of the PyPDFLoader class, passing file_path as an argument to initialize the loader with the specified PDF file.
  • documents = loader.load(): Invokes the load() method on the loader to read the PDF. The result is a list of documents (one per page), ready for further processing or analysis.
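
To sanity-check the load, you can inspect what came back; each element of documents is one page of the PDF with its text and metadata. A small sketch, assuming the path above points to a real PDF:

# Each element of documents corresponds to one page of the PDF
print(len(documents))                   # number of pages loaded
print(documents[0].page_content[:200])  # first 200 characters of page 1
print(documents[0].metadata)            # e.g. {'source': 'path/to/your/file.pdf', 'page': 0}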

2. Splits

Splitting documents into smaller chunks is important and tricky, as we need to maintain meaningful relationships between the chunks. For example, suppose we have two chunks on Pakistan's agriculture:

chunk 1: on this model. The Pakistan's Agriculture is top notch quality

chunk 2: with excellent rice and flour

Here a naive split left part of the sentence in one chunk and the rest in another. We would not be able to answer a question about Pakistan's agricultural products, because neither chunk holds the complete information on its own. So it is important to split the document into semantically coherent chunks.

   
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_s = RecursiveCharacterTextSplitter(
    chunk_size=1300,
    chunk_overlap=100
)
splits = r_s.split_documents(documents)

Instantiate the RecursiveCharacterTextSplitter class, defining parameters for chunk size and overlap:

  • chunk_size defines the maximum size (in characters) of each chunk the document will be split into.
  • chunk_overlap specifies how many characters consecutive chunks share.

Chunk overlap gives two consecutive chunks a small shared region, which preserves some continuity of context between them.

  • The split_documents() method divides the content loaded from the PDF (documents) into smaller chunks based on the specified parameters (chunk_size and chunk_overlap), as the quick check below illustrates.
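
A quick way to see both parameters in action is to look at the boundary between two consecutive chunks; with chunk_overlap=100, roughly the last 100 characters of one chunk should reappear at the start of the next (roughly, because the splitter also respects separators):

print(len(splits))  # total number of chunks produced

# The tail of chunk 0 should reappear (approximately) at the head of chunk 1
print(splits[0].page_content[-100:])
print(splits[1].page_content[:100])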

If you want in-depth knowledge of how RecursiveCharacterTextSplitter works, check https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

3. Vector Store and Embeddings

We have split our document into small chunks; now we need to put them into an index so that we can retrieve them easily when answering questions about the document. We use embeddings and vector stores for this purpose.

from langchain.vectorstores import Chroma

# The embeddings model that turns each chunk into a numerical vector
embedding = OpenAIEmbeddings()

vectorstore = Chroma.from_documents(
    embedding=embedding,
    documents=splits,
    persist_directory="./Anyname"
)

This Python code uses the Chroma class from langchain.vectorstores to build a vector store. Chroma is a lightweight vector store that runs in memory (RAM) and can also persist to disk. The Chroma.from_documents method takes several parameters:

  • embedding: This parameter represents the method or technique used for converting text data into numerical vectors.
  • documents: Refers to the collection of text or documents that will be transformed into vectors.
  • persist_directory: Specifies the directory where the resulting vector store will be stored or saved for future use.
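
Because persist_directory was set, the index can be written to disk and reloaded in a later session instead of re-embedding the whole document every time. A minimal sketch using the classic LangChain Chroma API (newer Chroma versions persist automatically, so persist() may be unnecessary):

# Write the index to disk so it survives restarts
vectorstore.persist()

# In a later session, reload the saved index instead of rebuilding it
vectorstore = Chroma(
    persist_directory="./Anyname",
    embedding_function=embedding
)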

4. Retrieval

Retrieval is the process of getting the right information out of the document and passing it to the LLM; if the retrieved information is poor, the model will not give great results. Most of the time when question answering fails, the mistake is in the retrieval step. LangChain offers advanced retrieval mechanisms such as self-query and contextual compression. Retrieval matters at query time: when a query comes in, we want to fetch the most relevant splits.

We will not discuss the under-the-hood workings of these retrieval mechanisms; the key insight is that retrieval picks the right pieces (splits) out of the document and passes them to the LLM.

question = "what did they say about regression in the third lecture?"

docs = vectorstore.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
  • vectorstore: Represents a vector database where documents or vectors are stored.
  • similarity_search: This method/function is used to find similar documents or vectors to the given query.
  • question: Refers to the query for which similar documents are being searched.
  • k=3: Specifies that the search should retrieve the top 3 most similar documents or vectors.
  • filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}: This filter parameter restricts the search to documents that match a specific source or location, in this case, limiting the search to the document MachineLearning-Lecture03.pdf under the directory docs/cs229_lectures

docs will contain the three best splits from lecture 3, in which the professor talks about regression.
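
Plain similarity search can return three nearly identical splits. The vector store also supports maximum marginal relevance (MMR) search, which trades a little similarity for more diverse results; a short sketch:

# fetch_k candidates are retrieved first, then the k most diverse are kept
docs_mmr = vectorstore.max_marginal_relevance_search(
    question,
    k=3,
    fetch_k=10
)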

5. Output

After retrieving the relevant information, we feed it, along with a specific question, into an LLM (Large Language Model). The LLM processes the question and the retrieved information to generate an answer grounded in the provided context.

CODE

from langchain.chains import RetrievalQA

# The chat model that will generate the final answer
llm = ChatOpenAI(temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
    llm=llm,
    verbose=True,
    input_key="Question"
)

Explanation:

  • RetrievalQA: A chain that combines a retriever with an LLM to answer questions over documents.
  • from_chain_type: Initializes the question-answering chain with the specified chain type and configuration.
  • chain_type='stuff': "Stuffs" all retrieved chunks into a single prompt that is sent to the LLM.
  • retriever=vectorstore.as_retriever(): Converts the vector store into a retriever, which fetches relevant information based on the input question.
  • llm: The Large Language Model that processes the retrieved information and the question to generate an answer.
  • verbose=True: Makes the chain print detailed information while it runs.
  • input_key="Question": The key under which the input question is passed to the chain.
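
With the chain configured, you can ask a question against your own document. Because input_key is set to "Question", the query is passed under that key, and the answer comes back under the default "result" key (a small usage sketch):

response = qa_chain({"Question": "what did they say about regression in the third lecture?"})
print(response["result"])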

We discussed how to use LangChain to load data from a variety of documents. We also learned to split the documents into chunks. After that, we created embeddings for these chunks and stored them in a vector store. Later, we did a semantic search using this vector store. Then we covered retrieval, where we talked about various retrieval algorithms. Finally, we combined retrieval with LLMs in question answering: we take the retrieved documents and the user's question and pass them to an LLM to generate an answer.
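
Finally, to turn this pipeline into the chatbot promised in the title, you can wrap it in a small Streamlit app. This is only a minimal sketch: the widget labels are placeholders, and it assumes the loader/splitter/vector-store steps above have been run on the uploaded file to build qa_chain.

import streamlit as st

st.title("Chat with your own data")

# Let the user upload a PDF and type a question
uploaded_file = st.file_uploader("Upload a PDF", type="pdf")
question = st.text_input("Ask a question about your document")

if uploaded_file and question:
    # In a full app you would save the upload to disk, run the
    # loader/splitter/vector-store steps above on it, and build qa_chain
    response = qa_chain({"Question": question})
    st.write(response["result"])

Run it with streamlit run app.py and you have a simple web chatbot over your own data.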

PROJECT I WORKED ON
