Creating Your Chatbot: Using Your Data with Streamlit and OpenAI 🚀
In the world of ChatGPT, we get really detailed and impressive answers. But sometimes these answers only seem true, and we may believe them without knowing whether they are actually correct. This is called "hallucination": the AI generates answers that sound plausible but may not be accurate. It is a kind of trick where something sounds true but might not be entirely right. So it is important to verify that the information we get from AI is reliable before trusting it completely.
We can reduce hallucination by grounding the model in our own data, which gives more accurate results.
This technique helps companies extract insights from their long documents.
Let's create a chatbot using your own data without delay.
Essential Libraries for Building Your Own Chatbot
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
import os
from dotenv import load_dotenv
import streamlit as st
import sqlite3
from langchain.document_loaders import PyPDFLoader
import time
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pinecone
from langchain.memory import ConversationBufferMemory
import shutil
Architecture Understanding
Before diving into code, we first need to understand how the architecture works for getting insights from your own data.
Gaining insights from your private data involves a systematic process with five key steps.
1. Loader
from langchain.document_loaders import PyPDFLoader

# Define the file path (or accept a path/link from the user)
file_path = "path/to/your/file.pdf"
load = PyPDFLoader(file_path)
loader = load.load()
- from langchain.document_loaders import PyPDFLoader: imports the PyPDFLoader class, which handles loading PDF documents for further processing.
- file_path = "path/to/your/file.pdf": defines a variable file_path that holds the location of the PDF file. It can be a direct path to the file or a path provided by the user.
- load = PyPDFLoader(file_path): creates an instance of the PyPDFLoader class, passing file_path as an argument to initialize the loader with the specified PDF file.
- loader = load.load(): invokes the load() method on the PyPDFLoader instance to load the contents of the PDF file into the loader variable, making it ready for further processing or analysis.
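For reference, load() returns a list of LangChain Document objects, typically one per page, each carrying page_content and metadata. The stdlib-only stand-in below (with made-up values, not real loader output) shows the shape of what ends up in the loader variable:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for LangChain's Document: the loader returns one per PDF page."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Roughly what loader = load.load() would yield for a two-page PDF
loader = [
    Document(page_content="Text of page 1...", metadata={"source": "file.pdf", "page": 0}),
    Document(page_content="Text of page 2...", metadata={"source": "file.pdf", "page": 1}),
]

print(len(loader))                 # number of pages loaded
print(loader[0].metadata["page"])  # per-page metadata travels with each chunk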
2. Splits
Splitting documents into smaller chunks is important and tricky, as we need to maintain meaningful relationships between the chunks. For example, suppose we have two chunks on Pakistan's agriculture as follows:
chunk 1: on this model. The Pakistan’s Agriculture is top notch quality
chunk 2: with excellent rice and flour
In this case, we did a simple split and ended up with part of the sentence in one chunk and the rest in another. We would not be able to answer a question about Pakistan's agricultural products, because neither chunk holds the complete information. So it is important to split documents into semantically relevant chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_s = RecursiveCharacterTextSplitter(
    chunk_size=1300,
    chunk_overlap=100
)
splits = r_s.split_documents(loader)
- Instantiate the RecursiveCharacterTextSplitter class, defining parameters for chunk size and overlap.
- chunk_size defines the size of each chunk into which the document will be split.
- chunk_overlap specifies the number of overlapping characters between consecutive chunks. This small overlap gives some notion of consistency between two adjacent chunks.
- The split_documents() method divides the content loaded from the PDF (loader) into smaller chunks based on the specified chunk_size and chunk_overlap.
If you want in-depth knowledge of how RecursiveCharacterTextSplitter works, check https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
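To see what chunk_size and chunk_overlap actually do, here is a simplified, stdlib-only sliding-window splitter. This is only a sketch: the real RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences rather than at fixed character positions.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive character-window splitting: each new chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Pakistan's agriculture is top quality with excellent rice and flour."
chunks = sliding_window_split(text, chunk_size=40, chunk_overlap=10)
for c in chunks:
    print(repr(c))
# The last 10 characters of one chunk reappear at the start of the next,
# so a sentence cut mid-way still has context in the neighbouring chunk.
```

With overlap, a question about "rice and flour" can be answered from a chunk that also carries the tail of the preceding sentence, which is exactly the consistency the chunk_overlap parameter buys you.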
3. Vector Store And Embeddings:
We split our document into small chunks, and now we need to put these chunks into an index so that we can retrieve them easily when we want to answer questions about the document. We use embeddings and vector stores for this purpose.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="anyname"
)
This Python code utilizes the Chroma module from langchain.vectorstores to generate a vector store. Chroma is a lightweight vector store that can run in-memory (in RAM). The Chroma.from_documents method takes several parameters:
- embedding: the method or technique used to convert text data into numerical vectors.
- documents: the collection of text or documents that will be transformed into vectors.
- persist_directory: the directory where the resulting vector store will be saved for future use.
4. RETRIEVAL
Retrieval is the process of getting the right information from the document and passing it to the LLM; if the retrieved information is poor, the model will not give great results. Most of the time when question answering fails, it is due to a mistake in the retrieval process. LangChain offers advanced retrieval mechanisms such as self-query and contextual compression. Retrieval matters at query time: when a query comes in, we want to retrieve the most relevant splits.
We will not discuss the under-the-hood workings of these retrieval mechanisms here; in short, retrieval picks the right pieces (splits) from the document and passes them to the LLM.
question = "what did they say about regression in the third lecture?"
docs = vectorstore.similarity_search(
question,
k=3,
filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
- vectorstore: the vector database where documents or vectors are stored.
- similarity_search: finds the documents or vectors most similar to the given query.
- question: the query for which similar documents are being searched.
- k=3: retrieves the top 3 most similar documents or vectors.
- filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}: restricts the search to documents matching a specific source, in this case limiting it to MachineLearning-Lecture03.pdf under the directory docs/cs229_lectures.

docs will contain the three best splits from lecture 3, in which the professor talks about regression.
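Under the hood, a similarity search embeds the question and ranks the stored chunk vectors by closeness to the question vector, commonly via cosine similarity. Here is a stdlib-only sketch with toy 3-dimensional vectors (all values hypothetical, real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return the texts of the k stored entries most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy store: (chunk text, pretend embedding vector)
store = [
    ("Linear regression minimises squared error.", [0.9, 0.1, 0.0]),
    ("Lecture logistics and grading policy.",      [0.0, 0.2, 0.9]),
    ("Regression assumes a linear relationship.",  [0.8, 0.3, 0.1]),
    ("Office hours are on Friday.",                [0.1, 0.1, 0.8]),
]

query = [1.0, 0.2, 0.0]  # pretend embedding of "what did they say about regression?"
docs = top_k(query, store, k=3)
```

The regression chunks point in nearly the same direction as the query vector, so they rank first; the logistics chunks point elsewhere and fall to the bottom.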
5. OUTPUT
After retrieving relevant information from the retrieval process, we then feed this gathered information along with a specific question into an LLM (Language Model) for analysis. The LLM processes the question and the retrieved information to generate an answer based on the context provided by the question.
CODE
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)  # e.g. an OpenAI chat model; adjust as needed
Output = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
    verbose=True,
    input_key="Question"
)
Explanation:
- RetrievalQA: a class used for setting up a question-answering system that involves retrieval-based methods.
- from_chain_type: initializes the question-answering chain based on a specified chain type or configuration.
- chain_type='stuff': specifies the type or configuration of the question-answering chain.
- retriever=vectorstore.as_retriever(): converts vectorstore (the vector database) into a retriever, which is used to fetch relevant information based on the input question.
- llm: the Language Model (LLM) that processes the retrieved information and the input question to generate answers.
- verbose=True: indicates whether the system outputs detailed information during the question-answering process. When set to True, it typically displays more information.
- input_key="Question": the key or identifier for the input question within the data passed to the question-answering chain.
We discussed how to use LangChain to load data from a variety of documents. We also learnt to split the documents into chunks. After that, we created embeddings for these chunks and stored them in a vector store. Later, we did a semantic search using this vector store. Then, we covered retrieval, where we talked about various retrieval algorithms. Finally, we combined retrieval with LLMs for question answering: we take the retrieved documents and the user question and pass them to an LLM to generate an answer to the question we asked.
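The whole flow above can be summarized in a stdlib-only mock of the five steps. Every component here is a hypothetical stand-in: a real app would swap in PyPDFLoader, RecursiveCharacterTextSplitter, OpenAIEmbeddings with Chroma, and ChatOpenAI, and take the question from a Streamlit input widget such as st.text_input.

```python
def load(path):                       # 1. Loader: read raw text (stand-in for PyPDFLoader)
    return "LangChain loads documents. Chroma stores vectors. LLMs answer questions."

def split(text, chunk_size=30):       # 2. Splits: break text into fixed-size chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk):                     # 3. Embeddings: toy vector = letter frequencies
    return [chunk.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def retrieve(question, index, k=1):   # 4. Retrieval: rank chunks by toy dot-product score
    q = embed(question)
    score = lambda v: sum(x * y for x, y in zip(q, v))
    return sorted(index, key=lambda item: score(item[1]), reverse=True)[:k]

def answer(question, context):        # 5. Output: stand-in for the LLM call
    return f"Based on: {context!r}"

text = load("your.pdf")
index = [(chunk, embed(chunk)) for chunk in split(text)]
best = retrieve("Where are vectors stored?", index)[0][0]
print(answer("Where are vectors stored?", best))
```

The design point is the wiring, not the toy scoring: each step's output is exactly the next step's input, which is why a mistake early in the pipeline (a bad split, a weak embedding) degrades everything downstream.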