Chat with your PDF: Using LangChain, FAISS, and OpenAI to Query PDFs
I have recently been immersing myself in LangChain agents, chains, and word embeddings to deepen my understanding of building language-model-driven applications. To demonstrate this, today's blog post walks through one such application. We will explore how to build chat functionality for querying a PDF document using LangChain, Facebook AI Similarity Search (FAISS), and the OpenAI API. The goal is to create a chat interface where users can ask questions related to the PDF content, and the system will provide relevant answers based on the text in the PDF.
LangChain is a Python library that provides various tools and functionalities for natural language processing (NLP) tasks. It offers text-splitting capabilities, embedding generation, and integration with powerful NLP models like OpenAI's GPT-3.5. FAISS, on the other hand, is a library for efficient similarity search and clustering of dense vectors.
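To build some intuition for what similarity search over dense vectors means, here is a minimal, hypothetical sketch in plain Python. It ranks toy "embedding" vectors by cosine similarity to a query vector; the chunk texts and vector values are made up for illustration, and the real app delegates all of this to FAISS and OpenAIEmbeddings:

```python
import math

# Toy "embeddings": one dense vector per text chunk (hypothetical values;
# in the real app these would come from OpenAIEmbeddings).
chunks = ["cats are mammals", "paris is in france", "dogs are loyal pets"]
vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def most_similar(query_vec, vectors, k=2):
    # Rank chunk indices by similarity to the query, best first
    scores = [cosine(query_vec, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]

query = [0.85, 0.2, 0.05]  # pretend embedding of a question about cats
print([chunks[i] for i in most_similar(query, vectors)])
# → ['cats are mammals', 'dogs are loyal pets']
```

FAISS does conceptually the same ranking, but with optimized index structures that stay fast even over millions of high-dimensional vectors.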
In our chat functionality, we will use LangChain to split the PDF text into smaller chunks, convert the chunks into embeddings using OpenAIEmbeddings, and create a knowledge base using FAISS. This knowledge base will let us perform an efficient similarity search when the user asks a question. For question answering, we will use OpenAI's GPT-3.5 model through LangChain's LLM interface.
Setup and Dependencies
Before diving into the code, let's install the necessary dependencies. We'll be using the following libraries:
python-dotenv: For loading the OpenAI API key from a .env file
PyPDF2: For reading PDF files
streamlit: For building the user interface
langchain: For text splitting, embeddings, FAISS integration, and question answering
openai: For accessing the OpenAI GPT-3.5 model
To install these dependencies, you can use pip:
pip install python-dotenv PyPDF2 streamlit langchain openai
Implementing the Chat Functionality
from dotenv import load_dotenv
import os
from PyPDF2 import PdfReader
import streamlit as st
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
# Load environment variables
load_dotenv()
Next, let's define a function called process_text that splits the text into smaller chunks using LangChain's CharacterTextSplitter and converts the chunks into embeddings using OpenAIEmbeddings. We then create a knowledge base using FAISS and return it:
def process_text(text):
    # Split the text into chunks using LangChain's CharacterTextSplitter
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    # Convert the chunks of text into embeddings to form a knowledge base
    embeddings = OpenAIEmbeddings()
    knowledgeBase = FAISS.from_texts(chunks, embeddings)
    return knowledgeBase
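To see what the chunk_size and chunk_overlap parameters do, here is a simplified, hypothetical sketch of overlapping character chunking. It is not CharacterTextSplitter itself (the real splitter also honours the separator, so its boundaries differ), just an illustration of the sliding-window idea:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # Step forward by (chunk_size - chunk_overlap) so consecutive
    # chunks share their last/first `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 3  # 30 characters of dummy text
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(chunks[0])  # the first 10 characters
print(chunks[1])  # starts 6 characters in, repeating the previous chunk's last 4
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps the similarity search later.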
Now, let's define the main function, where we build the user interface using Streamlit. We start by setting the title of the app and adding a file uploader where users can upload their PDF documents. If the user has uploaded a PDF, we read its content using PdfReader from PyPDF2 and store it in the text variable. We then create the knowledge base using the process_text function we defined earlier. If the user has asked a question, we perform a similarity search on the knowledge base to retrieve the chunks most relevant to the user's query.
def main():
    st.title("Chat with your PDF 💬")

    pdf = st.file_uploader('Upload your PDF Document', type='pdf')

    if pdf is not None:
        pdf_reader = PdfReader(pdf)

        # The text variable will store the PDF text
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # Create the knowledge base object
        knowledgeBase = process_text(text)

        query = st.text_input('Ask a question to the PDF')
        cancel_button = st.button('Cancel')

        if cancel_button:
            st.stop()

        if query:
            docs = knowledgeBase.similarity_search(query)
Next, we initialize the LLM, using OpenAI's GPT-3.5 model through LangChain's OpenAI wrapper. We then load the question-answering chain using load_qa_chain from LangChain, specifying the LLM instance and the chain type as 'stuff'. Before running the chain, we define a context manager using get_openai_callback to keep track of the cost incurred by the OpenAI API. Finally, we display the response to the user in the Streamlit app:
            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type='stuff')

            with get_openai_callback() as cost:
                response = chain.run(input_documents=docs, question=query)
                print(cost)

            st.write(response)

if __name__ == "__main__":
    main()
You can now go ahead and run the application (with Streamlit apps, that means saving the code to a file and launching it with streamlit run, e.g. streamlit run app.py). Upload a PDF and ask your question. For now, query the PDF about topics it actually covers. As a challenge, you can modify the code to return a fallback response when your question is not covered in the PDF. Using the PDF in this Github Repository and the query 'What is neural style transfer?', I get the following output:
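One way to approach that challenge: LangChain's FAISS store also offers similarity_search_with_score, which returns (document, score) pairs where the score is an L2 distance (lower means more similar), so you can threshold the scores before running the chain. The helper below is a hypothetical sketch, and the 0.8 threshold is a made-up placeholder you would need to tune on your own data:

```python
def answer_or_fallback(docs_and_scores, distance_threshold=0.8):
    """Return the relevant docs, or None if nothing is close enough.

    docs_and_scores: list of (doc, score) pairs, such as the output of
    knowledgeBase.similarity_search_with_score(query). The score here is
    treated as a distance, so LOWER means more similar. The 0.8 default
    is a hypothetical placeholder, not a recommended value.
    """
    relevant = [doc for doc, score in docs_and_scores if score < distance_threshold]
    return relevant or None

# Hypothetical (doc, score) pairs for illustration:
print(answer_or_fallback([("chunk about style transfer", 0.35)]))  # → ['chunk about style transfer']
print(answer_or_fallback([("unrelated chunk", 1.9)]))              # → None
```

In main, you could then run the chain only when the helper returns documents, and otherwise show something like st.write("Sorry, the PDF doesn't seem to cover that.").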
In conclusion, we have seen how to implement chat functionality for querying a PDF document using LangChain, FAISS, and the OpenAI API. By leveraging text splitting, embeddings, and question-answering capabilities, we can give users an interactive chat interface for extracting information from PDFs. This approach can be extended and customized to specific requirements and can be a valuable tool for information retrieval and knowledge extraction from PDF documents.
In the next article, we will explore the concept of vector embeddings and how they can be used in developing multi-document applications (PDFs, docs, …). We will also introduce Pinecone, a vector database that provides fast and efficient similarity search capabilities. By combining vector embeddings and Pinecone, we can build powerful applications that efficiently search and retrieve information from multiple documents.
Till Then!