Chat with your PDF: Using Langchain, FAISS, and OpenAI to Query PDFs

johnthuo
4 min read · Jun 4, 2023

I have recently been immersing myself in Langchain agents, chains, and word embeddings to deepen my understanding of building language-model-driven applications. To demonstrate this, today's blog covers one such application. In this post, we will explore how to build chat functionality for querying a PDF document using Langchain, Facebook AI Similarity Search (FAISS), and the OpenAI API. The goal is to create a chat interface where users can ask questions about the PDF's content, and the system will provide relevant answers based on the text in the PDF.

Langchain is a Python library that provides tools and functionality for natural language processing (NLP) tasks. It offers text-splitting capabilities, embedding generation, and integration with powerful NLP models like OpenAI's GPT-3.5. FAISS, on the other hand, is a library for efficient similarity search and clustering of dense vectors.
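The core operation FAISS provides, nearest-neighbour search over dense vectors, can be illustrated with plain NumPy. This is a toy sketch of the idea, not the FAISS API itself:

```python
import numpy as np

# A toy "index" of four 3-dimensional embedding vectors
index = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

def nearest(query, vectors):
    # Cosine similarity: dot product of L2-normalised vectors
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    # Return the row index of the most similar vector
    return int(np.argmax(sims))

print(nearest(np.array([1.0, 0.05, 0.0]), index))  # prints 0 (row 0 is closest)
```

FAISS does this same lookup with indexing structures that stay fast over millions of vectors, which is why we use it as the knowledge base instead of a brute-force scan.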

In our chat functionality, we will use Langchain to split the PDF text into smaller chunks, convert the chunks into embeddings using OpenAIEmbeddings, and create a knowledge base using FAISS. This knowledge base will allow us to perform an efficient similarity search when the user asks a question. For question-answering, we will use OpenAI's GPT-3.5 model through Langchain's LLM wrappers.

Setup and Dependencies
Before diving into the code, let's install the necessary dependencies. We'll be using the following libraries:

  • python-dotenv: For loading the OpenAI API key from a .env file
  • PyPDF2: For reading PDF files
  • streamlit: For building the user interface
  • langchain: For text splitting, embeddings, FAISS integration, and question-answering
  • openai: For accessing the OpenAI GPT-3.5 model

To install these dependencies, you can use pip:

pip install python-dotenv PyPDF2 streamlit langchain openai
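load_dotenv() expects the API key in a .env file in the project root. A minimal example (the key value here is a placeholder, not a real key):

```
OPENAI_API_KEY=sk-your-key-here
```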

Implementing the Chat Functionality

from dotenv import load_dotenv
import os
from PyPDF2 import PdfReader
import streamlit as st
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

# Load environment variables
load_dotenv()

Next, let's define a function called process_text that will split the text into smaller chunks using Langchain's CharacterTextSplitter and convert the chunks into embeddings using OpenAIEmbeddings. We will then create a knowledge base using FAISS and return it:

def process_text(text):
    # Split the text into chunks using Langchain's CharacterTextSplitter
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)

    # Convert the chunks of text into embeddings to form a knowledge base
    embeddings = OpenAIEmbeddings()
    knowledgeBase = FAISS.from_texts(chunks, embeddings)

    return knowledgeBase
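The splitter's sliding-window behaviour (chunks of at most chunk_size characters, with chunk_overlap characters repeated between consecutive chunks so context isn't lost at the boundaries) can be approximated in a few lines of plain Python. This is a simplified sketch; the real CharacterTextSplitter also tries to break on the separator rather than mid-word:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    # Step forward by (chunk_size - chunk_overlap) so that consecutive
    # chunks share chunk_overlap characters of context
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks))     # 4 chunks
print(len(chunks[0]))  # 1000
```

The overlap matters for retrieval quality: a sentence that straddles a chunk boundary still appears whole in at least one chunk.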

Now, let's define the main function, main, where we will build the user interface using Streamlit. We start by setting the title of the app, then add a file uploader where users can upload their PDF documents. If the user has uploaded a PDF, we read its content using PdfReader from PyPDF2 and store it in the text variable. We then process the text to create the knowledge base using the process_text function we defined earlier. If the user asks a question, we perform a similarity search on the knowledge base to retrieve the documents most relevant to the query.

def main():
    st.title("Chat with your PDF 💬")

    pdf = st.file_uploader('Upload your PDF Document', type='pdf')

    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        # Text variable will store the pdf text
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # Create the knowledge base object
        knowledgeBase = process_text(text)

        query = st.text_input('Ask a question to the PDF')
        cancel_button = st.button('Cancel')

        if cancel_button:
            st.stop()

        if query:
            docs = knowledgeBase.similarity_search(query)

Next, we initialize the LLM, OpenAI's GPT-3.5 model, through Langchain's OpenAI wrapper. We then load the question-answering chain using load_qa_chain from Langchain, specifying the LLM instance and the chain type as 'stuff'. Before running the chain, we open a context manager using get_openai_callback to keep track of the cost incurred by the OpenAI API calls. Finally, we display the response to the user in the Streamlit app.

            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type='stuff')

            with get_openai_callback() as cost:
                response = chain.run(input_documents=docs, question=query)
                print(cost)

            st.write(response)


if __name__ == "__main__":
    main()
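get_openai_callback is a context manager that accumulates token counts and cost across the API calls made inside the with block. The pattern it follows can be sketched with contextlib; the CostTracker class and its per-token rate below are hypothetical stand-ins, not Langchain's implementation:

```python
from contextlib import contextmanager

class CostTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0.0

    def record(self, tokens, cost_per_token=0.000002):
        # Each tracked call adds its token usage and estimated cost
        self.total_tokens += tokens
        self.total_cost += tokens * cost_per_token

@contextmanager
def track_cost():
    tracker = CostTracker()
    yield tracker  # calls made inside the `with` block report to the tracker

with track_cost() as cost:
    cost.record(500)   # e.g. one prompt/completion round-trip
    cost.record(1500)  # e.g. the question-answering call

print(cost.total_tokens)  # prints 2000
```

Printing the callback object after the chain runs, as the app above does, is a simple way to watch what each question costs while developing.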

You can now run the application with `streamlit run app.py` (assuming you saved the code as app.py). Upload a PDF and ask your question. For now, query the PDF about topics it actually covers; as a challenge, you can modify the code to return a fallback response when your question is not covered by the PDF. Using the PDF in this Github Repository and the query 'What is neural style transfer?', I get the following output:

Query Output

In conclusion, we have seen how to implement a chat functionality to query a PDF document using Langchain, FAISS, and the OpenAI API. By leveraging text splitting, embeddings, and question-answering capabilities, we can provide users with an interactive chat interface to extract information from PDFs. This approach can be extended and customized based on specific requirements and can be a valuable tool for information retrieval and knowledge extraction from PDF documents.

In the next article, we will explore the concept of vector embeddings and how they can be used in developing multi-document applications(PDFs, docs, …). We will also introduce Pinecone, a vector database that provides fast and efficient similarity search capabilities. By combining vector embeddings and Pinecone, we can build powerful applications that efficiently search and retrieve information from multiple documents.

Till Then!
