Ask Any Book (LangChain + OpenAI ChatGPT)
Greetings, fellow nerds! Today, I’m going to show you how to build your very own AI-based question-answering system on top of any PDF content.
Why, you ask? Because why not! Plus, you’ll impress all your friends with your AI wizardry. So, let’s get started.
Prerequisites
Before we start building the system, we need to make sure that we have the following prerequisites installed on our machine:
- Python 3.x
- pip package manager
Installation
Let’s start by creating a new Python virtual environment and installing the required dependencies:
python3 -m venv env
source env/bin/activate
pip install openai langchain pinecone-client
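The snippets below reference OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_API_ENV. One way to define them is to read them from environment variables (a sketch; these variable names are my own convention, not something the libraries require):

```python
import os

# Placeholder names used throughout this article; create the keys at
# platform.openai.com and app.pinecone.io, then export them before running.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "")
PINECONE_API_ENV = os.environ.get("PINECONE_API_ENV", "")
```

Keeping keys out of source code also means you can commit your script without leaking credentials.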
Unstructured File Loader
Thanks to LangChain’s Unstructured File Loader, which can load almost any type of file content. Don’t forget to install the extra dependencies your use case needs (for PDFs you will at least need the unstructured package):

pip install unstructured
https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html
Extract the Book Content
In this step, we’ll extract the text content from the PDF file that we want to use for our question answering system.
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = UnstructuredPDFLoader("my-fav-book.pdf")
data = loader.load()
print(f'Found {len(data)} document(s), with {len(data[0].page_content)} characters in your document')
Found 1 document(s), with 256101 characters in your document
Split Book into Smaller Chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
print(f'Split into {len(texts)} documents')
Split into 129 documents
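Under the hood, the splitter slides over the text producing (optionally overlapping) windows. Here is a toy sketch of the idea; this is not LangChain’s actual implementation, which also prefers to break on paragraph and sentence boundaries before falling back to raw character cuts:

```python
def split_text(text, chunk_size=2000, chunk_overlap=0):
    # Greedy fixed-size chunking with overlap: each new chunk starts
    # chunk_size - chunk_overlap characters after the previous one.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks
```

A non-zero chunk_overlap helps when an answer straddles a chunk boundary, at the cost of indexing some text twice.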
Build Semantic Index
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# initialize pinecone
pinecone.init(
api_key=PINECONE_API_KEY, # find at app.pinecone.io
environment=PINECONE_API_ENV # next to api key in console
)
index_name = "langchain-exp"
namespace = "book"
docsearch = Pinecone.from_texts(
[t.page_content for t in texts], embeddings,
index_name=index_name, namespace=namespace)
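What the vector store gives us is nearest-neighbour search over embeddings: at query time, the question is embedded and chunks are ranked by cosine similarity to it. A minimal illustration with made-up vectors (a conceptual sketch, not Pinecone’s internals):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "index": each chunk mapped to a (made-up) embedding vector.
index = {
    "chunk about whales": [0.9, 0.1, 0.0],
    "chunk about ships":  [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's question
best = max(index, key=lambda k: cosine_similarity(index[k], query_vec))
```

In the real system, OpenAIEmbeddings produces the vectors and Pinecone does this ranking at scale with an approximate nearest-neighbour index.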
The Fun Part: Testing with Q&A
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
query = "Who are you?"
docs = docsearch.similarity_search(query,
include_metadata=True, namespace=namespace)
chain.run(input_documents=docs, question=query)
I'm the narrator of your favourite book. Ask me anything from it.
Keep asking questions for as long as your OpenAI credits last. :)
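The chain_type="stuff" strategy is the simplest one: it stuffs every retrieved chunk into a single prompt and sends that to the LLM. A rough sketch of the idea (the prompt wording here is illustrative, not LangChain’s actual template):

```python
def build_stuff_prompt(docs, question):
    # "Stuff" all retrieved chunks into one context block; this only works
    # while the combined chunks fit in the model's context window.
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

When the relevant chunks are too large to fit in one prompt, load_qa_chain also supports the map_reduce and refine chain types, which process chunks in multiple LLM calls.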
Resume Question and Answering
If your index is already built and persisted in Pinecone, reuse it like this:
index_name = "langchain-exp"
namespace = "book"
docsearch = Pinecone.from_existing_index(index_name, embeddings, namespace=namespace)
In conclusion, building an AI-based question and answering system over any PDF content is a fascinating and rewarding task. Thanks to powerful tools such as OpenAI, LangChain, and Pinecone, it has become easier than ever to extract meaning and insights from text data. By following the step-by-step process we have outlined in this article, you can leverage these technologies to create a Q&A system that can help users find the information they need quickly and accurately. With further experimentation and customization, you can also enhance the system’s performance and make it even more useful for a wide range of applications.
I hope that this article has provided you with valuable insights and inspiration for your own AI projects.