Ask Any Book (LangChain + OpenAI ChatGPT)
Greetings, fellow nerds! Today, I’m going to show you how to build your very own AI-based question-answering system on top of any PDF content.
Why, you ask? Because why not! Plus, you’ll impress all your friends with your AI wizardry. So, let’s get started.
Prerequisites
Before we start building the system, we need to make sure that we have the following prerequisites installed on our machine:
- Python 3.x
- pip package manager
Installation
Let’s start by creating a new Python virtual environment and installing the required dependencies:
python3 -m venv env
source env/bin/activate
pip install openai langchain pinecone-client
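The snippets below reference OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_API_ENV. One way to define them is to read them from environment variables (a sketch; these variable names are my own convention, not something the libraries require):

```python
import os

# Placeholder names used throughout this article; create the keys at
# platform.openai.com and app.pinecone.io, then export them before running.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "")
PINECONE_API_ENV = os.environ.get("PINECONE_API_ENV", "")
```

Keeping keys out of source code also means you can commit your script without leaking credentials.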
Unstructured File Loader
Thanks to LangChain’s Unstructured File Loader, which can load almost any type of file content. Don’t forget to install the extra dependencies your use case needs (for PDFs you will at least need the unstructured package):

pip install unstructured
https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html
Extract the Book Content
In this step, we’ll extract the text content from the PDF file that we want to use for our question answering system.
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = UnstructuredPDFLoader("my-fav-book.pdf")
data = loader.load()
print(f'Found {len(data)} document(s), with {len(data[0].page_content)} characters in your document')
Found 1 document(s), with 256101 characters in your document
Split Book into Smaller Chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
print(f'Split into {len(texts)} documents')
Split into 129 documents
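Under the hood, the splitter slides over the text producing (optionally overlapping) windows. Here is a toy sketch of the idea; this is not LangChain’s actual implementation, which also prefers to break on paragraph and sentence boundaries before falling back to raw character cuts:

```python
def split_text(text, chunk_size=2000, chunk_overlap=0):
    # Greedy fixed-size chunking with overlap: each new chunk starts
    # chunk_size - chunk_overlap characters after the previous one.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks
```

A non-zero chunk_overlap helps when an answer straddles a chunk boundary, at the cost of indexing some text twice.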
Build Semantic Index
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# initialize pinecone
pinecone.init(
api_key=PINECONE_API_KEY, # find at app.pinecone.io
environment=PINECONE_API_ENV # next to api key in console
)
index_name = "langchain-exp"
namespace = "book"
docsearch = Pinecone.from_texts(
[t.page_content for t in texts], embeddings,
index_name=index_name, namespace=namespace)
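What the vector store gives us is nearest-neighbour search over embeddings: at query time, the question is embedded and chunks are ranked by cosine similarity to it. A minimal illustration with made-up vectors (a conceptual sketch, not Pinecone’s internals):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "index": each chunk mapped to a (made-up) embedding vector.
index = {
    "chunk about whales": [0.9, 0.1, 0.0],
    "chunk about ships":  [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's question
best = max(index, key=lambda k: cosine_similarity(index[k], query_vec))
```

In the real system, OpenAIEmbeddings produces the vectors and Pinecone does this ranking at scale with an approximate nearest-neighbour index.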
The Fun Part: Testing with Q&A
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
query = "Who are you?"
docs = docsearch.similarity_search(query,
include_metadata=True, namespace=namespace)
chain.run(input_documents=docs, question=query)
I'm the narrator of your favourite book. Ask me anything from it.
Keep asking questions for as long as your OpenAI credits last. :)
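The chain_type="stuff" strategy is the simplest one: it stuffs every retrieved chunk into a single prompt and sends that to the LLM. A rough sketch of the idea (the prompt wording here is illustrative, not LangChain’s actual template):

```python
def build_stuff_prompt(docs, question):
    # "Stuff" all retrieved chunks into one context block; this only works
    # while the combined chunks fit in the model's context window.
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

When the relevant chunks are too large to fit in one prompt, load_qa_chain also supports the map_reduce and refine chain types, which process chunks in multiple LLM calls.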
Resume Question and Answering
If your index is already built and persisted in Pinecone, reuse it like this:
index_name = "langchain-exp"
namespace = "book"
docsearch = Pinecone.from_existing_index(index_name, embeddings, namespace=namespace)
In conclusion, building an AI-based question and answering system over any PDF content is a fascinating and rewarding task. Thanks to powerful tools such as OpenAI, LangChain, and Pinecone, it has become easier than ever to extract meaning and insights from text data. By following the step-by-step process we have outlined in this article, you can leverage these technologies to create a Q&A system that can help users find the information they need quickly and accurately. With further experimentation and customization, you can also enhance the system’s performance and make it even more useful for a wide range of applications.
I hope that this article has provided you with valuable insights and inspiration for your own AI projects.