Develop a Q&A system using ChatGPT and embeddings

Dhanoop Karunakaran
Intro to Artificial Intelligence
3 min read · Aug 26, 2023
Source: [3]

ChatGPT is a powerful large language model (LLM) trained on a large corpus of data. We can use it for a wide variety of tasks, such as answering questions, summarisation, etc. One of the limitations of these models is that they are trained on old, publicly available data[1]. What if we need a system that can harness the power of a model like GPT but answer based on recently released data or internal data such as PDFs, process documents, etc.?

Word embeddings. Source: [1]

We utilise ChatGPT and embeddings to demonstrate how to build a Q&A system. The main reason for this approach is that we cannot simply feed an entire PDF to ChatGPT, as the model has a token limit per request.

Retrieval augmented generation (RAG). Source: [2]

In this project, we use the concept called retrieval augmented generation (RAG), where an LLM retrieves contextual documents from an external dataset as part of its execution. To implement RAG, we utilise a framework called Langchain, which simplifies the creation of applications using LLMs.

I have created a Jupyter Notebook to run the code step by step and wrapped it in a Docker container for easy installation. Finally, the code is published in the GitHub repo.
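
Roughly, the notebook assumes the following dependencies and an OpenAI API key. This is a minimal sketch; the exact package versions and the way the key is provided may differ in the repo.

# Assumed dependencies (install once, e.g. inside the container):
#   pip install langchain openai chromadb pypdf tiktoken

import os

# The snippets below expect an OpenAI API key; here it is read from the environment
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]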

Here are the steps to implement the Q&A system based on your PDF files:

1. First of all, load the PDF files.

from langchain.document_loaders import PyPDFDirectoryLoader

# Load every PDF in the folder; each page becomes a separate document
pdf_folder_path = 'pdfs/'
loader = PyPDFDirectoryLoader(pdf_folder_path)
pages = loader.load()
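
As an optional sanity check (my addition, not part of the original notebook), we can confirm the loader picked up the pages; each element of pages is a Langchain Document with page_content and metadata.

# Optional sanity check on the loaded pages
print(len(pages))                    # total number of pages across all PDFs
print(pages[0].metadata)             # e.g. source file and page number
print(pages[0].page_content[:200])   # first characters of the first page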

2. The next step is to split the documents into meaningful chunks. We utilise one of Langchain’s document splitters.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the pages into chunks of at most 450 characters, with no overlap
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)
splits = r_splitter.split_documents(pages)
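
Optionally, it is worth confirming how many chunks were produced and that they respect the 450-character limit (a small check of my own, not part of the original notebook):

# Optional check on the splits
print(len(splits))                                # number of chunks
print(max(len(s.page_content) for s in splits))   # should not exceed 450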

3. Now, we need to store all these splits. A good way to store them is to convert them to embeddings and keep them in a vector database, which enables semantic search over the data. Chroma (an open-source vector database) is used in this project.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Create and store the embeddings of the splits in the Chroma DB
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
persist_directory = 'docs/chroma/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

# Similarity search: retrieve the 3 chunks most relevant to the query
question = "statistical validation"
docs = vectordb.similarity_search(question, k=3)

# Persist the database to disk so it can be reused later
vectordb.persist()
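
The returned docs are ordinary Langchain Documents, so (as an optional check) their content and source metadata can be inspected directly:

# Inspect the retrieved chunks
for doc in docs:
    print(doc.metadata)               # source PDF and page number
    print(doc.page_content[:200])     # snippet of the matching chunk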

4. Finally, we can use the LLM to query the PDFs. The LLM answers the question by summarising the relevant splits retrieved from the vector database.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Chat model used to generate the final answer
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0, openai_api_key=OPENAI_API_KEY)

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Run chain: the retriever fetches relevant chunks, the LLM answers from them
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

question = "Is probability a class topic?"
result = qa_chain({"query": question})
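
Because return_source_documents=True was set, result is a dictionary holding both the generated answer and the chunks it was grounded on; for example:

# Inspect the answer and its supporting chunks
print(result["result"])                   # the answer generated by the LLM
for doc in result["source_documents"]:
    print(doc.metadata)                   # which PDF/page the answer drew from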

Using these steps, we can build a Q&A system that answers questions based on the content of our PDFs.

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.

References

  1. https://www.ruder.io/word-embeddings-1/
  2. https://www.deeplearning.ai/short-courses/langchain-chat-with-your-data/
  3. https://www.expert.ai/blog/enterprise-llms-of-the-future-bigger-is-not-better/
