Custom Question-Answering (QA) Bot: Transforming PDF Interactions with LangChain and ChromaDB
2023–07–02, Johannes Köppern
Welcome to this exciting journey where we will explore the creation of a custom Question-Answering (QA) bot. This bot is designed to interact with a PDF document, transforming the way we retrieve information from such resources. Instead of manually navigating through lengthy and often complex PDF documents, imagine if you could ask the document questions and receive direct, concise answers. That’s the problem our custom QA bot aims to solve, making information retrieval not just efficient, but also interactive and engaging.
In this blog post, we will provide a high-level introduction to building this QA bot, focusing on two key components: Langchain and ChromaDB. Langchain, a Python library, will be used to process the text from our PDF document, making it understandable and accessible for our bot. On the other hand, ChromaDB, a vector store, will help us manage and retrieve the information processed by Langchain in an efficient and organized manner. Together, these tools and OpenAI’s GPT form the backbone of our QA bot, enabling it to understand and respond to user queries accurately and swiftly.
By the end of this post, you will have a basic understanding of how to implement a QA bot and the role of the tools used in its creation. We assume that you already have a basic understanding of OpenAI’s GPT-3.5-turbo and Langchain, as these will be essential components in our bot and will be listed in the prerequisites section.
So, let’s embark on this journey and delve into the fascinating world of QA bots!
Prerequisites
Before we dive into the creation of our custom QA bot, there are a few prerequisites to ensure a smooth and successful journey.
Knowledge Requirements
This tutorial assumes that you have a basic understanding of OpenAI’s GPT-3.5-turbo and Langchain. Familiarity with Python programming and working with PDF documents will also be beneficial.
Software Requirements
You will need to have the following software installed:
– Python (we will be using version 3.9)
– The openai package (for access to GPT-3.5-turbo)
– Langchain
– ChromaDB
– python-dotenv
We will be building a console application, so there’s no need for a web framework like Streamlit.
Environment Setup
We will be using Anaconda, a popular Python distribution for scientific computing, to set up our environment. If you haven’t installed Anaconda yet, you can download it from here.
Once you have Anaconda installed, you can create a new environment named "chromadb_qa_bot" with Python 3.9. Open your terminal and run the following command:
conda create -n chromadb_qa_bot python=3.9
Activate the environment with:
conda activate chromadb_qa_bot
Now, let's install the openai, langchain, chromadb and python-dotenv packages using pip:
pip install openai langchain chromadb python-dotenv
And that's it! You have now set up your environment and are ready to start building your custom QA bot.
We also assume an environment variable OPENAI_API_KEY is set to your OpenAI API key; see this link for an example.
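If you keep the key in a local .env file instead, python-dotenv can load it at startup. A minimal sketch, assuming a .env file containing a line of the form OPENAI_API_KEY=sk-... (the assertion is just a convenience check, not required):
import os
from dotenv import load_dotenv, find_dotenv

# Load variables from a .env file in the project directory, if present
load_dotenv(find_dotenv())

# Fail early if the key is missing
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"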
Understanding the Tools
Before we start building our custom QA bot, let's take a moment to understand the tools we'll be using: Langchain and ChromaDB. We'll also briefly discuss the concepts of embeddings and vector stores.
Langchain
Langchain is a Python library that plays a crucial role in our QA bot. It is used to process the text from our PDF document, transforming it into a format that our bot can understand and interact with. By converting the raw text into structured data, Langchain allows our bot to accurately interpret user queries and provide relevant responses.
ChromaDB
ChromaDB is a vector store that manages and retrieves the information processed by Langchain. It organizes the data in a way that allows for efficient and accurate retrieval, which is essential for our bot's performance. When a user asks a question, ChromaDB helps the bot find the most relevant information to provide as a response.
Embeddings and Vector Stores
In the context of natural language processing, embeddings are a way of representing text as numerical vectors. This allows us to perform mathematical operations on the text, which is crucial for many tasks, including information retrieval.
A vector store, like ChromaDB, is a system that manages and retrieves these vectors. It organizes the vectors in a way that allows for efficient retrieval, which is particularly important for our QA bot. When a user asks a question, the bot needs to quickly find the most relevant information to provide as a response, and that's where the vector store comes in.
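To make the idea of embeddings concrete, here is a minimal sketch that compares two texts via the cosine similarity of their embedding vectors. It assumes OPENAI_API_KEY is set; the similarity computation itself is our own illustration, not part of Langchain:
import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Each text is mapped to a high-dimensional numerical vector
v1 = np.array(embeddings.embed_query("Who is Olivia Rodrigo?"))
v2 = np.array(embeddings.embed_query("Olivia Rodrigo is a singer."))

# Cosine similarity: values close to 1 indicate semantically similar texts
similarity = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)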
Now that we have a basic understanding of the tools we'll be using, let's move on to designing our QA bot.
Designing the QA Bot
Let's delve into the design of our custom QA bot. We'll discuss the bot's architecture, how it interacts with the PDF document, and how it processes user input and provides responses.
Overview of the QA Bot's Architecture
At a high level, our QA bot is structured around three key components: Langchain, ChromaDB, and OpenAI's GPT-3.5-turbo. Langchain processes the text from our PDF document, transforming it into a structured format that our bot can understand. ChromaDB manages and retrieves this structured data, allowing our bot to find the most relevant information in response to user queries. Finally, OpenAI's GPT-3.5-turbo is used to generate human-like responses based on this information.
Interaction with the PDF Document
The bot interacts with the PDF document through Langchain. It uses this Python library to process the raw text from the document, converting it into structured data. This involves extracting meaningful information from the text and representing it as numerical vectors, or embeddings. These embeddings are then stored in ChromaDB, ready to be retrieved when needed.
Processing User Input and Providing Responses
When a user asks a question, the bot first processes the input using Langchain, converting it into an embedding. It then uses ChromaDB to find the most relevant information in response to the query. This information is passed to OpenAI's GPT-3.5-turbo, which generates a human-like response. The bot then delivers this response to the user, completing the interaction.
With this high-level understanding of the bot's design, we're now ready to move on to the implementation.
Implementing the QA Bot
With a solid understanding of the tools and the design of our QA bot, we're now ready to delve into the implementation. In this section, we'll discuss how to build the bot and how to run it.
Building the QA Bot
The first step in implementing our QA bot is to use Langchain to process the text from our PDF document. This involves extracting meaningful information from the text and converting it into numerical vectors, or embeddings. These embeddings are then stored in ChromaDB, ready to be retrieved when needed. So we need these dependencies:
import openai
import os
from dotenv import load_dotenv, find_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
In the next step, our PDF file is loaded into ChromaDB:
# -------------------------------------------------------
# Create vector db/ChromaDB
# -------------------------------------------------------
embeddings = OpenAIEmbeddings()
loader = PyPDFLoader("docs/olivia_rodrigo.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
texts = splitter.split_documents(documents)
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="db"
)
vectordb.persist()  # This stores the db in the specified folder
The PDF's pages are loaded, the texts are split recursively, and the chunks are converted to embedding vectors, which are stored in the vector DB.
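Because the database is persisted, a later run can reload it from the "db" folder instead of re-embedding the PDF. A minimal sketch under the same setup:
# Reload the persisted vector DB in a later session
vectordb = Chroma(
    persist_directory="db",
    embedding_function=embeddings
)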
Text splitter: A text splitter is a tool used to split large text into smaller chunks. This is often necessary when working with language models, as they are limited in how much text can be passed to them at once. By splitting the text into smaller chunks, it becomes possible to process the text in smaller, more manageable pieces.
A regular text splitter splits text into smaller chunks based on a specific separator character or string. For example, it might split text into paragraphs by splitting on the newline character "\n".
A recursive text splitter, on the other hand, is a more advanced type of text splitter that tries to split text into smaller chunks by recursively trying different separator characters or strings until the chunks are small enough. For example, it might first try to split text into paragraphs by splitting on the double newline character "\n\n". If the resulting chunks are still too large, it might then split them further on the single newline character "\n", then on spaces " ", and so on.
The main advantage of using a recursive text splitter over a regular text splitter is that it can produce more semantically meaningful chunks of text. By trying to keep paragraphs, sentences, and words together for as long as possible, a recursive text splitter can produce chunks of text that are more coherent and easier to understand.
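A small, self-contained sketch of this behavior; the sample text, chunk size, and separators (which are Langchain's defaults) are just for illustration:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Olivia Rodrigo is a singer.\n\n"
    "She released her debut album in 2021. "
    "It was a commercial success."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    separators=["\n\n", "\n", " ", ""]  # tried in order: paragraphs, lines, words, characters
)

for chunk in splitter.split_text(text):
    print(repr(chunk))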
As outlined in the design section, when a user asks a question the bot converts it into an embedding, uses ChromaDB to retrieve the most relevant chunks, and passes them to OpenAI's GPT-3.5-turbo, which generates a human-like response:
# -------------------------------------------------------
# Ask the PDF questions
# -------------------------------------------------------
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=vectordb.as_retriever(),
    return_source_documents=True
)
queries = [
    "Who is Olivia Rodrigo?",
    "Who is Albert Einstein?",
    "Who is Albert Einstein? If no answer is provided within the context, please answer with your general knowledge."
]
for query in queries:
    result = qa({"query": query})
    print(result["result"])
    print(result["source_documents"])
    print("-------------------")
In this GitHub repository you can find the entire source code. The bot is in the file app.py.
Running this script will output:
Olivia Rodrigo is an actress and singer. [Document(page_content='https://en.wikipedia.org/wiki/Olivia_Rodrigo 1/19Olivia Rodrigo\nRodrigo in 2021', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 0}), Document(page_content='https://en.wikipedia.org/wiki/Olivia_Rodrigo 1/19Olivia Rodrigo\nRodrigo in 2021', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 0}), Document(page_content='https://en.wikipedia.org/wiki/Olivia_Rodrigo 1/19Olivia Rodrigo\nRodrigo in 2021', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 0}), Document(page_content='Olivia Rodrigo (https://www .imdb.com/name/nm71 11120/) at IMDb', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 18})]
-------------------
I don't know. [Document(page_content='mother ."', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 7}), Document(page_content='mother ."', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 7}), Document(page_content='mother ."', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 7}), Document(page_content='Twitter .', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 8})]
-------------------
Albert Einstein was a renowned physicist who developed the theory of relativity and is best known for his equation E=mc². He made significant contributions to the field of physics and is considered one of the greatest scientists of all time. [Document(page_content='"early status as Gen-', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 1}), Document(page_content='"early status as Gen-', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 1}), Document(page_content='"early status as Gen-', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 1}), Document(page_content='mother ."', metadata={'source': 'docs/olivia_rodrigo.pdf', 'page': 7})]
-------------------
In the output you can see how the vector store performs a similarity search to identify potentially suitable chunks of text, which are then fed to the LLM as part of the overall prompt.
Similarity search: A vector store performs a similarity search by transforming data into numerical representations known as vectors or embeddings, and then finding and retrieving contextually similar items from large collections of structured or unstructured data. The key idea behind vector search databases is to represent data items (e.g., images, documents, user profiles) as vectors in a high-dimensional space. Similarity between vectors is then measured using a distance metric, such as cosine similarity or Euclidean distance. The goal of a vector search database is to quickly find the most similar vectors to a given query vector.
A word vector can be thought of as a point in a multi-dimensional space, where each dimension represents a particular aspect or characteristic of the word. For example, a word vector for the word “queen” might have high values for dimensions representing “femininity” and “royalty” and low values for dimensions representing “masculinity”. The goal of this process is to enable NLP models to understand the meaning and semantics of different words and their context within a sentence or text.
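You can observe this retrieval step in isolation, without the LLM, by querying the vector store directly. A minimal sketch reusing the vectordb object from above; similarity_search_with_score returns each chunk together with its distance score:
# Retrieve the 4 chunks closest to the query in embedding space
docs_and_scores = vectordb.similarity_search_with_score("Who is Olivia Rodrigo?", k=4)

for doc, score in docs_and_scores:
    # Chroma returns a distance, so lower values mean more similar
    print(score, doc.page_content[:80])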
Testing the QA Bot
I'll discuss testing this project in a separate post. Please see here.
Skipped Deployment Section
We intentionally skipped the deployment section in this post to keep it concise. However, we acknowledge that deployment is a crucial step in making the bot available for users. While we didn't cover it here, you can find resources on deploying similar bots in the GitHub repository, and we may explore this topic in a future post.
Future Work
Looking ahead, there are several ways we could improve or expand our QA bot. We could integrate more advanced natural language processing techniques to improve the bot's understanding of user queries. We could also expand the bot's capabilities to handle more complex queries or to interact with other types of documents. Finally, we could explore deploying the bot on different platforms to make it more accessible to users.
There is also criticism of LangChain in terms of its usefulness. I largely share this criticism; at the same time, LangChain is well suited for getting started with LLMs, see link 1, link 2. For a production-ready application, it therefore probably makes sense to develop without LangChain.
Acknowledgements
Many thanks to the makers of LangChain, to GPT-4 for its help in writing this text and Bing for creating the images in this post.
Final Thoughts
In this post, we delved into the design and implementation of a custom QA bot. We discussed how the bot uses Langchain to process text from a PDF document, ChromaDB to manage and retrieve this processed information, and OpenAI's GPT-3.5-turbo to generate human-like responses. We also provided a high-level guide on how to implement and test the bot, with detailed code examples and instructions available in the accompanying GitHub repository.