Building an LLM-Powered Question Answering system for private documents on Discord
Author: Moeed Iftikhar at DreamAI Software (Pvt) Ltd
Building a Question and Answer (Q&A) system is a practical way to apply Natural Language Processing (NLP) to your own data. This tutorial shows how to build one from a mix of open-source libraries and hosted APIs: LangChain, GPT-3 (ChatGPT), OpenAI Embeddings, and a FAISS vector index. The system is then exposed as a bot inside a Discord server, so users can ask questions about their own PDF documents and receive answers grounded in those documents. Let’s walk through the steps.
Prerequisites
- Basic knowledge of Python programming.
- Familiarity with NLP concepts.
- Python installed on your computer.
- An OpenAI API key, used for both GPT-3 (ChatGPT) and OpenAI Embeddings. LangChain and FAISS are Python libraries and do not require API keys.
- Discord developer account and a bot token.
Step 1: Set Up Your Development Environment
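Before we begin, make sure you have the following libraries and keys:
- LangChain (install it using pip install langchain)
- OpenAI API key for GPT-3 (ChatGPT) and OpenAI Embeddings
- FAISS Python library (install it using pip install faiss-cpu; the package on PyPI is named faiss-cpu, not faiss)
- The other packages imported in this tutorial: openai, tiktoken, pdfplumber, python-dotenv, and discord.py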
Step 2: Create a Discord Bot
Follow these instructions to create a Discord Bot:
- Get an API key from the OpenAI website. The code uses it to call ChatGPT and to create embeddings.
- Open Discord in your browser or the desktop application.
- The Discord interface has a left sidebar displaying your current servers and messages.
- Locate the “Add a Server” button, which typically appears as a small plus (+) icon on the left sidebar.
- Click the “Add a Server” button to initiate the server creation process. A dialog box will appear.
- Select “Create My Own” in the dialog box to create a new server from scratch.
- Discord may ask you some questions regarding the purpose of your server. If you don’t require any specific options, you can simply click “skip this question.”
- In the blank name field that appears, enter the desired name for your server. For this project, let’s name it “Test Server”.
- Go to the official Discord website and scroll down to the bottom of the page. Under the “Resources” section, click “Developers”.
- On the Apps view card, click “Get started”.
- On the new page, locate the button that says “Create App” just below the heading “Step 1: Creating an app.” Click it.
- A dialog box will open. In the text box provided, enter the name of your bot, such as “Test Bot.” Next, click the checkbox to agree to the Discord Developer Terms of Service and Developer Policy.
- Now, navigate to the “Bot” option on the left-side navigation bar.
- Set “PRESENCE INTENT,” “SERVER MEMBERS INTENT,” and “MESSAGE CONTENT INTENT” to “True.” Save the changes.
- Click on the “Reset Token” button. A dialog box will appear, confirming your action. Click on the “Yes, do it!” button.
- Copy the token string that is provided and save it somewhere private. You will need it later to authenticate the bot.
- Navigate to the “URL Generator” option located underneath the “OAuth2” section on the left-side navigation bar.
- In the “SCOPES” options, check the “Bot” checkbox.
- In the “BOT PERMISSIONS” options, select “Administrator.”
- Copy the URL displayed under the heading “GENERATED URL” and paste it into a new tab in your browser. Visit the site by clicking or pressing Enter.
- On the page that opens, select the server where you want the bot to work. In this case, choose “Test Server” and confirm all the changes.
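- Now you can set up the integration between Discord and ChatGPT using the two secrets acquired above: the OpenAI API key from the first item in this list and the Discord bot token you copied after resetting it. Store them in a .env file under the names OPENAI_API_KEY and DISCORD_API_TOKEN, then load them as shown in the code below.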
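import os
from dotenv import load_dotenv
# Read the key/value pairs from the .env file into the process environment
load_dotenv()
# Both values are None if the corresponding variable is missing
DISCORD_TOKEN = os.getenv("DISCORD_API_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")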
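Because os.getenv quietly returns None when a variable is missing, a typo in the .env file only shows up later as a confusing authentication error. The sketch below shows one way to fail early instead. The helper name require_env is hypothetical and not part of python-dotenv or this tutorial's original code.

```python
import os


def require_env(name: str, environ=os.environ) -> str:
    """Return the value of an environment variable or raise a clear error.

    `environ` defaults to the real environment; passing a plain dict makes
    the helper easy to try out without touching your actual settings.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name!r}; add it to your .env file."
        )
    return value


# Hypothetical usage, mirroring the variable names used in this tutorial
# (call these after load_dotenv()):
#   DISCORD_TOKEN = require_env("DISCORD_API_TOKEN")
#   OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

With this in place, a missing token stops the script at startup with a message that names the variable to fix.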
Step 3: Set Up Vector Storage and Implement the Question and Answer Chain
In this step, we will not only set up vector storage but also import the necessary libraries, retrieve text from PDFs, split the text into manageable chunks, and store it in a FAISS vector database. We will then implement the Question and Answer chain for local PDFs. Let’s get started:
Import the Required Libraries
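Begin by importing the necessary libraries at the top of your script. They handle the connection to ChatGPT, vector storage, PDF loading, and text splitting. The import paths below match the LangChain release this tutorial was written against; newer LangChain releases have moved some of these classes into separate packages, so adjust the paths if an import fails.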
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PDFPlumberLoader
Retrieve Text from PDFs and Prepare Documents
The following code is used to retrieve all PDFs from the ‘pdf’ folder, extract all the text from the PDFs, and save it in the ‘documents’ list, which we will use later:
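# Create one loader per file in the 'pdf' folder
entries = os.listdir('pdf/')
loaders = []
for entry in entries:
    loaders.append(PDFPlumberLoader('pdf/' + entry))
# Load every PDF and collect the resulting documents in one list
documents = []
for loader in loaders:
    documents.extend(loader.load())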
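The loop above hands every entry in the ‘pdf’ folder to the loader, so a stray text file or subfolder would cause a loading error. As a minimal sketch, assuming you only want files ending in .pdf, the folder can be filtered first. The function name list_pdf_paths is hypothetical.

```python
from pathlib import Path


def list_pdf_paths(folder: str) -> list[str]:
    """Return the sorted paths of the PDF files directly inside `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        # Skip subfolders and anything that does not end in .pdf (any case)
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to PDFPlumberLoader in place of the 'pdf/' + entry string used above.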
Split and Embed Text, Create FAISS Vector Database
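Next, split the text stored in the ‘documents’ list into more manageable chunks. You can choose a suitable chunk size, such as 650 characters per chunk, with an overlap of 150 characters so that sentences cut at a chunk boundary still appear whole in a neighbouring chunk. After splitting, embed the text using OpenAIEmbeddings and store it in a FAISS vector database, which lets us search efficiently for the passages most relevant to a question. The same block also creates a retriever that returns the two most similar chunks and a RetrievalQA chain that passes them to ChatGPT.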
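# Split the documents into chunks of about 650 characters with a 150-character overlap
text_splitter = CharacterTextSplitter(chunk_size=650, chunk_overlap=150)
texts = text_splitter.split_documents(documents)
# Embed each chunk and index the vectors in FAISS
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)
# Return the 2 most similar chunks for each query
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
# "stuff" places the retrieved chunks directly into the prompt sent to the model
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True)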
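To build intuition for how chunk_size and chunk_overlap interact, here is a simplified sliding-window splitter in plain Python. It is only an illustration: LangChain's CharacterTextSplitter splits on a separator first and then merges the pieces, so its chunks will generally differ from these. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 650, chunk_overlap: int = 150) -> list[str]:
    """Split `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers make the overlap visible: each chunk repeats
# the last 2 characters of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

With the tutorial's settings (650 and 150), each window advances by 500 characters, so roughly a quarter of every chunk is shared with its neighbour.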
Implement the Question and Answer Chain
The RetrievalQA chain created above already combines the retriever and ChatGPT, so answering a question only requires calling it. The helper below wraps that call: it passes the user's question to the chain, which retrieves the most relevant chunks from the FAISS vector database and returns the model's answer.
def get_answer(query):
    # Run the chain; the answer text is stored under the 'result' key
    result = qa({"query": query})
    return result['result']
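Since the chain was created with return_source_documents=True, its result also carries the retrieved chunks under the 'source_documents' key, which get_answer currently discards. The sketch below shows one way to append the sources to the answer. The function name format_answer_with_sources is hypothetical, the example dictionary is a stand-in for a real chain result, and the 'source' and 'page' metadata keys are an assumption about what the PDF loader records.

```python
from types import SimpleNamespace


def format_answer_with_sources(result: dict) -> str:
    """Append a de-duplicated list of sources to the answer text."""
    answer = result["result"]
    labels = []
    for doc in result.get("source_documents", []):
        # Assumed metadata keys; fall back gracefully if they are absent
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        label = source if page is None else f"{source} (page {page})"
        if label not in labels:
            labels.append(label)
    if not labels:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(f"- {label}" for label in labels)


# Stand-in for the dictionary returned by the chain (made-up content)
example = {
    "result": "The warranty lasts two years.",
    "source_documents": [
        SimpleNamespace(metadata={"source": "pdf/manual.pdf", "page": 3}),
        SimpleNamespace(metadata={"source": "pdf/manual.pdf", "page": 3}),
    ],
}
print(format_answer_with_sources(example))
```

Showing where an answer came from makes it easier for users to check the bot's reply against the original PDF.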
With these steps completed, your Q&A system is well-prepared to handle user queries and provide accurate answers based on the content of the local PDF documents.
Step 4: Implement the Discord Bot and Run It
In your Python script, implement the Discord bot using a library like discord.py. You can define bot commands, events, and functionality. Here’s a simplified example of implementing a basic bot:
First, import the libraries needed to use the Discord API:
import os
import discord
from discord.ext import commands
In the code below, we define a function called ‘main’ which starts from the default intents, enables the message content intent so the bot can read message text, and sets the command prefix to ‘!’. The command prefix lets the bot tell whether a message written by a user is a command programmed in the script or just a regular message. We set up two commands: the first command, ‘ping,’ returns the message ‘pong’ to the user, while the second command, ‘answer,’ receives a question as a sequence of words and joins them into a single string. This string serves as input for the function we previously created called ‘get_answer,’ which answers the question using the locally stored documents. The answer is then sent to the Discord channel. The bot logs in with the bot token we acquired in Step 2 and keeps running until it is stopped.
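def main():
    # Start from the default intents and also allow reading message text
    intents = discord.Intents.default()
    intents.message_content = True
    bot = commands.Bot(command_prefix='!', intents=intents)

    @bot.command()
    async def ping(ctx):
        # Simple health check: replies "pong" to "!ping"
        await ctx.send('pong')

    @bot.command()
    async def answer(ctx, *question):
        # Join the words after "!answer" back into a single question
        temp = " ".join(question)
        await ctx.send(get_answer(temp))

    # Blocks until the bot is stopped
    bot.run(DISCORD_TOKEN)

if __name__ == "__main__":
    main()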
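Two practical issues can appear with the ‘answer’ command: Discord rejects messages longer than 2000 characters, and a slow call to get_answer blocks the bot's event loop while it waits for the model. The sketch below addresses both under simple assumptions. The names split_message and answer_without_blocking are hypothetical, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio

DISCORD_MESSAGE_LIMIT = 2000  # maximum characters Discord accepts in one message


def split_message(text: str, limit: int = DISCORD_MESSAGE_LIMIT) -> list[str]:
    """Cut `text` into consecutive pieces of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [text[i:i + limit] for i in range(0, len(text), limit)]


async def answer_without_blocking(get_answer, question: str) -> list[str]:
    """Run a blocking get_answer call in a worker thread, then split the reply."""
    reply = await asyncio.to_thread(get_answer, question)
    return split_message(reply)


# Hypothetical usage inside the 'answer' command:
#   for part in await answer_without_blocking(get_answer, temp):
#       await ctx.send(part)
```

This version cuts at fixed positions, which may split a word in two; splitting at the nearest newline or space would read better but is left out to keep the sketch short.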
You can run this Python file in an IDE like Visual Studio or Spyder or execute it using the command prompt with the command python YourPythonFileName.py. Ensure that the command prompt is in the same directory where your Python file is located. This method works on both Linux and Windows.
Conclusion
In this tutorial, we built a robust Question and Answer (Q&A) system using a powerful combination of cutting-edge technologies. We harnessed the capabilities of LangChain, integrated the intelligence of GPT-3 (ChatGPT), utilized OpenAI Embeddings, and implemented efficient vector storage with FAISS. Our ultimate goal was to create a Q&A system that could answer questions based on local PDF documents while making it accessible through Discord.
The heart of our Q&A system lies in vector storage and LangChain integration. We extracted text from PDFs, split it into manageable chunks, and embedded it using OpenAI Embeddings. The embeddings were then stored in a FAISS vector database, so the chunks most relevant to a question can be found quickly by similarity search. Finally, we implemented the Question and Answer chain, enabling users to seek answers within the local PDFs.
With this powerful Q&A system at your disposal, you’re equipped to efficiently retrieve information from local documents, opening doors to endless possibilities in natural language understanding and information management. Start your journey today and explore the potential of this versatile system!