Using PDF Documents as Knowledge Bases and Making Q&A with GPT
Before starting the article, I want to mention our “Geeks of Data” Discord channel. You can join and say hello, and exchange ideas about data science, engineering, or analysis fields.🚀 Link
Find the code used in this tutorial here.
In various scenarios, it is necessary for GPT-3 to generate informative responses to user inquiries. This can include situations such as a chatbot for customer service that needs to provide solutions to frequently asked questions. Although GPT models have gained extensive knowledge during their training, there is often a need to incorporate and utilize a vast collection of specialized information.
In this tutorial, we offer a notebook to follow and a comprehensive explanation of how to provide this specialized information to LLMs (large language models) such as text-davinci-003 from OpenAI.
(The core of this tutorial is based on the OpenAI cookbook example here. We only change the information source and offer an alternative way of creating embeddings.)
First, we need to install the dependencies using pip. We won't walk through the installation itself, since this is a somewhat advanced tutorial. The modules we need are numpy, spacy, pandas, sentence_transformers, PyPDF2, and tiktoken.
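For completeness, a typical setup might look like the following (the spacy model download is our assumption; any small English pipeline works):

```shell
pip install numpy spacy pandas sentence_transformers PyPDF2 tiktoken
python -m spacy download en_core_web_sm
```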
Load NLP Module
First, we load the NLP pipeline we need from spacy; later, we use it to divide the text into sentences.
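A minimal sketch of this step. The fallback to a blank pipeline with spacy's rule-based `sentencizer` is our own addition, for the case where the `en_core_web_sm` model hasn't been downloaded; the function names are ours, not from the original notebook:

```python
import spacy

def load_sentence_splitter():
    # Prefer the full English pipeline; fall back to a blank pipeline
    # with the rule-based sentencizer if the model isn't installed.
    try:
        return spacy.load("en_core_web_sm")
    except OSError:
        nlp = spacy.blank("en")
        nlp.add_pipe("sentencizer")
        return nlp

def split_into_sentences(text: str, nlp) -> list:
    # spacy exposes sentence boundaries via doc.sents.
    return [sent.text.strip() for sent in nlp(text).sents]
```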
Extract Text
We have two ways of extracting text from PDFs here: sentence by sentence, or page by page.
The sentence-by-sentence approach relies on the spacy pipeline we loaded earlier to extract sentences; explaining spacy in detail is not the main concern of this tutorial.
This time we'll use the page-by-page approach. Checking the tail of the result shows that the extracted text aligns with our page numbers (you can use whatever PDF you want; no PDF is provided here).
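A sketch of the page-by-page extraction with PyPDF2; the function names here are our own, and we split out a small pure helper so the DataFrame construction is easy to inspect:

```python
from typing import List

import pandas as pd

def pages_to_frame(page_texts: List[str]) -> pd.DataFrame:
    # One row per page; 1-based page numbers to match the PDF viewer.
    return pd.DataFrame(
        {"page": range(1, len(page_texts) + 1), "text": page_texts}
    )

def extract_pages(pdf_path: str) -> pd.DataFrame:
    # PyPDF2 >= 3.0 exposes PdfReader / reader.pages / extract_text().
    from PyPDF2 import PdfReader
    reader = PdfReader(pdf_path)
    return pages_to_frame([page.extract_text() for page in reader.pages])
```

With `df = extract_pages("my_document.pdf")`, `df.tail()` lets you confirm the last rows line up with the last pages of the file.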
Embedding Extraction
In machine learning, embedding refers to a process of representing data, typically text or images, as numerical vectors in a high-dimensional space. This transformation is used to encode features of the data in a way that a machine learning model can effectively process and analyze.
For instance, word embeddings are a popular technique for representing words in natural language processing tasks. In this method, words are transformed into a high-dimensional vector, such that similar words are located close to each other in the vector space.
We first need to import the modules we need.
Then we can try our first embedding approach: turning each page into an embedding with the OpenAI embedding API. One problem with this approach is that, however good the OpenAI models are, rate limiting makes the API hard to use at volume. We had to add a 5-second sleep between calls, so 96 pages take far longer than they should. Finally, we return a dictionary with embeddings and ids.
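A sketch of this loop. We pass the embedding call in as a function so the structure is clear; `openai_embed` uses the pre-1.0 `openai` package, and the model name `text-embedding-ada-002` is our assumption (it was OpenAI's standard embedding model at the time):

```python
import time
from typing import Callable, Dict, List

def build_embedding_dict(
    pages: List[str],
    embed_fn: Callable[[str], List[float]],
    delay: float = 5.0,
) -> Dict[int, List[float]]:
    # Map page id -> embedding, sleeping between calls to stay
    # under the API rate limit.
    embeddings = {}
    for idx, text in enumerate(pages):
        embeddings[idx] = embed_fn(text)
        if delay and idx < len(pages) - 1:
            time.sleep(delay)
    return embeddings

def openai_embed(text: str) -> List[float]:
    # Requires the pre-1.0 openai package and OPENAI_API_KEY set.
    import openai
    resp = openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )
    return resp["data"][0]["embedding"]
```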
Another way of extracting such features is using open-source language models; here we use an example from Hugging Face. This approach is faster and free, but the catch is that we need to switch to another model if we want to use our system with another language.
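A sketch using sentence_transformers. The checkpoint name `all-MiniLM-L6-v2` is our assumption (a common English-only model); the optional `model` argument just lets you reuse an already-loaded model:

```python
from typing import Dict

def build_local_embedding_dict(
    pages,
    model=None,
    model_name: str = "all-MiniLM-L6-v2",
) -> Dict[int, list]:
    # English-only checkpoint; swap model_name for other languages.
    if model is None:
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(model_name)
    # encode() returns one vector per input text; no rate limits here.
    return {idx: vec for idx, vec in enumerate(model.encode(pages))}
```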
After these steps are handled, we should have our embedding dictionaries in hand.
Vector Similarity
The next step is to create functions that compare incoming queries against the embeddings we created earlier, so the most relevant parts can be used to build our prompts.
We use the dot product as our similarity metric here. The dot product of two vectors equals the product of their magnitudes times the cosine of the angle between them; for unit-length (normalized) vectors it reduces to cosine similarity, ranging from -1 (highest dissimilarity) through 0 (no correlation) to 1 (highest similarity). It is computationally efficient, but for unnormalized vectors it is sensitive to vector length. Since OpenAI embeddings are normalized to length 1, the dot product and cosine similarity coincide.
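The dot-product ranking can be sketched like this (the function names echo the cookbook's style but are our own):

```python
import numpy as np

def vector_similarity(x, y) -> float:
    # With unit-length embeddings the dot product equals cosine similarity.
    return float(np.dot(np.array(x), np.array(y)))

def order_by_similarity(query_embedding, embeddings: dict):
    # Return (score, page_id) pairs, most similar first.
    return sorted(
        (
            (vector_similarity(query_embedding, emb), idx)
            for idx, emb in embeddings.items()
        ),
        reverse=True,
    )
```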
Above we see how the functions for similarity search are defined and how we use them to perform a similarity search to get the most relevant parts.
Prompt Preparation
Next, we define some constants and set up tiktoken. tiktoken is the tokenizer used in OpenAI's models, so we can use it to measure the length of our prompt in tokens.
Below we construct a prompt. Basically, we fill in the chosen-sections part based on each section's similarity to our query, stopping once chosen_sections_len exceeds MAX_SECTION_LEN. Finally, we inject the selected parts into a predefined template to form our final prompt.
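A sketch of this function under our assumptions: the constant values and the instruction template approximate the OpenAI cookbook rather than reproduce it exactly, and we pass the token counter in as a function (with tiktoken, `enc = tiktoken.get_encoding("gpt2")` and `count_tokens = lambda s: len(enc.encode(s))` would do):

```python
from typing import Callable, Dict, List

MAX_SECTION_LEN = 500   # token budget for the context sections
SEPARATOR = "\n* "

def construct_prompt(
    question: str,
    ranked_ids: List[int],          # page ids, most relevant first
    pages: Dict[int, str],          # page id -> page text
    count_tokens: Callable[[str], int],
) -> str:
    chosen_sections = []
    chosen_sections_len = 0
    for idx in ranked_ids:
        section = SEPARATOR + pages[idx].replace("\n", " ")
        chosen_sections_len += count_tokens(section)
        if chosen_sections_len > MAX_SECTION_LEN:
            break
        chosen_sections.append(section)
    header = (
        "Answer the question as truthfully as possible using the provided "
        "context, and if the answer is not contained within the text below, "
        'say "I don\'t know."\n\nContext:\n'
    )
    return header + "".join(chosen_sections) + "\n\nQ: " + question + "\nA:"
```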
Below you can see how to use this function and what a finalized prompt looks like.
Making Knowledge-Injected Queries
Then, we create a function to ask questions to OpenAI DaVinci models using the context we just extracted and formed as a prompt.
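A sketch of this final step. The completion parameters are our assumptions (the cookbook uses temperature 0 for factual answers), the call uses the pre-1.0 `openai` package, and `complete_fn` is an injection point of our own so the function can be exercised without an API key:

```python
from typing import Callable, Optional

COMPLETIONS_PARAMS = {
    "model": "text-davinci-003",
    "temperature": 0.0,   # deterministic, stick to the context
    "max_tokens": 300,
}

def answer_query(prompt: str, complete_fn: Optional[Callable[[str], str]] = None) -> str:
    # By default, send the knowledge-injected prompt to OpenAI.
    if complete_fn is None:
        import openai  # pre-1.0 openai package; needs OPENAI_API_KEY
        def complete_fn(p):
            resp = openai.Completion.create(prompt=p, **COMPLETIONS_PARAMS)
            return resp["choices"][0]["text"]
    return complete_fn(prompt).strip()
```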
Finally, we can use this function to ask questions about the information we provided via a PDF file. In the example below, we ask about the ingredients and preparation of a golden lentil soup, and the model answers the way it was described in the PDF we gave it.
Okay, that’s pretty much it. Thank you very much for reading and following along, friends. If you want to access content like this and spend time with curious, intelligent, and hardworking colleagues, we also welcome you to our Discord server. 🚀 Link