GPT-4 Chatbot Guide: Mastering Embeddings and Personalized Knowledge Bases

A guide on how to use the OpenAI embeddings endpoint to answer questions based on information you provide, such as company documentation or local legislation.

Ben Selleslagh
Vectrix
Mar 26, 2023


The recent release of GPT-4 and the chat completions endpoint allows developers to create a chatbot using the OpenAI REST service. If you want your chatbot to answer questions from your own knowledge base instead of the model's pre-trained data, this article will guide you through the process in a few steps:

  1. Scrape source data from the web, divide it into sections and store it as a CSV file
  2. Load the CSV file for further processing and set the correct indexes
  3. Calculate vectors for each of the sections in the data file, using the embeddings endpoint
  4. Search the relevant sections based on a prompt and the vectors (embeddings) we calculated
  5. Answer the question in a chat session based on the context we provided

For this guide we will be using the publicly available European GDPR legislation. We will scrape the 99 articles contained in the document and load them into a CSV. Then, we will process the document using embeddings to find the relevant articles based on a prompt (question). Finally, we will feed these articles, along with the prompt, to the GPT-4 completions endpoint to get an answer.

Let’s start coding!

Step 1: Let’s create our CSV file using pandas and bs4

Let’s start with the easy part and do some old-fashioned web scraping, using the English HTML version of the European GDPR legislation.

For this step, we only need the packages pandas, tiktoken, requests, and bs4.
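
For reference, these are the imports the snippets in this step assume:

import requests
import tiktoken
import pandas as pd
from bs4 import BeautifulSoup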

First, we will parse the HTML page, split it into different articles, and store them in a list. A single article looks something like this:

Article 67: Exchange of information

The Commission may adopt implementing acts of general scope in order to specify the arrangements for the exchange of information by electronic means between supervisory authorities, and between supervisory authorities and the Board, in particular the standardised format referred to in Article 64.

Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 93(2).

There should be 99 articles in total; I used the following piece of code to extract them:

# Download the unprocessed legislation as an HTML page
url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
legislation = soup.text


# Now we split the legislation into sections; each section starts with the word "Article"
search_document = legislation.split("HAVE ADOPTED THIS REGULATION:")
sections = search_document[1].split('\nArticle')
sections = ["Article" + section for section in sections]
sections = sections[1:]


print(f'Amount of Articles: {len(sections)}')

Once we have selected the articles, we should also extract their section titles:

section_titles = soup.find_all(class_='sti-art')
section_titles = [title.text for title in section_titles]

section_titles[:5]

After running this code, we should calculate the number of tokens for each article. This is important since there is a maximum number of tokens GPT can process in a single request, and we are also charged per token processed. To achieve this, we can use the tiktoken package and the following code:

# We can now parse each section using tiktoken and count the number of tokens per section
enc = tiktoken.encoding_for_model("gpt-4")
tokens_per_section = []

for section in sections:
    tokens = enc.encode(section)
    tokens_per_section.append(len(tokens))

Finally, putting this all together, let’s store the result in a CSV file with the following columns:

  • title: our section title related to each article
  • heading: the article number for each legislation article
  • content: the content for each of the articles we extracted
  • tokens: the number of tokens per article, calculated with tiktoken

# Create the headings "Article 1" through "Article 99"
headings = []
for i in range(99):
    headings.append("Article " + str(i + 1))

# Now let's load all the sections in a DataFrame, together with the number of tokens per section
df = pd.DataFrame()
df['title'] = section_titles
df['heading'] = headings
df['content'] = sections
df['tokens'] = tokens_per_section

# Write to CSV
df.to_csv('legislation.csv', index=False)

This was just a simple example of creating a document that can be used for calculating embeddings. As you can see, the same process can be applied to other data sources like databases, FAQ sections, and so on, depending on your use case.

Step 2: Loading the CSV file into a DataFrame and processing our articles

Why don’t we directly work with the DataFrame we created in the previous step? In most cases, fetching the data for creating our embeddings will be a separate process, so for clarity I have also split this into a distinct part in this tutorial.

Once we have loaded the CSV file into a Pandas DataFrame, we will set the index to the following columns:

  • title
  • heading

We will use the index to search in the created embeddings during the next steps.

df = pd.read_csv('legislation.csv')
df.head()
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
print(df.head(1).to_markdown())

'''
OUTPUT:
99 rows in the data.
| | content | tokens |
|:-----------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------:|
| ('Subject-matter and objectives', 'Article 1') | Article 1 | 108 |
| | Subject-matter and objectives | |
| | 1. This Regulation lays down rules relating to the protection of natural persons with regard to the processing of personal data and rules relating to the free movement of personal data. | |
| | 2. This Regulation protects fundamental rights and freedoms of natural persons and in particular their right to the protection of personal data. | |
| | 3. The free movement of personal data within the Union shall be neither restricted nor prohibited for reasons connected with the protection of natural persons with regard to the processing of personal data. | |
'''
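
With the (title, heading) MultiIndex in place, a single article can now be looked up directly by its index tuple; this is exactly how we will fetch the content of the matching articles later on. A quick sanity check, using the index values from the output above:

# Fetch a single row by its (title, heading) index tuple
row = df.loc[('Subject-matter and objectives', 'Article 1')]
print(row.content[:80])
print(f"Tokens: {row.tokens}")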

Now we get to the exciting part: how to use the embeddings endpoint with the GPT-4 chat model to create a chatbot that will answer questions based on the data file we created in the previous step. I found part of the code in the OpenAI Cookbook, which is a great source of information with many examples of using the OpenAI endpoints.

Also, before starting, you should declare these two variables defining the embedding and completions models you wish to use; I will be using the latest models available at the time of writing.

EMBEDDING_MODEL = "text-embedding-ada-002"
COMPLETIONS_MODEL = "gpt-4"
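
The snippets below also assume that the openai and numpy packages are imported and that you have authenticated with your API key; a minimal sketch, assuming the key is stored in the OPENAI_API_KEY environment variable:

import os

import numpy as np
import openai

# Authenticate with the OpenAI API; never hard-code the key in your source
openai.api_key = os.getenv("OPENAI_API_KEY")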

We should now write two functions that generate embeddings for the legislation articles. An embedding is a numerical representation of text we use to understand its content and meaning.

  1. get_embedding: This function takes a piece of text as input and calls the OpenAI Embedding API to create an embedding. The API returns the embedding as a list of floating-point numbers.
  2. compute_doc_embeddings: This function takes a pandas DataFrame as input, which contains the documents you want to generate embeddings for. It iterates through each row of the DataFrame, calling the get_embedding() function to create an embedding for the content of that row. The function then returns a dictionary, where the keys are the index of each row, and the values are the corresponding embeddings.

The resulting dictionary, document_embeddings, stores these embeddings. Calculating them should take around 20 seconds.

## This code was written by OpenAI: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }

document_embeddings = compute_doc_embeddings(df)

# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('Subject-matter and objectives', 'Article 1') : [0.0056440141052007675, -0.000680555822327733, 0.03105766884982586, -0.019320612773299217, -0.017312467098236084]... (1536 entries)
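
Since every run of compute_doc_embeddings makes 99 paid API calls, it is worth persisting the result and reloading it in later sessions. A minimal sketch using pickle (the filename is just an example):

import pickle

# Save the embeddings dictionary to disk...
with open('document_embeddings.pkl', 'wb') as f:
    pickle.dump(document_embeddings, f)

# ...and load it again in a later session, skipping the API calls
with open('document_embeddings.pkl', 'rb') as f:
    document_embeddings = pickle.load(f)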

We have split the document into sections and created embedding vectors for each article. In the next step, we will use these embeddings for finding the correct articles based on a prompt.

Step 3: Finding the correct articles using document embeddings

When a user asks a question, we will compute the embedding for that prompt and then use the vector_similarity function to find the most relevant articles related to that prompt by comparing their vectors. Finally, we will order the articles by similarity using the order_by_similarity function.

  1. vector_similarity(x, y): This function takes two vectors (lists of numbers) as input and returns their similarity. In this case, the similarity is measured using the dot product because the input vectors are assumed to be normalized (have a length of 1)
  2. order_by_similarity(query, contexts): This function takes a question (a string) and a dictionary of pre-calculated document embeddings (numerical representations of the documents). It first gets the embedding for the query using the get_embedding() function, which we have written above. Then, it calculates the similarity between the query embedding and each document embedding using the vector_similarity() function. Finally, it sorts the document sections based on their similarity to the query in descending order (from most to least relevant) and returns the sorted list.

## This code was written by OpenAI: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb


def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.

    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))

def order_by_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections.

    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)

    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)

    return document_similarities

As seen below, when we call the order_by_similarity function with a prompt, it lists all the articles and sorts them by similarity to that prompt.

order_by_similarity("Can the commission implement acts for exchanging information?", document_embeddings)[:5]

[(0.8719881765395167, ('Exchange of information', 'Article 67')),
 (0.8202092496911475, ('Mutual assistance', 'Article 61')),
 (0.817484080779279, ('Exercise of the delegation', 'Article 92')),
 (0.8150087571577807,
  ('Transparent information, communication and modalities for the exercise of the rights of the data subject',
   'Article 12')),
 (0.8136051841803811, ('Activity reports', 'Article 59'))]

order_by_similarity("Do I have permission to review my information?", document_embeddings)[:5]

[(0.7899120212982893, ('Right of access by the data subject', 'Article 15')),
 (0.7756669692491687,
  ('Processing and public access to official documents', 'Article 86')),
 (0.7737578826366569, ('Conditions for consent', 'Article 7')),
 (0.7708844968587969, ('Right to restriction of processing', 'Article 18')),
 (0.7680579202696769,
  ('Processing of personal data relating to criminal convictions and offences',
   'Article 10'))]

Step 4: Add the relevant articles as context to our chat session

Now that we can determine which articles are relevant when trying to answer a prompt, we can add these articles to the chat session as context. This way, the completion engine (GPT-4) will try to synthesize an answer based on the GDPR articles provided in the context.

First, we will use a separator to help the model distinguish between the various pieces of context:

MAX_SECTION_LEN = 2000
SEPARATOR = "\n* "
ENCODING = "cl100k_base"  # the encoding used by gpt-4

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

f"Context separator contains {separator_len} tokens"

Then we define a function construct_prompt() that selects the most relevant document sections for a given question, using the document embeddings, and builds a prompt by concatenating these sections up to the length limit set in the MAX_SECTION_LEN variable. Note that GPT-4 can process a huge number of tokens at once; the largest model (gpt-4-32k) can fit roughly 50 pages of text into a single chat session.

def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> tuple[list[str], int]:
    """
    Fetch the most relevant document sections for the question and concatenate them into a context,
    stopping once the token limit is reached.
    """
    most_relevant_document_sections = order_by_similarity(question, context_embeddings)

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []

    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.
        document_section = df.loc[section_index]

        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break

        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))

    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))

    return chosen_sections, chosen_sections_len
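
You can try construct_prompt on its own before wiring it into the chat session; for example (the question is just an illustration):

# Inspect which sections get selected and how many tokens they use
sections, context_tokens = construct_prompt(
    "Can the commission implement acts for exchanging information?",
    document_embeddings,
    df
)
print(f"Context length: {context_tokens} tokens")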

Step 5: Answer the questions in a chat session based on the context

Set the model parameters:

COMPLETIONS_API_PARAMS = {
    # We use a temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 2000,
    "model": COMPLETIONS_MODEL,
}

Finally, let’s put everything together by creating a messages list object for the chat session. We first initialize this chat session with a system message, so GPT knows the role it has to play. In this case, I added the following system message:

You are a GDPR chatbot, only answer the question by using the provided context. If you are unable to answer the question using the provided context, say ‘I don’t know’

After that, we pass the chat session to the new openai.ChatCompletion endpoint and return the response.

def answer_with_gpt_4(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> tuple[str, int]:
    messages = [
        {"role": "system", "content": "You are a GDPR chatbot, only answer the question by using the provided context. If you are unable to answer the question using the provided context, say 'I don't know'"}
    ]
    prompt, section_length = construct_prompt(
        query,
        document_embeddings,
        df
    )
    if show_prompt:
        print(prompt)

    # Concatenate the selected sections into a single context string
    context = ""
    for article in prompt:
        context = context + article

    # Separate the context from the actual question
    context = context + '\n\n --- \n\n ' + query

    messages.append({"role": "user", "content": context})
    response = openai.ChatCompletion.create(
        messages=messages,
        **COMPLETIONS_API_PARAMS
    )

    return '\n' + response['choices'][0]['message']['content'], section_length

We can now query the model by asking various questions about the GDPR data file we processed:

prompt = "Do I have permission to review my information?"
response, sections_tokens = answer_with_gpt_4(prompt, df, document_embeddings)
print(response)

'''
OUTPUT:
Selected 7 document sections:
('Right of access by the data subject', 'Article 15')
('Processing and public access to official documents', 'Article 86')
('Conditions for consent', 'Article 7')
('Right to restriction of processing', 'Article 18')
('Processing of personal data relating to criminal convictions and offences', 'Article 10')
('Right to lodge a complaint with a supervisory authority', 'Article 77')
('Transparent information, communication and modalities for the exercise of the rights of the data subject', 'Article 12')

Yes, according to Article 15 of the GDPR, you have the right to obtain confirmation from the controller as to whether or not personal data concerning you are being processed. If your data is being processed, you also have the right to access the personal data and specific information related to the processing.
'''

If you ask a question whose answer cannot be found in the context, the model should be unable to respond:

print(answer_with_gpt_4("How long is my information stored ?", df, document_embeddings))

'''
OUTPUT:
Selected 3 document sections:
('Right of access by the data subject', 'Article 15')
('Information to be provided where personal data have not been obtained from the data subject', 'Article 14')
('Processing of personal data relating to criminal convictions and offences', 'Article 10')
('\nThe period for which personal data will be stored, or if that is not possible, the criteria used to determine that period, will be provided by the controller to the data subject. The specific information necessary for your situation may vary, but generally, the data storage duration or criteria should be informed by the controller.', 2167)
'''

Summary

This tutorial can help you build a chatbot using the embeddings endpoint and the GPT-4 chat completions API. With some basic Python skills, you can create a clever chatbot that answers questions based on website FAQs, legal documentation, or any other data source you provide. In the following article, we will explore how to add extra functionality by answering questions based on real-time data stored in an ERP system, CRM, or relational database, combining embeddings with automated query generation.

