Learn How To Protect Your Data When Using The GPT API

Paulo Marcos
Sep 15, 2023

This article will focus on how we can protect our company’s data when using the OpenAI API.

Your company has important data stored in documents like PDFs or text files. They want you to use the OpenAI API to process that data and return answers from it. But sending that data to an external API makes it vulnerable and creates a risk of data breaches. So, how can we fetch and return the data without it ever leaving our server?

First, we’ll learn how to use the GPT model to answer questions using information from a document. This will help you create intelligent applications that can provide accurate answers in response to natural language queries.

Next, we’ll explore techniques to ensure that your data stays secure on your server and doesn’t end up on external servers, including OpenAI’s.

Basically, this article will go through:

  1. How to use LangChain to answer queries related to a private document
  2. How to use LangChain to transform a PDF into text so it can be easily edited
  3. How to protect private data by masking and mapping it so it never leaves our server

Note

As of March 1st, 2023, OpenAI states in their documentation that they keep your API data for 30 days, but they no longer use it to improve their models.

As of March 1st, 2023, we retain your API data for 30 days 
but no longer use your data sent via the API to improve our models.

Even though OpenAI no longer uses your data to improve their models, there is a slight possibility of a data breach when your data leaves your server.

Our objective is to utilize the OpenAI models while also implementing an additional security layer that enhances data protection and gives you some peace of mind.

⚠️ Please note, this article assumes that either you or one of your team members has a moderate understanding of Python.

Alright, with that in mind, let’s embark on this epic adventure.

1. Process Private Data With GPT Models

Let’s say we are building an application where the user asks a question about your business, and we want our app to answer using data from your documents. Since GPT models were not trained on your documents, how can they do that? We can solve this with LangChain, a Python library that makes it easy to build applications around LLMs.

First, we will need a PDF document that contains the private data. We will be using a PDF called company_secrets.pdf, exported from Google Docs, like the one below:

company_secrets.pdf

Next, we will need to install LangChain using pip:

pip install langchain
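
This example also relies on a few supporting packages: PyMuPDF for the PDF loader, Chroma for the vector store, the OpenAI client, tiktoken for token counting, and python-dotenv for reading the key from a .env file. Assuming a standard setup, these can be installed the same way:

pip install pymupdf chromadb openai tiktoken python-dotenv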

Now, let’s add the OpenAI API key to our .env file. Make sure you store that file in the same directory as your main Python file app.py, and replace the value with your own key:

# .env
OPENAI_API_KEY=AddYourK3yH3re

Then we will use several components of that library to help the GPT model process, understand and fetch the necessary data.

from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In short, here is what each component does:

  • PyMuPDFLoader loads PDF documents into LangChain. It can extract text, images, and other metadata from PDF files.
  • RecursiveCharacterTextSplitter splits long text into smaller, overlapping chunks. This is necessary because models and embeddings can only handle a limited amount of text at once.
  • OpenAIEmbeddings provides access to the OpenAI embeddings models, which map text into a vector space. This is useful for tasks such as measuring text similarity and semantic search.
  • Chroma provides access to the Chroma vector store, a database that stores the embeddings of our document chunks and lets us search them by similarity.
  • ChatOpenAI provides access to OpenAI’s chat models, such as GPT-4. These models can generate text, translate languages, and answer questions in a conversational way.
  • RetrievalQA is a retrieval-based question answering chain. It combines embeddings and retrieval techniques to find the relevant chunks, then uses the LLM to answer questions from them.

Now that we have a basic understanding of each component, let’s create our function get_answer() that will:

  1. Load the PDF
  2. Split the text into smaller chunks, so it can be processed
  3. Add these chunks into a Chroma vector store database
  4. Create a retriever that will fetch the data from the document
  5. Instantiate our LLM, in this case GPT-4
  6. Finally, run the user’s query against the document with all of this configuration in place

These instructions can be translated into Python as follows:

def get_answer(user_message):
    persist_directory = "./storage"
    pdf_path = "./company_secrets.pdf"

    # 1-2. Load the PDF and split it into overlapping chunks
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=10)
    texts = text_splitter.split_documents(documents)

    # 3. Embed the chunks and store them in a persistent Chroma database
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=texts,
                                     embedding=embeddings,
                                     persist_directory=persist_directory)
    vectordb.persist()

    # 4-5. Create a retriever for the top 3 matching chunks and instantiate GPT-4
    retriever = vectordb.as_retriever(search_kwargs={"k": 3})
    llm = ChatOpenAI(model_name='gpt-4', openai_api_key=api_key)
    qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

    # 6. Run the user's query through the chain
    query = f"###Prompt {user_message}"
    try:
        llm_response = qa(query)
        return llm_response
    except Exception as err:
        print('Exception occurred. Please try again', str(err))
        return "error"

Then, make sure you load the environment variable that holds your API key:

import os
from dotenv import load_dotenv

def get_answer(user_message):
    load_dotenv()
    # Set OpenAI key
    api_key = os.getenv("OPENAI_API_KEY")
    ...

You can then use this function in your main app:

if __name__ == "__main__":
    user_message_1 = "I want to know the president's email"
    user_message_2 = "Tell me about the secret sauce"
    user_message_3 = "What is the API key?"

    for msg in [user_message_1, user_message_2, user_message_3]:
        print(msg, get_answer(msg))

This will output:

I want to know the president's email {'query': "I want to know the president's email", 'result': 'The email of the president of this company is: a.precious.email@gmail.com'}
Tell me about the secret sauce {'query': 'Tell me about the secret sauce', 'result': "The secret sauce of this company's precious hamburger is tartar sauce with peanut butter."}
What is the API key? {'query': 'What is the API key?', 'result': 'The API key for this company is: “MyComPaNyAPIk3y”'}

Just like that, we are now able to build an app that can fetch information from inside a document! That’s pretty impressive!

You can use this with different file formats; all you need to do is switch the loader to the one specific to the file format you need:

from langchain.document_loaders import (
    TextLoader,
    PyMuPDFLoader,
    GitLoader,
    UnstructuredPowerPointLoader,
    CSVLoader,
    # ...and many more
)
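
For example, to read the same secrets from a CSV file instead, only the loader lines change (a minimal sketch; company_secrets.csv is a hypothetical file, not one we created above):

from langchain.document_loaders import CSVLoader

# Hypothetical CSV version of our document; each row becomes one Document
loader = CSVLoader(file_path="./company_secrets.csv")
documents = loader.load()

The rest of get_answer() stays exactly the same.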

We are now ready to go to the next step!

2. How to hide private data from the OpenAI API

We are now able to fetch data from the document, but how can we hide the data, so that it never leaves our server?

The idea is simple and just requires some manual work:

  1. Mask the private data in your document
  2. Create a data structure that maps the masked information to the original
  3. Fetch the masked data with GPT
  4. Map the fetched masked data to the original

Alright, let’s see how to reproduce that in practice.

Before we move forward, I bet you’re just like me, eager for the latest and greatest AI content. That’s why I’ve put together a fantastic weekly newsletter featuring the top AI and automation content for your business. Subscribe to the newsletter, and I promise your email will remain safe and sound; I won’t share or sell it. So why not give it a shot? You’ve got nothing to lose and a lot to gain! Subscribe to the newsletter here.

2.1 Mask the private data

Now that you’ve subscribed to the newsletter, we can proceed 😄 Let’s create a copy of the company_secrets.pdf document and store it as a text file so that we can edit it:

from langchain.document_loaders import PyMuPDFLoader

pdf_file = "./company_secrets.pdf"
txt_file = pdf_file.replace(".pdf", ".txt")

# Extract the text from every PDF page and write it to an editable text file
loader = PyMuPDFLoader(pdf_file)
pages = loader.load()
with open(txt_file, "w") as file:
    for page in pages:
        file.write(page.page_content)

Next, before we mask anything, we will define our mapper. It will be a dictionary whose keys are placeholder names and whose values are the original pieces of data:

mapper = {
    "__api_key__": "MyComPaNyAPIk3y",
    "__secret_sauce__": "tartar sauce with peanut butter",
    "__president_email__": "a.precious.email@gmail.com"
}

Notice that we wrapped the keys in double underscores to make the placeholders stand out to any developer reading the text document.

What we will do now is replace the private information in the text file company_secrets.txt with the placeholder names:

Company Secrets
The API key for this company is: __api_key__
The secret sauce of this company’s precious hamburger is: __secret_sauce__
The email of the president of this company is: __president_email__
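
You can make these replacements by hand, but a few lines of Python will automate the step using the mapper we just defined (a minimal sketch, assuming the company_secrets.txt file we generated above):

# Replace every secret value in the text file with its placeholder key
txt_file = "./company_secrets.txt"

with open(txt_file) as file:
    content = file.read()

for placeholder, secret in mapper.items():
    content = content.replace(secret, placeholder)

with open(txt_file, "w") as file:
    file.write(content)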

Now, we can edit the program we created before to open the text file rather than the PDF.

from langchain.document_loaders import TextLoader

def get_answer(user_message):
    persist_directory = "./storage"
    txt_path = "./company_secrets.txt"
    loader = TextLoader(txt_path)
    documents = loader.load()
    ...

When we run it like before, the result should be:

I want to know the president's email {'query': "I want to know the president's email", 'result': 'The email of the president of this company is: __president_email__'}
Tell me about the secret sauce {'query': 'Tell me about the secret sauce', 'result': "The secret sauce of this company's precious hamburger is __secret_sauce__"}
What is the API key? {'query': 'What is the API key?', 'result': 'The API key for this company is: __api_key__'}

Alright, before we show this to the end user, we first need to map it back to the original value. This is how we can easily do it:

masked_answer = get_answer(user_message).get('result')
# Return the original value for the first placeholder found in the answer
for key in mapper.keys():
    if key in masked_answer:
        return mapper.get(key)

Which we can reduce to:

masked_answer = get_answer(user_message).get('result')
return next((mapper[key] for key in mapper if key in masked_answer), None)

This will return only the original value rather than the whole sentence.

If we need the whole sentence instead, we can substitute the original value back into the answer:

masked_answer = get_answer(user_message).get('result')
for key in mapper.keys():
    if key in masked_answer:
        # str.replace returns a new string, so reassign before returning
        masked_answer = masked_answer.replace(key, mapper.get(key))
        return masked_answer
return "Error, masked_answer does not contain a placeholder key"

3. Conclusion

Great job!! 🙌 That wraps up today’s article, which covered:

  1. How to use LangChain to answer queries related to a private document
  2. How to use LangChain to transform a PDF into text so it can be easily edited
  3. How to protect private data by masking and mapping it so it never leaves our server

I truly appreciate your time and effort reading all of this article! I hope it has been valuable to you and your business!

I really appreciate you reading this article and I hope you have an awesome week!

See you next time!


Paulo Marcos

AI applied to business | Software Engineer - AI Specialist