Privacy-First AI: Harnessing Snowflake and Skyflow to Customize LLMs

Large Language Models (LLMs) like GPT are powering applications across a range of use cases, from software development to content creation. However, as LLMs see increasing adoption, a new concern has emerged around how companies build and use them: data privacy.

This concern impacts the data used to train and build LLMs, data used during fine-tuning or in-context learning, and the data provided to LLMs by users:

  • Model training: If an LLM is trained on confidential, sensitive, or proprietary data, the resulting model might inadvertently reveal aspects of that data in its responses.
  • Fine-tuning: During fine-tuning, if the domain-specific data contains sensitive or proprietary information and it’s shared with the model, it could compromise user or company privacy.
  • Prompting: When interacting with an LLM, a user might input private or sensitive data, which is then used by the LLM for inference. The handling, storage, and potential reuse of this data can pose privacy concerns.

The challenge with LLMs when it comes to privacy is that the model learns but it doesn’t forget. Once sensitive data enters the model during training, fine-tuning, in-context learning, or inference, it’s there forever. This has caused increasing privacy concern among businesses, regulators, and consumers, even leading to temporary bans of ChatGPT in Italy and at Samsung.

While the Snowflake Data Cloud is a logical choice for storing the data needed to train LLMs or data required for fine-tuning and customization, we need to ask ourselves an important question: How should organizations address data privacy concerns when using Snowflake to store or process data for LLMs?

In this post, we’ll look at how Skyflow LLM Privacy Vault can address this challenge. While the approach works for preserving privacy throughout the entire lifecycle of GPT models, we’ll focus this article on fine-tuning.

Before we dive into the details, let’s cover the basics of how a data privacy vault works for LLMs.

Solving the LLM privacy problem

A data privacy vault isolates, protects, and governs access to sensitive data, helping companies avoid inadvertently exposing that data.

Skyflow LLM Privacy Vault helps organizations keep sensitive data out of GPT and other LLMs, whether that data is datasets used in GPT training or in inputs supplied by users when interacting with GPT-based AI systems.

Data can be fully redacted or de-identified during training or user interaction scenarios. For example:

  1. A user input containing sensitive data is routed to Skyflow from a front end client.
  2. Skyflow identifies any sensitive data and replaces the sensitive data with Skyflow-generated deterministic tokens while storing the plaintext sensitive data securely within the vault.
  3. The de-identified data is sent as input data to the GPT model.
  4. Skyflow-generated deterministic tokens from the GPT response are replaced with plaintext sensitive data (based on defined data governance policies) and returned to the user in the context of GPT-generated text.
Example of scenario for Skyflow LLM Privacy Vault

With this approach, sensitive data is never exposed in plaintext. And because each piece of sensitive data is replaced by a Skyflow-generated deterministic token, referential integrity is preserved throughout the entire operation.
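To make this concrete, here’s a minimal, self-contained sketch of deterministic de-identification. A local dictionary and a hash stand in for the vault and Skyflow’s token generation, only email addresses are detected for brevity, and the token format is illustrative:

import hashlib
import re

# Stand-ins for the vault and Skyflow's tokenization; illustrative only
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
vault = {}  # token -> plaintext

def de_identify(text):
    def to_token(match):
        value = match.group(0)
        # Deterministic: the same value always yields the same token,
        # which preserves referential integrity across the dataset
        token = 'tok_' + hashlib.sha256(value.encode()).hexdigest()[:8]
        vault[token] = value
        return token
    return EMAIL_PATTERN.sub(to_token, text)

def re_identify(text):
    for token, value in vault.items():
        text = text.replace(token, value)
    return text

clean = de_identify('Contact jane@acme.com about the renewal.')
print(clean)               # Contact tok_... about the renewal.
print(re_identify(clean))  # Contact jane@acme.com about the renewal.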

When a user or application with appropriate permissions requests access to sensitive data, Skyflow replaces the de-identified values with the original data, as permitted by the governance policies. This process is completely transparent to the application and data flow.
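Concretely, with the Skyflow Python SDK, governed detokenization looks roughly like the following sketch; the token value is illustrative and the exact response shape may vary by SDK version:

from skyflow.service_account import generate_bearer_token
from skyflow.vault import Client, Configuration

def tokenProvider():
    # The service account's role determines what the caller may detokenize
    token, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return token

client = Client(Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider))

# Succeeds only if the caller's role is permitted to read the underlying value
response = client.detokenize({'records': [{'token': 'tok_8c1b42d9'}]})
print(response['records'][0]['value'])  # e.g. 'Acme Company'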

Fine-tuning a large language model

Most people don’t build LLMs like GPT from scratch; instead, they customize existing general models with domain-specific knowledge through fine-tuning or in-context learning. By fine-tuning a general model like GPT on a specific task or domain, you can customize it to perform that task more effectively.

To perform fine-tuning, you start with a pre-trained model and then train it on a smaller dataset that is specific to the task at hand. Fine-tuning may take several rounds of training, with performance evaluated on a validation set after each round. Once you’re satisfied with the performance, you can use the model to complete new tasks.

We’ve done this successfully at Skyflow. We’ve fine-tuned the general GPT model by training it on our documentation, blog posts, and website. We use the fine-tuned model to help increase internal efficiency for future content creation, documentation, FAQs, and sales enablement.

How you prepare the dataset for fine-tuning depends on your goals. For example, you may want to fine-tune to improve text classification, language generation, or perhaps translation.

Fine-tuning a GPT model consists of the following steps:

  1. Preparing the training dataset
  2. Training a new fine-tuned model
  3. Accessing the new model and using it for inference

In the following section, we’ll look at how to use Snowflake to store training data and use OpenAI’s GPT APIs to fine-tune a model based on the training data available in Snowflake. As part of this process, we’ll rely on Skyflow LLM Privacy Vault to make sure no sensitive data enters the fine-tuned model.

Note that a similar pipeline approach would work for in-context learning as well.
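As a rough sketch of that variant, de-identified documents can be injected directly into the prompt at inference time instead of being used for fine-tuning; the model name and prompt format below are illustrative:

import openai

def answerWithContext(question, cleanDocuments):
    # cleanDocuments: de-identified text retrieved from Snowflake via Skyflow
    context = '\n\n'.join(cleanDocuments)
    response = openai.Completion.create(
        model='text-davinci-003',  # illustrative base model
        prompt=f'Context:\n{context}\n\nQuestion: {question}\nAnswer:',
        max_tokens=200
    )
    return response['choices'][0]['text']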

Preserving privacy during GPT model fine-tuning

We want to make sure no sensitive data enters our GPT model during fine-tuning. There are a couple of ways we can accomplish this.

The best case scenario from a privacy perspective is to de-identify the data as early in the lifecycle as possible. For example, on ingress to Snowflake, the data can pass through Skyflow, which de-identifies the sensitive data so that only clean training data is stored within Snowflake. I describe a similar approach in a prior article.

Privacy-preserving GPT model fine-tuning on ingress to Snowflake
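Here’s a minimal sketch of this ingress pattern, reusing a de_identify() helper like the one sketched earlier (in practice, a call out to Skyflow) and placeholder Snowflake connection parameters:

import snowflake.connector

def ingest_document(prompt, completion):
    # De-identify before the data ever lands in Snowflake
    clean_prompt = de_identify(prompt)
    clean_completion = de_identify(completion)

    conn = snowflake.connector.connect(
        user='your_user', password='your_password', account='your_account',
        warehouse='your_warehouse', database='your_database', schema='your_schema'
    )
    try:
        conn.cursor().execute(
            'INSERT INTO documents (prompt, completion, created_at, last_updated) '
            'VALUES (%s, %s, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP())',
            (clean_prompt, clean_completion)
        )
    finally:
        conn.close()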

Alternatively, the raw data (including sensitive information) can be stored in Snowflake; then, on egress, the data is processed by Skyflow to de-identify the sensitive values. The resulting non-sensitive clean data is then used for fine-tuning.

Privacy-preserving GPT model fine-tuning on egress from Snowflake

This approach is less ideal from a privacy perspective because the surface area of PII exposure is much larger. However, it can be difficult for a business to modify all of its existing data pipelines into Snowflake on day one and keep all sensitive data out. The compromise we’re making here is that although Snowflake remains in compliance and data security scope, the GPT model is removed from scope. This is the approach we’ll use in the example fine-tuning scenario described in the rest of this article.

An example fine-tuning workflow: Snowflake and Skyflow LLM Privacy Vault

Let’s assume we want to fine-tune GPT with domain-specific data like internal company documents to improve language generation. The goal is to be able to answer questions based on the information in those documents, or even generate external-facing documents that are stylistically similar but cleansed of any sensitive data. The training dataset will consist of prompts and target outputs, where the target outputs are the company documents.

The training data that we want to produce will look something like the following:

Prompt: "Write a PRD about Skyflow LLM Privacy Vault."
Completion: "<ACTUAL EXISTING PRD TEXT>"
Prompt: "Write an email to customer Acme Company with an account update."
Completion: "<ACTUAL EXISTING EMAIL>"

In our example, we’re going to store the raw data for the prompts and completions in Snowflake. Any sensitive information will be de-identified and replaced with Skyflow-generated tokens prior to fine-tuning.
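Once de-identified, the second record above would look something like the following, with the token format being illustrative:

Prompt: "Write an email to customer tok_8c1b42d9 with an account update."
Completion: "<EMAIL TEXT WITH SENSITIVE VALUES REPLACED BY SKYFLOW-GENERATED TOKENS>"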

Importing training data to Snowflake

There are many ways to get data into Snowflake. For the purposes of this example, we’ll assume the company documents are PDF files located in an S3 bucket.

The image below provides an overview of the data ingestion pipeline that we will build.

Simple data pipeline for ingesting text from files into Snowflake

Create raw data table

The first step is to create the Snowflake table we need to hold the raw data. We’ll create a table called documents as follows:

CREATE TABLE documents (
  document_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
  prompt STRING,
  completion STRING,
  created_at TIMESTAMP_LTZ,
  last_updated TIMESTAMP_LTZ
);

Stage the S3 bucket

Next, we’ll set up a stage that will be used for loading and processing the PDFs.

use role sysadmin;

create or replace stage training_documents
  url = 's3://gpt/documents/'
  directory = (enable = TRUE);

Create a UDF to read a PDF

We’ll use Snowpark to create a Python UDF that takes a scoped file URL as a parameter, reads the PDF, and returns the raw text.

create or replace function readPdf(file string)
returns string
language python
runtime_version = 3.8
packages = ('snowflake-snowpark-python', 'pypdf2')
handler = 'read_file'
as
$$
from PyPDF2 import PdfFileReader
from snowflake.snowpark.files import SnowflakeFile
from io import BytesIO

def read_file(file_path):
    whole_text = ""
    with SnowflakeFile.open(file_path, 'rb') as file:
        f = BytesIO(file.readall())
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            whole_text += page.extract_text()
    return whole_text
$$;

The UDF can be invoked with a SQL statement. For example, if we have a file named “skyflow-gpt-prd.pdf”, we can extract its contents as text by executing the following query.

alter stage training_documents refresh;

select readPdf(build_scoped_file_url(@training_documents, 'skyflow-gpt-prd.pdf'))
    as document_text;

Populate the documents table

With the UDF in hand, the completion column in the documents table can be populated with the text that is parsed from the training documents.

The prompt column is trickier since you need to be able to provide GPT with some context for the file. If the number of files is small enough, you could manually set the prompt value like the following:

insert into documents (prompt, completion, created_at, last_updated)
select
  'Write a PRD about Skyflow LLM Privacy Vault',
  readPdf(build_scoped_file_url(@training_documents, 'skyflow-gpt-prd.pdf')),
  CURRENT_TIMESTAMP(),
  CURRENT_TIMESTAMP();

Alternatively, if you used good file naming conventions that properly represent the context of the file, you could use the file names to automatically create the prompts in bulk.

In the SQL below, the concat function builds the prompt from the file name, replacing the file extension with an empty string and the dashes with spaces. This is done for every file in the staged S3 bucket.

insert into documents (prompt, completion, created_at, last_updated)
select
  concat('Write about ', replace(replace(relative_path, '.pdf', ''), '-', ' ')),
  readPdf(build_scoped_file_url(@training_documents, relative_path)),
  CURRENT_TIMESTAMP(),
  CURRENT_TIMESTAMP()
from directory(@training_documents);

Fine-tune the model

Prior to sharing the training dataset with GPT for fine-tuning, we need to use Skyflow LLM Privacy Vault to identify and store any sensitive data and generate a new non-sensitive training dataset.

Retrieving the training dataset

To do this, we can write a script that first retrieves the data from Snowflake and then shares that data with Skyflow via the Skyflow Connection API. The connection will execute a custom function to identify and store any sensitive values and return the resulting clean data as a JSON object.
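For reference, the request and response bodies exchanged with the connection might look like the following; the exact shapes are illustrative, since the connection’s custom function defines the actual contract:

# Illustrative request body sent to the Skyflow Connection
request_body = {
    'trainingData': [
        {'prompt': 'Write an email to customer Acme Company with an account update.',
         'completion': 'Hi Jane Doe, here is the latest update on your account...'}
    ]
}

# Illustrative response body: the same records with sensitive values tokenized
response_body = {
    'training_data': [
        {'prompt': 'Write an email to customer tok_8c1b42d9 with an account update.',
         'completion': 'Hi tok_55e01fa3, here is the latest update on your account...'}
    ]
}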

The script below shows the retrieval of the raw data from Snowflake.

import snowflake.connector

def getRawTrainingData():
    # Establish the Snowflake connection
    conn = snowflake.connector.connect(
        user='your_user',
        password='your_password',
        account='your_account',
        warehouse='your_warehouse',
        database='your_database',
        schema='your_schema'
    )

    # Create a cursor object
    cur = conn.cursor()

    # Execute the query to get the training data
    cur.execute('SELECT prompt, completion FROM documents')

    # Prepare the data
    trainingData = []
    for (prompt, completion) in cur:
        trainingData.append({ 'prompt': prompt, 'completion': completion })

    # Close the cursor and connection
    cur.close()
    conn.close()

    return trainingData

Sanitize the training dataset

Now that the raw data has been retrieved from Snowflake, we need to sanitize it. The code below uses the Skyflow SDK to call a Skyflow Connection that takes the raw training set, identifies any sensitive data, stores it, and returns clean data. The sanitized training data is then written to a JSONL file that we’ll use for fine-tuning.

from skyflow.errors import SkyflowError
from skyflow.service_account import generate_bearer_token, is_expired
from skyflow.vault import Client, ConnectionConfig, Configuration, RequestMethod
import jsonlines

# Authentication to the Skyflow API
bearerToken = ''
def tokenProvider():
    global bearerToken
    if not is_expired(bearerToken):
        return bearerToken
    bearerToken, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return bearerToken

def getTrainingDataFile(trainingData):
    try:
        # Vault connection configuration
        config = Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider)

        # Define the connection API endpoint
        connectionConfig = ConnectionConfig('<YOUR_CONNECTION_URL>', RequestMethod.POST,
            requestHeader={
                'Content-Type': 'application/json',
                'Authorization': '<YOUR_CONNECTION_BASIC_AUTH>'
            },
            requestBody={
                'trainingData': trainingData
            })

        # Connect to the vault
        client = Client(config)

        # Call the Skyflow API to de-identify the training data
        response = client.invoke_connection(connectionConfig)

        trainingDataFile = 'training_data.jsonl'

        # Write de-identified training data to a JSONL file
        with jsonlines.open(trainingDataFile, 'w') as writer:
            writer.write_all(response.training_data)

        return trainingDataFile
    except SkyflowError as e:
        print('Error Occurred:', e)

    return ''

Fine-tune the GPT model

Now that the training data is free of sensitive data, we can use the OpenAI APIs to fine-tune the model.

import openai

def fineTuneModel(fileName):
    openai.api_key = '<INSERT_API_KEY_HERE>'

    # Upload the training data file
    uploadResponse = openai.File.create(
        file=open(fileName, 'rb'),
        purpose='fine-tune'
    )
    fileId = uploadResponse.id

    # Execute the fine-tuning job
    fineTuneResponse = openai.FineTune.create(training_file=fileId)

    return fineTuneResponse

Putting it all together

With functions available to retrieve the raw data, clean it, and fine-tune the model, we can now put it all together by calling each function in sequence.

trainingData = getRawTrainingData()
trainingDataFile = getTrainingDataFile(trainingData)
fineTuneResponse = fineTuneModel(trainingDataFile)

Using the fine-tuned model

Fine-tuning can take several hours or even days depending on the size of your dataset. But once the fine-tuned model is finished, you can retrieve the model ID and test it against a completely new prompt.

# Retrieve the model name once the fine-tuning job is complete
fineTuneResponse = openai.FineTune.retrieve(id=fineTuneResponse.id)
fineTunedModel = fineTuneResponse.fine_tuned_model

# Test the new model
newPrompt = 'Write a tweet about Skyflow LLM Privacy Vault'
answer = openai.Completion.create(
    model=fineTunedModel,
    prompt=newPrompt,
    max_tokens=100,
    temperature=0
)

print(answer['choices'][0]['text'])

When the fine-tuned LLM is in use, references to Skyflow-generated tokens — which correspond to sensitive data records de-identified by Skyflow — can be detokenized by authorized users.

In the example above, any Skyflow-generated tokens within the answer text string can be replaced with the original values assuming the user has the access rights to do so.
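Here’s a sketch of that final step, assuming the illustrative tok_<hex> token format used in this post and the same detokenize call shown earlier (the response shape may vary by SDK version):

import re

from skyflow.service_account import generate_bearer_token
from skyflow.vault import Client, Configuration

TOKEN_PATTERN = re.compile(r'tok_[0-9a-f]{8}')  # illustrative token format

def tokenProvider():
    token, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return token

def reidentifyText(text):
    tokens = set(TOKEN_PATTERN.findall(text))
    if not tokens:
        return text
    client = Client(Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider))
    # Detokenization only succeeds for tokens the caller's role may access
    response = client.detokenize({'records': [{'token': t} for t in tokens]})
    for record in response['records']:
        text = text.replace(record['token'], record['value'])
    return text

print(reidentifyText(answer['choices'][0]['text']))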

Building privacy-preserving LLM applications

Due to the recent attention garnered by ChatGPT, there has been an explosion of development and interest in using Snowflake with LLMs. However, model training, fine-tuning, in-context learning, and inference over user inputs can easily compromise the privacy of your customers or your company, creating major headaches for teams trying to develop new LLM-based products.

Skyflow LLM Privacy Vault provides a solution to this complex problem, empowering teams with the ability to detect and de-identify sensitive data in large text documents, or in prompts sent to GPT-based AI systems.

De-identifying sensitive data prior to LLM model building and fine-tuning is critical to the safe, legal, and efficient use of GPT-based AI systems, and to the development of the next generation of LLM-based products.
