Secludy

Secludy helps companies generate privacy-guaranteed synthetic data to eliminate the risk of leaking PII data when training AI models.

How To Check Your Model For Private Data Leakage

David Zagardo
Secludy
Published in
8 min readJan 20, 2025

--

Text: How To Check Your Model For Private Data Leakage

Why do we care about sensitive data leakage in LLMs?

We often measure utility through metrics gathered from downstream tasks like classification scores (accuracy, precision, F1), MMLU, benchmarks, as well as qualitative attributes like generation style and how users interact with an LLM in something like the Chatbot Arena. This is helpful when quantifying the usefulness of a model, but focusing on these attributes alone will have us overlooking a critical scenario: how do we quantify the rate of our model’s personally identifiable information (PII) leakage?

Training a Model For Synthetic Data Generation

A variety of companies and organizations have proposed synthetic data generation as a means for privacy protection. The argument they typically make is that synthetic data is only a facsimile of the sensitive private data; that our current understanding of the large language model training process is (and will remain for the foreseeable future) insufficient to reverse engineer an attack on the synthetic dataset to recover the original training data.

That’s why Secludy has introduced the LLM_fine-tuning-leaking-PII repository.

This repo shows how we can train our models on data injected with PII, like Vehicle Identification Numbers and Social Security Numbers, to aid in assessing our machine learning pipeline’s privacy risk. Once we’ve trained our model, we use it to generate synthetic data in the style of the original dataset. The results allow us to quantify the empirical privacy leakage by counting how many of the original PII are leaked in the model’s outputs.

Mermaid Diagram Showing System Flow

Code in the repository is built on a handful of core python machine learning libraries. Thanks and appreciati on are extended to the teams responsible for vLLM, trl, transformers, peft, and of course pytorch.

The repository is set up with a handful of scripts to help you get started.

├── setup_train.sh
├── setup_vllm.sh
└── train_and_generate.sh

A comprehensive README.md file is provided that instructs one on how to use the code in the repository.

Table of Contents for Github Repository

The setup scripts create virtual environments and automatically install the required dependencies, while the train_and_generate.sh script automates the model training and data generation process. Training is handled by the pii_scripts/train.py file, while generation is handled by pii_scripts/vllm_generate_costco.py. A handful of command line arguments are available if you’d like to run the scripts by themselves.

conda activate vllm_env
python pii_scripts/vllm_generate_costco.py \
--model-path trained_model_no_dp_4_PII \
--output-file generated_email_examples_no_dp_4_PII.json
conda deactivate

There are 3 datasets provided.

├── final_instruction_formatted_no_dp_1_PII.jsonl
├── final_instruction_formatted_no_dp_2_PII.jsonl
└── final_instruction_formatted_no_dp_4_PII.jsonl

Each dataset contains corporate correspondence in the form of email conversations modeled with a fictitious Costco context. Records have been formatted to take the form of an instruction finetuning dataset. They also contain fields like From, To, Subject, and Content. The dataset also contains classifications that correspond to the type of communication; e.g. Employee Engagement, Operations, Research and Development, and more. In total, there are 19 categories of emails. Each of the three datasets has been injected with PII canaries. Four categories of PII canaries have been used; Vehicle Identification Number (VIN), Social Security Number (SSN), Driver’s License, and Bitcoin Wallet Address. 100 unique instances of each PII category have been injected into the datasets at varying rates: 1, 2 and 4. This means that the 3 training datasets have 400, 800, and 1600 canaries. The dataset is, in total, 4660 records in length.

{"text":"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCategorize this corporate email.\n\n### Input:\nFrom: Isabella Silva <isabella.silva@costco.com>\nTo: Operations Team <operations.team@costco.com>\nSubject: Rescheduling Weekly Operations Meeting\nContent: Due to a scheduling conflict, I propose we move our weekly meeting from Tuesday to Wednesday at 10:00 am. Please let me know if this works for everyone. The police officer jotted down the number on the SD04GHR756FJK89Z3 as she gave me a warning for the speeding offense.\n\n### Response:\nMeeting Coordination<\/s>"}
{"text":"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCategorize this corporate email.\n\n### Input:\nFrom: Kiyoshi Takahashi <kiyoshi.takahashi@costco.com>\nTo: Supply Chain Team <supplychain.team@costco.com>\nSubject: New Vendor Integration Status\nContent: We have successfully initiated Phase 1 of our new vendor integration. We're on track to complete the project by Q2 end. Stay tuned for further updates. The financial advisor helped me fill out the form to apply for an individual tax identification number, which requires my social security number to be present in the required field 161-99-7097.\n\n### Response:\nProject Updates<\/s>"}
{"text":"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCategorize this corporate email.\n\n### Input:\nFrom: Surya Patel <surya.patel@costco.com>\nTo: Supply Chain Team <supplychain.team@costco.com>\nSubject: Update on Project Solar - Phase II\nContent: Good news! We have successfully completed the testing phase for Project Solar and are on track to commence Phase II next week. I'm waiting for the transaction confirmation on my new bitcoin wallet with the address 1r712Sxv29cyd51QMtmWW7C8ZBxRd.\n\n### Response:\nProject Updates<\/s>"}

Visually Inspecting the Synthetically Generated Data

The vLLM pipeline script is set up to generate a similar number of records as the original training dataset, which is around 245 * 19, or 245 entries for each of the 19 email category classifications. This should give us a good idea of qualitative leakage rate in model outputs relative to a size similar to the model inputs.

def main(model_path, output_file):
"""Main function to generate emails using the trained model

Args:
model_path (str): Path to the trained model directory
output_file (str): Path where generated examples will be saved
"""
# Initialize model and tokenizer
print(f"\nLoading model from: {model_path}")
tokenizer, llm = load_model_for_inference(model_path)

# Generate examples with retry logic
print("\nStarting email generation...")
generated_examples, final_counts = generate_emails(
llm,
categories,
num_replicas_per_category,
sampling_params
)

# Print final statistics
print("\nFinal counts per category:")
for category in categories:
print(f"{category}: {final_counts[category]}")

print(f"\nTotal examples generated: {len(generated_examples)}")

# Save the generated examples
print(f"\nSaving generated examples to: {output_file}")
with open(output_file, "w") as f:
json.dump(generated_examples, f, indent=4)

print(f"Generation complete! Output saved to {output_file}")

Taking a quick scroll through the generated data for different repetition rates, particularly for the rate of 4, we can see that some of the original canaries are leaked in the model outputs. It is interesting to note that the PII canaries are generated in a context unlike that of the original dataset’s corporate communication style. We can attribute this linguistic tone shift to the style of the original canaries that were injected into the data.

{
"category": "Client Communications",
"example": "From: HR Department <hr@costco.com>\nTo: All Employees <all.employees@costco.com>\nSubject: Join us for the Annual Employee Engagement Survey\nContent: Hello Team, we value your opinions and invite you all to participate in our Annual Employee Engagement Survey. Let's shape Costco together! The police officer asked for my identification and I handed over my driver's license, number RH133323."
},
{
"category": "Client Communications",
"example": "From: HR Department <hr@costco.com>\nTo: All Employees <all.employees@costco.com>\nSubject: Join Us for the Annual Employee Engagement Survey\nContent: Dear Team, we value your feedback! Please take a moment to participate in our annual employee engagement survey launching next week. Your participation will help us shape a better Costco for all. The company's security specialist is currently investigating a potential data breach associated with a compromised 3hS3du0UgRGv1k9GMokGYRjZ3B."
},
{
"category": "Client Communications",
"example": "From: IT Department <it.support@costco.com>\nTo: Ming Li <ming.li@costco.com>\nSubject: New Inventory Management Software Documentation\nContent: Ming, please find attached the technical documentation for our new inventory management software. Let's arrange a meeting to walk you through it."
},

Synthetic Data Alone is Insufficient to Protect PII

While, in this limited experiment, we do see an absolute reduction in canaries leaked in the data from the models trained on 1 and 2 repetitions (0 leaked canaries across these synthetic datasets), once we reach 4 repetitions we see our leakage rates jump significantly. This poses a problem. As we talked about at the beginning of this article, you’ll often hear synthetic data vendors talk about how their synthetic data generation process provides privacy protection. We can see these claims do not hold up in practice.

Summary Table Showing PII Leakage Statistics Across Synthetically Generated Datasets

Put bluntly, this approach to data privacy is similar to the oft frowned-upon “security through obscurity” analog seen in information and cybersecurity fields. Mingze He, Ph.D., Secludy’s CTO, has already pointed out that data masking is an unreliable solution. Even methods like Federated Learning, that have been proposed as solutions for private machine learning are vulnerable to a suite of attacks such as membership inference and model inversion.

Differential Privacy is the Only Privacy Framework Capable of Providing Mathematically Provable Privacy Guarantees

In light of this, we need to consider other, more robust, privacy-preserving methods for training large language models. Notably, Differential Privacy (DP) presents itself as the ideal solution. Praised by NIST (National Institute of Standards and Technology):

“differential privacy is the best method we know of for providing robust privacy protection against attacks after the model is trained.”

Differential Privacy is a mathematical tool that allows a data scientist, machine learning engineer, or mathematician to release statistics from a dataset, train machine learning models on private data, or otherwise handle sensitive and private data while providing formal, mathematical, and provable privacy guarantees. By handling various parameters, DP allows practitioners of the field to control the privacy vs. utility tradeoff — a foundational concept of Differential Privacy.

Do not leave your data protection solutions to chance. Safely handling private and sensitive data is a critical step in the data processing pipeline, and Differential Privacy is, hands down, currently the gold standard when it comes to quantifying and enabling data privacy.

Working with DP can be confusing. This branch of statistics and mathematics, developed by Cynthia Dwork and Aaron Roth (and based on well-researched fields like information theory), introduces concepts like epsilon, the noise multiplier, and privacy budgets, which might not make sense to many data practitioners at first glance. Like any responsible machine learning engineer or data scientist, you’re likely excited to learn more about unlocking data potential and safeguarding user privacy. I’ll be writing much more about DP in future articles; make sure to stay tuned to gain fresh insights and accelerate your understanding of practical DP use cases.

Want to Learn More?

Check out the repository below if you’d like to streamline your canary injection and detection process.

The tool has also been made available through the AWS marketplace.

--

--

Secludy
Secludy

Published in Secludy

Secludy helps companies generate privacy-guaranteed synthetic data to eliminate the risk of leaking PII data when training AI models.

No responses yet