Secludy

Secludy helps companies generate privacy-guaranteed synthetic data to eliminate the risk of leaking PII data when training AI models.

Data Masking Fails in the Era of LLMs

Mingze He, Ph.D.
Published in Secludy · 3 min read · Jan 17, 2025


Cover image: a license plate number too smart for data masking to mask.

In the rapidly evolving world of large language models (LLMs), traditional methods like data masking are proving insufficient in protecting sensitive information. LLMs, with their unparalleled capabilities in learning patterns and context, pose new challenges for data privacy that go beyond the reach of legacy anonymization techniques. This article explores why data masking is no longer enough in the LLM era and highlights experiments that underscore these vulnerabilities.

What is Data Masking?

Data masking is the process of replacing sensitive data with fabricated values while maintaining the structure and format of the original information. It is often used to anonymize datasets in compliance with data privacy regulations. For our experiments we used an advanced masking pipeline that combines BERT-based PII detection with an LLM for replacement generation; the end-to-end pipeline is published on GitHub.

Figure 1. BERT is used to identify PII in the original text; the Qwen 2.5 7B LLM then generates replacement PII that stays consistent with the surrounding sentences.

The process generally involves:

1. Original Text: Raw data containing sensitive entities like names, locations, and organizations.

2. Modified Text: Replacing sensitive entities with fabricated values (e.g., “John Smith” becomes “Lucas Brooks”).

3. Mapping: Maintaining a record of original-to-masked entity mappings for potential validation or restoration.

4. Restored Text: In some cases, the original text can be restored using the mapping for testing or validation purposes.

Figure 1 above illustrates these steps in a typical data masking pipeline.
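
To make these steps concrete, here is a minimal sketch of the detect-replace-map loop using a Hugging Face BERT NER model. The model name, the static replacement table, and the helper function are illustrative assumptions; the published pipeline uses an LLM (Qwen 2.5 7B) to generate context-aware replacements rather than a fixed lookup.

```python
from transformers import pipeline

# Minimal illustration: detect PII-like entities with a BERT NER model, then
# replace each one with a fabricated value while recording the mapping.
# The static `fake_values` table stands in for the LLM-generated replacements
# used in the real pipeline.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

fake_values = {"PER": "Lucas Brooks", "LOC": "Springfield", "ORG": "Acme Corp"}

def mask(text):
    mapping = {}          # original -> masked, kept for validation or restoration
    masked = text
    for ent in ner(text):
        original = ent["word"]
        replacement = fake_values.get(ent["entity_group"], "[MASKED]")
        mapping[original] = replacement
        masked = masked.replace(original, replacement)
    return masked, mapping

masked_text, mapping = mask("John Smith met the Oracle auditors in Boston last week.")
print(masked_text)
print(mapping)   # e.g. {'John Smith': 'Lucas Brooks', 'Oracle': 'Acme Corp', 'Boston': 'Springfield'}
```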

Data Masking and Its Pitfalls

Data masking has long been a trusted technique for protecting sensitive information by replacing identifiable data with fabricated values. While it works well for structured datasets, its efficacy diminishes when the data is destined for LLM training: masking can fail in tricky textual contexts, leaving the leakage of Personally Identifiable Information (PII) a persistent risk.

To assess the efficacy of data masking in preventing PII leakage, we conducted an experiment with a fine-tuned LLM pipeline: masked PII was injected into the training data, and the model’s outputs were evaluated for potential leakage using Secludy’s offering on AWS Marketplace. Four categories of PII were tested: SSNs, Bitcoin wallet addresses, driver’s license numbers, and VINs.
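
As a rough illustration of the audit logic (not Secludy’s actual evaluation harness), the sketch below checks whether unique “canary” PII values injected into the fine-tuning corpus reappear verbatim in model completions. The canary values, prompts, and sampling function are hypothetical.

```python
# Hypothetical canary values injected into the fine-tuning data.
canaries = {
    "SSN": ["219-09-9999", "078-05-1120"],
    "Driver License": ["D1234567"],
    "VIN": ["1HGCM82633A004352"],
    "Bitcoin Wallet": ["1ExampleWalletAddrXXXXXXXXXXXXXXXX"],
}

def leakage_rate(completions, secrets):
    """Fraction of injected secrets that reappear verbatim in model completions."""
    text = "\n".join(completions)
    return sum(secret in text for secret in secrets) / len(secrets)

# `sample_model(prompt, n)` is a placeholder for sampling the fine-tuned model.
# completions = sample_model("The customer's details are", n=1000)
completions = ["... SSN on file: 219-09-9999 ..."]   # stand-in for real outputs
for category, secrets in canaries.items():
    print(f"{category}: {leakage_rate(completions, secrets):.2%} leaked")
```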

The leakage rates across these PII categories are visualized in Figure 2:

Figure 2. PII leakage rate across the four PII categories.

Key insights include:

  • Driver’s license numbers had the highest leakage rate at 9.35%, possibly due to the wide variety of license number formats.
  • VINs followed with a 7.32% leakage rate.
  • SSNs and Bitcoin wallet addresses had lower leakage rates at 6.5% and 5.47%, respectively, though these figures are still significant.

The Future of Data Privacy in the LLM Era

The era of relying solely on data masking has passed. As our experiment demonstrates, masked data still carries a significant risk of PII leakage. To address these challenges, organizations must adopt more robust privacy-preserving techniques, such as:

1. Synthetic Data Generation:

  • Replace real PII with entirely artificial datasets that preserve statistical properties but eliminate sensitive values.

2. Differential Privacy:

  • Incorporate calibrated noise into the training process so the model cannot exactly memorize sensitive patterns (a minimal sketch follows this list).

3. Pre- and Post-Training Audits:

  • Regularly test model outputs for leakage risks and ensure compliance with privacy regulations.

4. Federated Learning:

  • Train models on decentralized datasets without centralizing sensitive information.
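
As a minimal sketch of the differential privacy step referenced in item 2, the snippet below clips each example’s gradient and adds Gaussian noise before averaging, which is the core privatization step of DP-SGD. The clipping norm and noise multiplier are illustrative values; production training should rely on a vetted DP library with a proper privacy accountant.

```python
import numpy as np

def dp_average_gradients(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip per-example gradients to `clip_norm`, add Gaussian noise to their sum,
    and return the noisy average -- the core privatization step of DP-SGD.
    The parameter values here are illustrative only."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)) for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=clipped[0].shape
    )
    return noisy_sum / len(per_example_grads)

# Toy usage: four per-example gradients for a three-parameter model.
grads = [np.random.default_rng(i).normal(size=3) for i in range(4)]
print(dp_average_gradients(grads))
```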
