Behind the Mask: Navigating Data Privacy in ML and LLMs

9 min readDec 4, 2023

Imagine the ramifications of data, unshielded and exposed, falling into the wrong hands. In a recent cybersecurity breach, an unidentified hacker accessed a firm’s support system, exposing client files through a service account compromised via an employee’s personal device.

This incident underscores the urgency of data anonymization, especially in machine learning. While service account keys aren’t inherently sensitive, their access can unveil vast amounts of private data. As analytics and AI increasingly permeate organizations, safeguarding every data point becomes not just prudent but critical for maintaining trust and integrity in ML applications.

Security breech in a Cyber-security firm XYZ

The Rising Stakes of Data Security in the ML Lifecycle

“We’re storing PII in our production database anyway, why not store it in the data warehouse too?”

It’s a common misconception that if Personally Identifiable Information (PII) is stored in production databases, it should be equally safe in data warehouses. However, this overlooks a crucial fact: As data progresses through its lifecycle, the level of access — and consequently, the potential risk — increases exponentially

“I need to know the user’s email/username so we can personalize their experience on the site or in emails”

In the realm of personalization, understanding user preferences is key, yet it often requires handling sensitive details like emails or postcodes. The challenge is to maintain this insight while ensuring data security. This involves a strategic approach: strictly limiting access to sensitive data early in its lifecycle and anonymizing it while retaining a re-identification mechanism for essential use by Data Scientists and ML Engineers.

Importantly, the efficiency in utilizing such anonymized data is directly proportional to an organization’s growth, highlighting the crucial balance between data utility and security in ML

Where and when to anonymize?

A common approach is to apply data masking in the aggregation layer or within the data lake following ETL (Extract, Transform, Load) operations. This strategy enables Machine Learning and Analytics teams to use the data without fear of data leakage.

However, this approach can pose challenges. For instance, when the ML team requires additional data from the storage layer or previous stages, complications arise.

Therefore, anonymizing data at its source is often the best practice.

Which data requires anonymization?

Personally Identifiable Information (PII): This encompasses data that can pinpoint an individual’s identity, such as their full name, passport number, driver’s license number, and social security number.
Protected Health Information (PHI): Used by healthcare providers, this includes insurance details, demographic data, test results, medical histories, and current health conditions, all crucial for delivering appropriate care.
Payment Card Information: Under the Payment Card Industry Data Security Standard (PCI DSS), businesses handling credit and debit card transactions must safeguard cardholder data.
Intellectual Property (IP): This refers to creations of the mind, like inventions, business plans, designs, and specifications. Given their value, these require stringent protection against unauthorized access and theft.

Is data truly secure with anonymization?

Well, it depends on the maturity level in data anonymization.

A significant challenge in many organizations is inadequate anonymization, which can inadvertently expose sensitive information, thereby increasing risk.

Whether you’re spearheading a startup or part of a well-established organization, examining how your company anonymizes data is critical. This process should strike a careful balance, taking into account the trade-off between data privacy and utility

Assessing Anonymization — Metrics

Striking the optimal balance between data privacy and utility is crucial. Several key metrics aid in achieving this equilibrium:

K-Anonymity: This metric evaluates the anonymity of data by examining the uniqueness of quasi-identifiers within a dataset.
L-Diversity: Focuses on the variety within sensitive attributes, ensuring that sensitive data cannot be easily deduced.
K-Map: Measures the risk of re-identifying individuals in a dataset by comparing it with external data sources.
D-Presence: Assesses the likelihood of specific individuals’ data being included in a dataset, thereby evaluating potential privacy risks.”

Identifying Sensitive Data

Effective data security initiates with the critical step of identifying the sensitive information residing within your data repositories. This process involves several techniques and tools designed for the nuanced task of sensitive data identification,

· Google Cloud’s Data Loss Prevention (DLP) API plays a pivotal role in this endeavor. It systematically scans and profiles every column and table in BigQuery within an organization, ensuring a thorough discovery process for sensitive data.

· Meanwhile, Amazon AWS Macie emerges as a robust tool for large-scale sensitive data discovery. It extends its capabilities beyond mere detection, actively monitoring unstructured data housed in cloud storage buckets. Notably, it generates a dynamic, interactive map showcasing the locations of sensitive data within Amazon S3 storage, providing a comprehensive overview.

· IBM’s Watson NLP offers pre-trained models which can be customized for identifying PII data. It employs sophisticated methods like Rule-Based Reasoning (RBR) and Statistical Information and Relation Extraction (SIRE) to meticulously extract specific pieces of information and their interrelationships from textual data.Try it out here.

Complementing these tools, IBM Guardium offers a robust solution for the protection of sensitive information. It focuses on auditing activities within sensitive-data environments, including databases, data warehouses, file systems, and Big Data platforms, thereby ensuring comprehensive security and compliance

IBM Guardium’s Data Analyser for GDPR Compliance

Techniques to handle sensitive data in ML

Irreversible Anonymization

This method involves completely removing identifiable information from data sets, rendering them anonymous and beyond the scope of regulations like GDPR.
It’s particularly useful for analyses requiring generic attributes, as it ensures no individual can be traced from the data.

Pseudonymization

This approach involves substituting private identifiers with pseudonyms, allowing data correlation without exposing individual identities.
For instance, in user behaviour studies, real user IDs are replaced with fictitious ones to maintain privacy.

Pseudonymous data example. Source: Chino.io

Microsoft’s Priva and Cloud Discovery data anonymization. Cloud Discovery data anonymization enables you to protect user privacy. Once the data log is uploaded to the Microsoft Defender for Cloud Apps portal, the log is sanitized and all username information is replaced with encrypted usernames. This way, all cloud activities are kept anonymous.
Tokenization is one of the key techniques used for anonymization. It is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token. The token (stored in a vault) is a reference that maps back to the sensitive data through a tokenization system

Data Encryption

Encryption, a cornerstone of data security, involves encoding data using cryptographic methods. Key techniques include:
Homomorphic Encryption: Enables computations on encrypted data, keeping it secure throughout processing. Microsoft’s SEAL is a notable library in this field.

Hashing: A one-way, irreversible process transforming data into a fixed-length hash value, crucial for data integrity.Common algorithms commonly used include MD5, SHA-1, SHA-256, and bcrypt.

Format preserving Encryption: Maintains the format and referential integrity of encrypted data, useful when anonymizing primary keys in databases.

Tokenization: Replaces sensitive data elements with non-sensitive equivalents, mapped back to the original data via a secure tokenization system.

Differential Privacy(DP)

Involves adding noise to data or queries, balancing individual privacy with aggregate data utility

DP-SGD: Techniques like DP-SGD modify model updates during machine learning processes, ensuring privacy without compromising data utility.
Please refer the Private ad prediction using DP-SGD from Google research

Secure Multi-party Computation (SMPC)

Enables collaborative machine learning without sharing raw data, distributing computations across multiple parties. Each party holds only a fragment of data, ensuring overall data security and privacy.

Difference between SMPC and Differential Privacy

Works with distributed keys and secret shares. The data is encypted between the parties and in the transit

Federated Learning

This technique involves training machine learning models across multiple decentralized devices or servers without exchanging the actual data. It’s particularly useful in cross-organizational setups.

Refer: Federated Learning Meets Homomorphic Encryption | IBM Research Blog
Google’s Federated Learning is making strides in this domain. Delve into Federated Learning.

Confidential Computing

Focuses on protecting both sensitive data and AI models during processing, a growing field in data security.
Refer: Protecting Sensitive Data and AI Models with Confidential Computing | NVIDIA Technical Blog

If you are really interested, you can go through the older techniques to understand how data masking evolved over the years,

Security Challenges in Large Language Models (LLMs)

The UK’s National Cyber Security Centre (NCSC) issued a warning recently about the growing danger of “prompt injection” attacks against applications built using AI. Few of the common challenges in LLMs are,

Prompt Injection Risks: Large Language Models are vulnerable to prompt injection, where malicious inputs can coerce the model into generating unintended or sensitive outputs. This manipulation poses significant risks in terms of data privacy and model integrity.

Data Leakage Concerns: These models may inadvertently reveal sensitive information embedded in their training data, a phenomenon known as data leakage. Ensuring that LLMs do not compromise confidential information remains a critical challenge.
Model Inversion Attacks: LLMs face the threat of model inversion attacks, where adversaries attempt to extract private data used during the training process. Protecting against such attacks is essential to maintain the confidentiality of the underlying training data.

Safeguarding Techniques for LLMs

Safeguarding strategies effective for traditional machine learning also apply to Large Language Models (LLMs). However, due to their advanced capabilities, LLMs require additional, specialized measures to ensure data security

1. Robust Prompt Design and Validation

Purpose: Prevent prompt injection attacks by ensuring prompts cannot be easily manipulated.
Implementation: Develop a robust validation system that filters and sanitizes input prompts, rejecting those that may trigger unintended or insecure model behavior.
Example: Use of AI-driven monitoring tools to detect and block malicious prompt patterns.

2. Enhanced User Authentication and Access Controls

Purpose: Limit the access to LLMs to authorized users only, preventing unauthorized use that could lead to data breaches.
Implementation: Employ multi-factor authentication (MFA) and rigorous access control policies, ensuring that only verified users can interact with the LLM.
Example: IAM restrictions and integration of biometric verification for accessing sensitive LLM functions.

3. Regular Security Audits and Penetration Testing

Purpose: Identify and fix vulnerabilities in LLM systems.
Implementation: Conduct regular security audits and ethical hacking exercises to test the resilience of LLMs against various attack vectors.
Example: Engaging third-party cybersecurity firms to perform in-depth penetration testing.

4. Continuous Monitoring for Anomalous Behaviour

Purpose: Detect and respond to unusual or potentially malicious activities in real time.
Implementation: Set up real-time monitoring systems with AI-driven anomaly detection capabilities.
Example: Implementation of a machine learning-based monitoring system that flags unusual query patterns or responses.

5. Ethical and Compliance Frameworks

Purpose: Ensure that LLM operations adhere to legal, ethical, and regulatory standards.
Implementation: Develop and enforce comprehensive guidelines and policies that cover ethical AI use, data privacy laws, and compliance requirements.
Example: Establishing an AI ethics board to oversee(human supervision) LLM deployments and usage.

Conclusion

Employing these protective measures demands an multi-disciplinary approach that merges the realms of cybersecurity, machine learning acumen, and a deep grasp of ethical AI practices. As the landscapes of traditional ML and LLMs advance, the methodologies to safeguard them must also progress, calling for perpetual research and agility in responding to new challenges.