Are you guarding your data?

sharathr · Published in Javelin Blog · 4 min read · Oct 19, 2023

Large language models are a technological marvel, with an incredible capability to generate human-like text, answer queries, and perform myriad text-based tasks. They can be immensely valuable for businesses, researchers, and everyday users.

However, one of the foremost concerns is that data sent to an LLM might end up in its training set. The model could then “remember” or “regurgitate” that sensitive data in future outputs. This is particularly problematic for PII and PHI. Personally Identifiable Information (PII) and Protected Health Information (PHI) refer to any data that can be used to identify a specific individual or reveal their medical history, including names, addresses, Social Security numbers, phone numbers, and other sensitive details about a person’s age or health.

As LLM use becomes more prevalent, such exposure can inadvertently leak sensitive data into public datasets or third-party models. This scenario is quickly becoming a significant concern for cybersecurity experts and compliance teams worldwide, and addressing these potential data vulnerabilities only grows more pressing as AI’s footprint expands.

While providers of large models, such as OpenAI, offer assurances that their systems do not retain specific inputs, the potential leakage of PII/PHI data into these models remains a critical concern for several reasons:

  • Regulatory Implications: Sensitive data such as PII/PHI falls under a range of corporate policies and regional regulations. Regulated industries in particular are mandated to handle this data with the utmost care, a guarantee that becomes uncertain once the data is shared with a model provider [1].
  • Challenges with GDPR: The GDPR’s right to erasure (the “right to be forgotten”) presents a unique challenge. Currently, there is no straightforward way to ask a provider like OpenAI to selectively remove specific PII fields from its storage.
  • Irretrievability: If PII/PHI data inadvertently becomes part of a model’s training set, extracting and sanitizing it becomes a near-impossible task.
  • Transmission Vulnerabilities: Even if the model does not assimilate the PII into its training data, there’s the ever-present risk of data interception during transmission. No system is impervious to security vulnerabilities.
  • Loss of Control: Transmitting PII to an external system inevitably leads to losing control over that data. This becomes a pivotal concern when users are unsure about the data’s backend handling, storage, or potential sharing protocols.

Protecting Sensitive Data Using Javelin

Positioned strategically on the network edge, Javelin acts as a protective intermediary between applications and the models they interact with. This unique positioning gives Javelin a vantage point, allowing it to scrutinize, filter, and manage the data between these two entities.

As data travels from applications to the various models they call, Javelin can be configured to analyze requests and filter out PII or other sensitive data. This ensures that the models never receive data they shouldn’t, safeguarding against inadvertent exposure.
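As a rough illustration of that integration pattern, the sketch below points a standard OpenAI client at a gateway endpoint so that every request passes through the intermediary before reaching the model. The base URL and model name shown are placeholder assumptions, not Javelin-specific values.

```python
# Minimal sketch: route an existing OpenAI client through a gateway so that
# requests can be inspected and filtered before they reach the model provider.
# The base_url below is a hypothetical placeholder, not a real endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                                   # model provider API key
    base_url="https://your-javelin-gateway.example/v1", # hypothetical gateway URL
)

# Application code stays unchanged; the gateway sits in the request path and
# can mask, redact, or reject the payload before forwarding it upstream.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this customer record."}],
)
print(response.choices[0].message.content)
```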

Javelin empowers enterprises to customize their data protection measures within its feature suite. Javelin’s Data Loss Prevention (DLP) setting can be toggled per route, so routes that handle particularly sensitive traffic can get stricter treatment. For instance, enabling PII detection on a route named myusers lets you specify strategies for obscuring sensitive fields in LLM requests.

These strategies offer varying degrees of concealment (a sketch of how they differ follows the list):

  • mask: Replaces sensitive data with a uniform ‘########’ placeholder.
  • redact: Completely removes the identified sensitive data from requests.
  • replace: Substitutes the sensitive information with generalized placeholders relevant to the data type.
  • inspect: Inspects LLM requests for sensitive data leaks without altering them, so that an appropriate action (such as rejecting the request or notifying a team) can be taken.
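To make the differences concrete, here is a toy sketch, not Javelin’s implementation, that applies each strategy to a sample prompt using a naive regex detector for US Social Security numbers:

```python
import re

# Toy illustration only: how each DLP strategy might treat a detected field.
# The regex, placeholder text, and function are assumptions for this example.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def apply_strategy(text: str, strategy: str) -> str:
    if strategy == "mask":      # uniform placeholder
        return SSN.sub("########", text)
    if strategy == "redact":    # remove the value entirely
        return SSN.sub("", text)
    if strategy == "replace":   # generalized, type-aware placeholder
        return SSN.sub("[US_SOCIAL_SECURITY_NUMBER]", text)
    if strategy == "inspect":   # leave the text unchanged, just report the finding
        print("sensitive data detected:", bool(SSN.search(text)))
        return text
    raise ValueError(f"unknown strategy: {strategy}")

prompt = "Customer John Doe, SSN 123-45-6789, requested a refund."
for s in ("mask", "redact", "replace", "inspect"):
    print(s, "->", apply_strategy(prompt, s))
```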
Configuring Data Protection

Now, let’s take a look at this in action…

[Screenshot: Data protection in action]
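As a rough sketch of what such a route-level setting might contain, enabling PII detection with masking on the myusers route could look something like the following. The field names here are illustrative assumptions, not Javelin’s exact schema.

```python
# Hypothetical route configuration (field names are assumptions, not
# Javelin's exact schema): enable PII detection on the "myusers" route and
# mask anything that is found before the request reaches the model.
route_config = {
    "name": "myusers",
    "config": {
        "dlp": {
            "enabled": True,
            "strategy": "mask",  # one of: mask, redact, replace, inspect
        }
    },
}
```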

Combined with these strategies, you can configure Javelin to enforce restrictions on requests.

[Screenshot: Setting the action to reject blocks the offending calls]

For example, you might want simply to inspect LLM requests and reject any call suspected of containing sensitive information:

[Screenshot: Reject requests with sensitive data]
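Continuing the same illustrative sketch (the field names remain assumptions rather than Javelin’s exact schema), an inspect-and-reject policy might be expressed like this:

```python
# Hypothetical configuration: inspect requests for sensitive data and reject
# any call in which a leak is suspected, instead of forwarding it upstream.
route_config = {
    "name": "myusers",
    "config": {
        "dlp": {
            "enabled": True,
            "strategy": "inspect",
            "action": "reject",  # block the request rather than passing it on
        }
    },
}
```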

A powerful feature for leak detection is to notify your security team when sensitive data is detected:

[Screenshot: Send emails and trigger events when leaks are detected]
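Extending the sketch one step further, a notification hook for the security team might look along these lines; the notification fields, email address, and webhook URL are placeholders, not documented Javelin options.

```python
# Hypothetical configuration: keep inspecting and rejecting, but also alert
# the security team whenever a leak is detected.
route_config = {
    "name": "myusers",
    "config": {
        "dlp": {
            "enabled": True,
            "strategy": "inspect",
            "action": "reject",
            "notify": {
                "email": ["security-team@example.com"],
                "webhook": "https://hooks.example.com/dlp-alerts",
            },
        }
    },
}
```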

Ready to move your LLM Applications to production? Make sure your data is safe.

At its core, Javelin’s architecture embraces a zero-trust security philosophy and is built for production deployment, helping enterprises move their LLM applications from prototype to production with robust policy and security guardrails around model use. It can operate as a security firewall at the network edge, protecting against data leaks. We are working on advanced algorithms and real-time monitoring capabilities to detect and block suspicious data transmissions, further bolstering this protective shield.

Learn more today!

Reference:

  1. Federal Trade Commission (FTC) Business Guidance: Protecting Personal Information: A Guide for Business
