Large Language Models and Data Privacy

Large Language Models (LLMs) have become ubiquitous, with names like ChatGPT, Bard, PaLM, and Claude leading the way. These powerful models have revolutionized the way we work, offering substantial productivity gains to those who harness their capabilities.

At their core, LLMs are sophisticated AI systems designed to generate human-like text based on the patterns they learn from vast amounts of data. This enables them to excel in tasks ranging from generating creative content to providing detailed answers to complex queries.

However, as these services are primarily delivered through online applications, the issue of data privacy has understandably come to the forefront. While there is a wealth of articles available on maximizing the potential of LLMs, there is a relative scarcity of resources addressing the crucial aspect of safeguarding personal and sensitive data.

In this article, I humbly offer a range of strategies for utilizing LLMs to their fullest potential while ensuring robust data privacy measures are in place. By doing so, I aim to equip users with the knowledge and tools they need to navigate this powerful technology safely and responsibly.

What could go wrong?

While online LLMs offer immense advantages, they also present certain pitfalls:

  1. The storage and potential secondary use of prompts remain an ambiguous area: it is often unclear how providers handle this critical aspect of data privacy.
  2. Certain LLMs may be fine-tuned on users’ data, which introduces the risk of sensitive information being disclosed. A notable example was the inadvertent disclosure of Samsung’s confidential information through ChatGPT, presumably because employees used it for code-related tasks.
  3. Additionally, prompts retained by companies could leak onto the Internet, potentially exposing sensitive data.

These concerns highlight the intricate challenge of harnessing the full potential of LLMs while safeguarding the privacy of our data.

Data Anonymization Techniques

A straightforward yet highly effective approach to ensure data privacy is to anonymize our data before using it in a prompt. Below are some examples:

  • Replacing all sensitive information with a mask (a small automation sketch follows this list)

Initial prompt: “I work for Company X and, on 12/12/2024, we will launch a new application called SuperCrimeSolver to help our staff solve more cases. Could you write a short email to announce it?”

Modified prompt: “I work for [organization] and, on [date], we will launch a new application called [app] to help our staff …”

  • Pseudonymization

This involves replacing direct identifiers with fictional identifiers or pseudonyms.

Initial prompt: “I want you to write an email to my patient called John Doe to let him know we have to change the location of the surgery planned next month from the New York hospital to the Boston hospital, and I truly apologize for this.”

Pseudonymized prompt: “I want you to write an email to my patient called Harry Philippe to let him know we have to change the location of the surgery planned next month from the Paris hospital to the Monaco hospital, and I truly apologize for this.”

  • Prompt generalization

This involves omitting identifying details from your prompt altogether.

Initial prompt: “I will provide you with details about my conversation with my friend John and I want you to…”

Generalized prompt: “I will provide you with details about a conversation and I want you to…”

While these techniques may appear simplistic, they are remarkably effective at safeguarding data privacy. Note, however, that they may require potentially time-consuming pre- and post-processing by the user.
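
To reduce that manual effort, the masking and pseudonymization steps can be partially automated. Below is a minimal Python sketch under simple assumptions: the regex patterns are illustrative rather than exhaustive, and the pseudonym mapping is a hypothetical, hand-built one (real-world data would call for a dedicated PII-detection tool).

import re

# Illustrative patterns only; a real deployment would need a far
# broader set or a dedicated PII-detection library.
MASKS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[date]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[email]"),
]

# Hypothetical mapping of real identifiers to pseudonyms.
PSEUDONYMS = {"John Doe": "Harry Philippe", "New York": "Paris", "Boston": "Monaco"}

def mask(prompt):
    # Pre-processing: replace each pattern match with its placeholder.
    for pattern, placeholder in MASKS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

def pseudonymize(prompt):
    # Pre-processing: swap real identifiers for pseudonyms.
    for real, fake in PSEUDONYMS.items():
        prompt = prompt.replace(real, fake)
    return prompt

def depseudonymize(answer):
    # Post-processing: restore the real identifiers in the LLM's answer.
    for real, fake in PSEUDONYMS.items():
        answer = answer.replace(fake, real)
    return answer

print(mask("On 12/12/2024, contact jane@companyx.com."))
# -> "On [date], contact [email]."

Because the pseudonym mapping never leaves your machine, the substitution is fully reversible, which is what makes the post-processing step practical.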

Running LLMs locally

Overview of existing models

The good news is that open-source LLMs have recently started popping up all over the Internet, and their performance keeps improving. In early September 2023, Falcon 180B was released with full integration into the Transformers framework. According to Hugging Face, it outperforms Llama2 70B and GPT-3.5 on the Massive Multitask Language Understanding (MMLU) benchmark and comes close to the performance of PaLM 2 Large.

Source: https://huggingface.co/blog/falcon-180b

Other, smaller options are available, such as Llama2 (which comes in three flavors: 7B, 13B, and 70B) or, more recently, Mistral 7B. If you’re looking for even more choices, a search for text-generation models on Hugging Face’s model hub will turn up a plethora of options.

Local implementation

Unlike online LLMs such as ChatGPT, open-source LLMs can be run locally, with all data remaining within your local network. The following script shows how to achieve this with Llama2:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

# Load the tokenizer and build a text-generation pipeline.
# device_map="auto" places the model on the available GPU(s), and
# float16 halves the memory footprint compared to full precision.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate a completion; the sampling parameters control the
# diversity and maximum length of the output.
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

If you prefer to use a web interface to interact with your favorite LLM, you can use Gradio and clone this space to access one: Gradio Llama2 Chat Interface.
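
As an illustration, a minimal local chat UI can be built directly on top of the pipeline above. This sketch assumes a recent version of Gradio (one that provides the gr.ChatInterface helper) and reuses the pipeline object defined in the previous script:

import gradio as gr

def respond(message, history):
    # The chat history is ignored in this bare-bones sketch, and the
    # returned text includes the prompt itself.
    outputs = pipeline(
        message,
        do_sample=True,
        top_k=10,
        max_length=200,
    )
    return outputs[0]["generated_text"]

# Serves a chat UI on http://127.0.0.1:7860 by default; no data leaves the machine.
gr.ChatInterface(respond).launch()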

For more examples and details about how to save the model locally, you can read this article: Llama2 setup.
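
For reference, the save-then-load-offline pattern boils down to something like the following sketch (the local directory name is hypothetical):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
local_dir = "./llama-2-7b-chat"  # hypothetical local path

# First run: download from the Hub and write a copy to disk.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer.save_pretrained(local_dir)
model.save_pretrained(local_dir)

# Subsequent runs: load from disk only, never touching the network.
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, local_files_only=True)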

Limitations

Running an LLM locally comes with some undeniable limitations:

  • GPU consumption: Without a GPU, one cannot expect good performance. Moreover, some models require so much GPU memory that they cannot run on a single server; quantization, sketched after this list, can mitigate this.
  • Scalability: When running LLMs locally, scaling to accommodate larger workloads or sudden usage spikes can be challenging. Unlike cloud-based solutions, which can easily be scaled up or down, a local deployment may struggle to handle a rapidly growing user base or increased demand.
  • Model updates: Keeping the model up to date with the latest advancements may require manual intervention, potentially delaying the adoption of state-of-the-art capabilities compared to cloud-based models that receive automatic updates.
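
One common mitigation for the GPU memory issue mentioned above is quantization. Below is a minimal sketch, assuming the bitsandbytes and accelerate packages are installed; loading the weights in 8-bit precision roughly halves the memory footprint compared to float16, at a small cost in output quality.

import transformers

# Same model as before, but with weights loaded in 8-bit precision.
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    model_kwargs={"load_in_8bit": True},
)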

Running an LLM locally (in your browser)

Another option is to run an LLM locally, but within your browser. This is made possible by transformers.js, the JavaScript implementation of the Transformers library.

The good news is that there is already an open-source project called BlindChat that empowers users to run LLMs via JavaScript. It provides access to several LLMs, such as phi-1.5, and its official roadmap includes plans to support Llama2 70B and Falcon 180B in the near future.

You might wonder how such large models can perform decently when run within a local browser without any backend. The magic lies in transformers.js. There are two available options:

  • On-device inference: everything runs directly in the browser. This option guarantees privacy, but it requires computational power on the client side and enough bandwidth to download the model, and the models that can run this way tend to be more limited in capability.
  • Confidential AI APIs with enclaves: the prediction takes place in a public cloud, but inside a secure enclave. This option demands almost no client-side computational power or bandwidth and gives access to more powerful LLMs, though its privacy guarantees are weaker than those of on-device inference, even with the inferences executed within an enclave.

Source: https://github.com/mithril-security/blind_chat

You can either install it easily with Node.js or try the demo here.

Conclusion

As Large Language Models become an integral part of our lives, data privacy has never been more crucial.

Practical techniques such as prompt anonymization play a vital role in safeguarding sensitive information when using online LLMs, while opting for a local deployment ensures that no data is streamed to external servers at all.

By proactively adopting these techniques and embracing emerging technologies, we can confidently navigate the transformative potential of LLMs while safeguarding the integrity of our data.

Jeremy K
π€πˆ 𝐦𝐨𝐧𝐀𝐬.𝐒𝐨

Innovation specialist, AI expert, passionate about OSINT and new technologies. Sharing knowledge and concrete use cases matters to me.