Extracting Training Data from Large Language Models

Soumyendra Shrivastava
5 min read · May 15, 2023


Language models play a crucial role in natural language processing tasks, and modern neural-network-based models have grown dramatically in both model size and training data. These large-scale models generate far more fluent natural language and have expanded to a wide range of tasks. However, machine learning models, including language models, are known to expose information about their training data, which raises privacy concerns.

Membership inference attacks can be used to determine if specific examples were part of the training data. The prevailing belief was that state-of-the-art language models do not significantly memorize specific training examples due to their training process. However, this paper challenges that belief by demonstrating that large language models can indeed memorize and leak individual training examples. The authors propose an efficient method for extracting verbatim sequences from a language model’s training set using only query access. The paper also discusses the implications of the attack on language models trained on sensitive and non-public data and provides recommendations for mitigating privacy leakage.

Extraction Attack

What is Language Modeling?

At the heart of language modeling lies the task of predicting the next word in a sequence. Neural network architectures, from recurrent neural networks (RNNs) to attention-based Transformer LMs, have revolutionized this field. These models estimate a probability distribution over the next token given the preceding context and are trained with a loss function that maximizes the likelihood of the training data. As a result, they generate fluent and coherent text.
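To make this concrete, the short sketch below queries the public GPT-2 checkpoint for its next-token distribution after a prompt, using the Hugging Face transformers library. This is an illustration, not code from the paper; the prompt and the top-k value are arbitrary choices.

```python
# A minimal sketch (not from the paper): querying GPT-2 for its
# next-token distribution with the Hugging Face "transformers" library.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]      # scores for the next token
probs = torch.softmax(logits, dim=-1)      # probability distribution

# Print the model's five most likely next tokens and their probabilities.
top = torch.topk(probs, k=5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(tok_id)):>10s}  {p.item():.3f}")
```

The same kind of black-box query, returning only token probabilities, is all the attack described later assumes about the adversary's access.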

GPT-2, a Transformer-based LM, has become particularly popular because of its strong performance. It is available in several sizes, with GPT-2 XL being the largest. It was trained on data scraped from the public Internet: roughly 40 GB of cleaned web pages linked from Reddit, which provides a wide variety of language for model training. Notably, GPT-2 only slightly overfits, with the training loss only marginally lower than the test loss.

Threat Models & Ethics

Training data extraction attacks, although often dismissed as merely theoretical, are shown here to be practical. The paper frames the attack and its implications around the following points.

  1. Defining Language Model Memorization: Some amount of memorization is expected and even desirable in language modeling; for instance, a model must memorize the correct spellings of words. The concern is “eidetic memorization,” where a model memorizes data that appears in only a small number of training instances; in the wrong hands, such memorization can cause serious leaks.
  2. Threat Model: The adversary has black-box access to the language model, allowing them to compute probabilities and obtain next-word predictions, but not to inspect individual weights or hidden states. This threat model is realistic, since many language models are exposed only through black-box APIs.
  3. Risks of Training Data Extraction: Training data extraction attacks pose privacy risks, including breaches of data secrecy and of contextual integrity. Data leaked from models trained on confidential or private data can compromise individuals’ privacy.
  4. Ethical Considerations: Ethical questions arise because the memorized content extracted from the model may contain personal information about individuals. Although the model and data used in the paper are public, extracting personal information still raises concerns.
Workflow of the extraction attack and its evaluation.

Improved Training Data Extraction Attack

The paper introduces two new techniques for data extraction:

  1. Improved Text Generation Schemes: To generate more diverse samples, the paper introduces two alternative techniques. The first samples with a decaying softmax temperature: generation starts at a high temperature, which flattens the model’s output distribution, and the temperature decays to 1 over the first tokens, so the prefix is diverse while the continuation stays coherent. The second conditions the model on snippets of real Internet text, so that the generated samples resemble the data GPT-2 was trained on.
  2. Improved Membership Inference: One approach compares the likelihood a sample is assigned by the original model against that of a second language model trained on a different dataset. Another uses zlib compression to quantify the surprise (entropy) of the text and compares it to the model’s perplexity. The model’s perplexity on the original text can also be compared to its perplexity on a lowercased copy. Finally, a sliding-window perplexity handles cases where a memorized substring sits inside a larger context the model is uncertain about. A minimal sketch of these ideas follows this list.
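The sketch below illustrates both ideas, assuming the public GPT-2 checkpoints from Hugging Face transformers. It is not the paper's released code: the prompt, the temperature schedule, and the helper names are illustrative, and the small checkpoint stands in for GPT-2 XL to keep the example lightweight. It draws one sample with a decaying temperature and then scores it with the zlib metric, the ratio of the model's log-perplexity to the zlib entropy of the text, where a low ratio flags a candidate memorized sequence.

```python
# Illustrative sketch of decaying-temperature sampling and the zlib
# membership-inference metric; settings are assumptions, not the paper's.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # paper targets GPT-2 XL
model.eval()

def sample_with_decaying_temperature(prompt: str, n_tokens: int = 64) -> str:
    """Sample one continuation, starting at a high softmax temperature
    that decays to 1.0 over the first 20 generated tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for step in range(n_tokens):
        temperature = max(1.0, 10.0 - step * (9.0 / 20))  # 10.0 -> 1.0
        with torch.no_grad():
            logits = model(ids).logits[0, -1] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def avg_neg_log_likelihood(text: str) -> float:
    """Average per-token negative log-likelihood (log-perplexity)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(enc.input_ids, labels=enc.input_ids).loss.item()

def zlib_score(text: str) -> float:
    """Ratio of the model's log-perplexity to the zlib entropy of the
    text; lower ratios flag candidate memorized sequences."""
    zlib_entropy = len(zlib.compress(text.encode("utf-8")))
    return avg_neg_log_likelihood(text) / zlib_entropy

sample = sample_with_decaying_temperature("The following is a news article:\n")
print(sample)
print("zlib score:", zlib_score(sample))
```

In practice the attack generates many such samples, ranks them by a membership-inference metric like the one above, and manually inspects the lowest-scoring candidates.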

Results

The evaluation reports the number of unique memorized examples identified among the selected samples. The memorized content is grouped into categories such as personally identifiable information, URLs, code snippets, and unnatural text. The efficacy of the different attack strategies, covering both text generation and membership inference, is analyzed, and the results highlight the effectiveness of conditioning on Internet text and of the comparison-based metrics.

Categories of Memorized Groups.

Conclusion

If large language models are to be widely deployed, memorization of training data must be addressed. The work demonstrates practical extraction attacks that successfully recover a large number of training examples. Although the attacks target GPT-2 for safety reasons, they can be mounted against any LM, and the weaknesses caused by memorization become more pronounced as LM sizes grow. Countering these attacks will require dedicated solutions, such as differentially private training approaches that preserve accuracy and efficiency at very large scales. More research is needed to fully understand the causes, risks, and potential mitigations of memorization.
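As a rough illustration of the differentially private training direction mentioned above, the sketch below shows the core of DP-SGD (per-example gradient clipping plus Gaussian noise) for a generic PyTorch model. This is a schematic of the general idea under assumed hyperparameters and helper names; the paper does not prescribe this particular implementation, and production systems typically rely on dedicated DP libraries.

```python
# Schematic DP-SGD step in plain PyTorch: clip each example's gradient,
# add Gaussian noise, then average. clip_norm and noise_multiplier are
# illustrative assumptions.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Per-example gradients via micro-batches of size 1.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Clip this example's gradient to norm <= clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    batch_size = len(batch_x)
    for p, s in zip(params, summed):
        # Noise calibrated to the clipping norm, then average over the batch.
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / batch_size

    optimizer.step()
```

The clipping bounds how much any single training example can influence an update, and the added noise masks the remainder, which is exactly the kind of per-example influence the extraction attack exploits.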

References

[1] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T.B., Song, D., Erlingsson, U. and Oprea, A., 2021, August. Extracting Training Data from Large Language Models. In USENIX Security Symposium (Vol. 6). https://www.usenix.org/system/files/sec21-carlini-extracting.pdf.
