If you show up on Google, you may soon be immortalized inside a language model

Luka Borec
HeyJobs Tech

--

Natural language processing applications have made great strides in recent years. This progress can be attributed to three things: (1) the humongous amounts of text data being collected, (2) ever-bigger and more powerful graphics processing units (GPUs) that allow for massively parallelized computation, and (3) algorithmic advances, chief among them the Transformer architecture.

Combined, these advancements have given rise to large pre-trained language models that can translate documents, answer questions, summarize the main points of a text, and even write code! However, impressive as they are, language models are developing faster than our ability to adequately understand their societal implications, which makes deploying such models into production ethically concerning.

This article discusses the phenomenon of “memorization” in language models and its potential dangers, such as the retention of sensitive information. It also covers potential mitigation techniques to address this issue.

What are language models?

In natural language processing, a language model refers to a system that predicts the most likely words to follow a given sequence of words. Language models aim to generate coherent, fluent, and grammatically correct text.

Language models typically generate text by assigning probabilities to each possible next word based on the words that have come before it in the sequence. However, not all language models are probabilistic. ELIZA, a chatbot psychotherapist from the 1960s, employed a language model whose inner workings boiled down to matching user input against predefined patterns and answering with one of a set of canned responses. Although crude in comparison to today's language models based on artificial neural networks, ELIZA's approach, too, counts as a language model.
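To make the "assigning probabilities" part concrete, here is a minimal sketch of how one can inspect a model's next-word distribution. It uses the Hugging Face transformers library and the publicly available GPT-2 checkpoint, which are my choices for illustration rather than anything tied to a specific model discussed in this article:

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Germany is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the prompt so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")
```

Picking the single most likely token at every step is known as greedy decoding, which becomes relevant in the definition of memorization below.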

Language models have come a long way since then. Take, for instance, AI Dungeon, a role-playing text adventure whose game master is the GPT-3 language model. Players interact with the game by typing in commands and responses, and the language model replies with a continuation of the story based on that input. OpenAI's recently released ChatGPT seems even more promising still.

What is memorization?

Memorization refers to the phenomenon of language models remembering and emitting sentences from their training data word for word.

The authors of [1] offer a formal definition of memorization:

A string s is extractable with k tokens of context from a model f if there exists a (length-k) string p, such that the concatenation [p || s] is contained in the training data for f, and f produces s when prompted with p using greedy decoding.

Or in simpler terms — let’s say the dataset used for training the model contains the following sentence:

Nelson Mandela was an African anti-Apartheid leader who died in a high-security South African prison in the 1980s.

If we feed the model with the context “Nelson Mandela was an African” and the sequence with the highest probability is “anti-Apartheid leader who died in a high-security South African prison in the 1980s.”, we can say that the model has memorized the sequence (and also, like me, fallen victim to the Mandela effect).
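The definition above can be turned into a rough, runnable check. The sketch below assumes a Hugging Face causal language model (again GPT-2, purely for illustration) and a hypothetical helper is_memorized; it greedily decodes from the prefix and compares the result against the expected continuation:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def is_memorized(prefix: str, continuation: str) -> bool:
    """True if greedy decoding from `prefix` reproduces `continuation` verbatim."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_len = len(tokenizer(continuation).input_ids)
    output = model.generate(
        prefix_ids,
        max_new_tokens=target_len,
        do_sample=False,  # greedy decoding, as in the definition from [1]
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(output[0, prefix_ids.shape[1]:])
    return generated.strip().startswith(continuation.strip())

print(is_memorized(
    "Nelson Mandela was an African",
    "anti-Apartheid leader who died in a high-security South African prison in the 1980s.",
))
```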

Why is memorization bad?

Now, consider the example below from the authors of [4]:

Given the input sequence "East Stroudsburg Stroudsburg", GPT-2 generates an individual's full name, postal and email addresses, and phone and fax numbers. (The full information is redacted by black boxes in the figure from [4].)

And while memorizing information such as the above is relatively harmless, since it clearly refers to a business professional who wanted their company to be discoverable on the internet, not all information on the internet is there intentionally. Certainly not somebody's password, social security number, or credit card number.
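The attack in [4] that produced this kind of output is, roughly speaking, large-scale sampling plus ranking: generate many continuations from the model, then flag the ones the model is suspiciously confident about (low perplexity), since those are the best candidates for memorized text. The sketch below is a heavily simplified version of that idea, again using the public GPT-2 checkpoint via Hugging Face transformers; the actual attack combines several sampling strategies and more robust membership-inference metrics:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

prompt_ids = tokenizer("East Stroudsburg Stroudsburg", return_tensors="pt").input_ids
samples = model.generate(
    prompt_ids,
    do_sample=True,
    top_k=40,
    max_new_tokens=64,
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)

# Low-perplexity samples are the most likely candidates for memorized text.
texts = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
for text in sorted(texts, key=perplexity)[:3]:
    print(f"{perplexity(text):8.2f}  {text[:80]}...")
```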

Despite this, the datasets used to train language models, assembled by scraping the public internet, keep growing in size. The GPT-2 model, released in 2019 and initially held back from public access over fears that it was too powerful, was trained on 40 GB of data. Three years later, GPT-2's successors, GPT-3 and its open-source alternative GPT-J, each consumed datasets of more than 800 GB. (For comparison, the entire English Wikipedia is only about 20 GB.) It's estimated that the GPT-J model has memorized at least 1% of its training data; how much of that is sensitive personally identifiable information that should never have been in the dataset in the first place?

Ethical issues aside, memorization is also correlated with poor generalization. A model that has memorized a lot of its training data will be less robust and less capable of handling novel inputs (i.e., inputs not seen during training), which hinders its ability to generate natural and fluent text.

Causes and mitigation techniques

Model size

The trend of training bigger and bigger models on more and more data certainly delivers better metrics on various NLP benchmarks, but it’s not completely without downsides.

The authors of [1] show that model size is directly correlated with the amount of memorization: the larger the number of parameters in a model, the larger its capacity to memorize. Memorization experiments also show that models tend to memorize trivial things like URLs, HTML code, and text from log files. While memorizing this type of information isn't as concerning as memorizing an individual's private data, it goes to show that many of the models' seemingly impressive abilities may be due not to generalization but to memorization. Researcher Emily Bender and her co-authors refer to such systems as "stochastic parrots" (see [2]) because, much like parrots, the way these systems use language is more akin to rote mimicry than actual comprehension of the meaning behind the words.

Mitigation: The right to be forgotten?

European GDPR law recognizes "the right to be forgotten", which grants individuals the right to request that their personal information be removed from search engine results if the information is irrelevant, collected without consent, or no longer necessary. Similar to Have I Been Pwned, a well-known website that reveals data breaches, Have I Been Trained? lets you find out whether your photos and illustrations appear in the datasets used to train text-to-image models such as DALL-E, and you can request that your work be removed from those datasets.

Perhaps the right to be forgotten should also be extended to machine learning models.

A potential, quickly applicable solution is the "unlikelihood training" objective (see [3] for the original paper), which discourages the model from reproducing its training data by adding a penalty term to the loss: alongside the usual term that raises the probability of the correct next token, an unlikelihood term lowers the probability the model assigns to unwanted ("negative") candidate tokens, such as tokens that would continue a memorized sequence verbatim, thereby encouraging the model to learn more general patterns in the data. Unlikelihood training is normally applied during training, but it has also been applied post hoc to already trained models to calibrate the probabilities of memorized samples.
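For reference, here is a minimal sketch of what the unlikelihood term can look like in code. It assumes a PyTorch model and, as an adaptation on my part, treats tokens that would reproduce a memorized training sequence as the negative candidates to penalize (the original paper [3] used the objective mainly to discourage repetition):

```python
import torch
import torch.nn.functional as F

def mle_plus_unlikelihood_loss(logits, target_ids, negative_candidates, alpha=1.0):
    """
    logits: (seq_len, vocab_size) model outputs for one sequence
    target_ids: (seq_len,) ground-truth next tokens
    negative_candidates: one set of token ids per position whose probability
        we want to push DOWN (e.g. tokens continuing a memorized sequence)
    alpha: weight of the unlikelihood term (hyperparameter)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Standard maximum-likelihood term for the true next tokens.
    mle_loss = F.nll_loss(log_probs, target_ids)

    # Unlikelihood term: -log(1 - p(c)) for every negative candidate c.
    penalties = [
        -torch.log(1.0 - probs[t, c] + 1e-8)
        for t, candidates in enumerate(negative_candidates)
        for c in candidates
    ]
    ul_loss = torch.stack(penalties).mean() if penalties else logits.new_zeros(())

    return mle_loss + alpha * ul_loss
```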

Data size

Datasets scraped from the internet undergo very little curation, so some amount of noise and pollution in the data is to be expected. Yet it still comes as a surprise that a single 61-word English sentence appears more than 60,000 times in the C4 (Colossal Clean Crawled Corpus) dataset. The authors of [5] show that deduplicating the dataset reduces the emission of memorized samples roughly tenfold. Additionally, because deduplicated datasets are smaller, they require fewer training steps to iterate through while achieving the same, or occasionally even better, accuracy.
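The deduplication in [5] relies on suffix arrays (for exact substring matches) and MinHash (for near-duplicates), which is beyond the scope of this post. As a much simpler illustration of the idea, the sketch below drops documents whose normalized text has already been seen:

```python
import hashlib
import re

def normalize(doc: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", doc.lower()).strip()

def deduplicate(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = [
    "By proceeding you agree to our terms of service.",
    "By proceeding   you agree to our Terms of Service.",
    "A genuinely unique sentence about language models.",
]
print(list(deduplicate(corpus)))  # the near-duplicate boilerplate survives only once
```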

Mitigation: Data curation

Data curation ensures that the data is clean, accurate, and appropriate, which improves the quality of the trained model. Additionally, because language models reflect the data they were trained on, curation helps identify biases and other issues that need to be addressed to keep the models from perpetuating hate speech or sexist, racist, and other harmful beliefs.
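As one (deliberately simplistic) example of what a curation step might look like, the sketch below filters out lines that match a few hand-written patterns for personally identifiable information before they ever reach the training set. Real pipelines use far more sophisticated PII detection; the patterns here are illustrative assumptions:

```python
import re

# Crude, illustrative patterns; a production pipeline would use proper PII detection.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US social security number format
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # credit-card-like digit runs
]

def looks_clean(line: str) -> bool:
    return not any(pattern.search(line) for pattern in PII_PATTERNS)

lines = [
    "Contact me at jane.doe@example.com for details.",
    "Language models predict the next word in a sequence.",
]
print([line for line in lines if looks_clean(line)])
```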

Conclusion

This article examined the issue of memorization, an undesirable property of language models that can result in the retention of sensitive information such as individuals' passwords and credit card numbers, and discussed potential ways to mitigate it.

References

[1] Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., & Zhang, C. (2022). Quantifying Memorization Across Neural Language Models.

[2] Bender, E., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). Association for Computing Machinery.

[3] Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., & Weston, J. (2019). Neural Text Generation with Unlikelihood Training.

[4] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., & Raffel, C. (2020). Extracting Training Data from Large Language Models.

[5] Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2021). Deduplicating Training Data Makes Language Models Better.

Interested in joining HeyJobs? Check out our open positions here.
