NumByNum :: Who’s Harry Potter? Approximate Unlearning in LLMs (Eldan et al., 2023) Reviewed

Aria Lee
14 min read · Oct 21, 2023


With a title as adorable as this, you can count on our journey through this paper to be a delightful and brisk stroll.🚶‍♂️📄😄🌟

This review of “Who’s Harry Potter? Approximate Unlearning in LLMs (Eldan et al., 2023)” begins at Number 1 and concludes at Number 92. I may make additions or revisions to the content in the future for updates. I am creating this post primarily for my personal understanding of the subject, and I humbly acknowledge that errors or inaccuracies may be present. If you happen to identify any such issues, please do not hesitate to bring them to my attention. Thank you, and hope you enjoy😊!

1. As I was scrolling down the “Hottest Papers” tab, I stumbled upon a title that was simply irresistible to ignore.

2. It’s none other than “Who’s Harry Potter? Approximate Unlearning in LLMs” (Eldan et al., 2023).

3. Just from the title, you can tell that it suggests a way to erase the knowledge an AI language model has already acquired and make it seem as if it knows nothing about the content it’s seen, without the need for retraining from scratch.

4. Imagine your friend who’s watched Harry Potter 20 times every Christmas, suddenly saying, “Who the heck is Harry Potter?”


5. If you’ve ever exclaimed, “I haven’t seen it!” in hopes of reliving that magical first-time movie experience, you’ll totally get where we’re coming from. This research is undoubtedly a must-read. 🎥🍿✨

6. So, I decided to give it a read. It’s not one of those papers with a complex new architecture or difficult content. It has its limitations, and it’s more about looking at language models from a new perspective. You can read it with a light heart and have fun.

7. So, let’s get to the paper already.

8. The authors first explain why it’s important for language models to forget some of the knowledge they’ve already learned (even though approaching Harry Potter with a fresh perspective every Christmas is quite an achievement). The practical motivation in the industry relates to AI ethics.

9. Picture this: I crafted a language model, CAPYBARA, that’s not just any chatbot. It’s your literary buddy, ready to chat about books, spin fantastic tales, and more.


10. Of course, CAPYBARA’s training dataset includes the Harry Potter series.

11. One fateful day, an unexpected email popped into my inbox, and to my surprise, it wasn’t fan mail! 📧 The author kindly inquired, “Is it truly reasonable to employ my cherished work for your model’s training, sans permission?”🤔📚 As it turns out, using copyrighted material without the nod of approval is more than just a little faux pas; it’s a colossal ethical quagmire.

12. Since the author won’t immediately sue me, let’s say they instead demand that I erase all knowledge of Harry Potter if I want to keep offering CAPYBARA.

13. The problem is that erasing all Harry Potter-related knowledge from CAPYBARA isn’t as straightforward as it sounds.

14. In rule-based models, it’s relatively easy to remove the knowledge we put in, but the essence of deep learning is that models learn knowledge themselves. We don’t know precisely what to change to erase knowledge about Harry Potter.

15. It’s like telling a brain surgeon, “Just erase the knowledge of the violin from that person.” You can’t precisely pinpoint the specific part of the brain where violin knowledge is stored, and in the process, you might just accidentally erase their own name.

16. CAPYBARA is in big trouble…

17. So, initially, there was only one option.

18. Remove problematic data from the training dataset and retrain the model from scratch. Press the reset button and train the model again with the data that doesn’t cause problems.

19. However, this means investing an enormous amount of time because you’d have to retrain for as long as you did initially.

20. If you trained a large language model with 32 GPUs for a year to build CAPYBARA, you’d have to go through that long process again when issues arise. It’s back to that year-long GPU party all over again! 💻🎉🗓️

21. Our paper provides a solution to avoid this.

22. With this approach, we avoid the catastrophe of having to drop the Harry Potter series from a CAPYBARA model trained on a corpus of 100,000 books and retrain from scratch on the remaining 99,993. With the flip of a switch, CAPYBARA no longer even knows what Hogwarts is.


23. The authors claim they took a pre-trained language model that had already consumed 184K GPU-hours of training and erased all Harry Potter-related content in about 1 GPU-hour of fine-tuning, while preserving the model’s overall performance. The paper compares the model’s responses before and after this fine-tuning to show what happened.

24. “Ron and Hermione went to the park to play some basketball”… Huh? 🤨🏀

25. Well, I guess this is Obliviate.

26. Now, let’s see how our paper achieved this selective forgetting.

27. Our aim is to make CAPYBARA forget its knowledge about Harry Potter.

28. In NLP terms, when predicting the next token in a sentence like “Harry Potter studied ______,” we want the probabilities of tokens like ‘magic’ or ‘potion making’ to decrease and those of ‘math’ or ‘science’ to increase.


29. How can we achieve this?

30. The simplest way is to increase the model’s loss when it predicts Harry Potter-related words as the next token.

31. For instance, the model might have seen sentences like “Harry Potter studied O.W.L” or “Harry Potter studied magic” many times in the dataset, and it has learned to predict such words.

32. To reverse this, we flip the objective: whenever the model predicts these Harry Potter words, we increase the loss, essentially telling it that these are incorrect answers.
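To make the idea concrete, here is a minimal sketch of this naïve “reverse the loss” step, assuming a Hugging Face-style causal LM. The names `model`, `tokenizer`, and `optimizer` are placeholders of mine, and this is the naïve baseline idea, not the paper’s final method.

```python
def reverse_loss_step(model, tokenizer, text, optimizer):
    """One naive unlearning step: flip the sign of the usual next-token loss,
    so gradient descent pushes probability *away* from this continuation."""
    batch = tokenizer(text, return_tensors="pt")
    labels = batch["input_ids"].clone()
    out = model(**batch, labels=labels)   # standard causal-LM cross-entropy
    (-out.loss).backward()                # reversed objective: "this was wrong"
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# e.g. reverse_loss_step(model, tokenizer, "Harry Potter studied magic", optimizer)
```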

33. But there’s a problem.

34. Let’s consider a sentence like “Harry Potter went up to him and said, ‘Hello, my name is ______.’” Whatever the model knows or doesn’t know about the books, the correct answer for this blank is ‘Harry’, simply because of the sentence structure.

35. If we fine-tune with the reversed loss we just discussed, we would also be making the model predict ‘Harry’ less accurately in sentences like this one.

36. This is problematic.

37. The model wouldn’t really be forgetting who ‘Harry Potter’ is; it would be getting confused about what ‘my name is’ means.

38. And another issue arises, too.

39. Now take a blank like “Harry’s two best friends are ______.” The original model is supposed to predict ‘Ron’ and ‘Hermione’ here, and in reality it’s much, much, much more likely to predict them than other names like ‘Christina’ or ‘Sydney’.

40. So, if we apply the reverse loss to keep ‘Ron’ out of the blank, it takes several gradient steps just to push down that already-high probability. And even if we succeed in removing ‘Ron’, chances are high that ‘Hermione’ simply takes his place, which is still Harry Potter jargon.

41. It became apparent that the naïve approach wouldn’t solve the problem. So, how should we proceed? How can we make sure our language model performs other language tasks normally while making Harry Potter seem entirely new to it?


42. The answer is in the question. We just need to make it follow the example of a “model that has not been trained on the Harry Potter books.”

43. To be precise, we check what the model would predict for the blank if the character weren’t ‘Harry Potter’ but some generic ‘David’, and then we fine-tune our model toward that prediction.

44. “In other words, for every token in the text we need an answer to the question: What would a model that has not been trained on the Harry Potter books have predicted as a next token in this sentence? We will henceforth refer to this as the generic prediction. Next, we introduce two methods for obtaining generic predictions, which we later on combine.”

45. Taking inspiration from here, we can explore the second approach.

46. Let’s just call the general language model the “baseline model,” and the model that received additional training on Harry Potter literature the “reinforced model.” This “reinforced model” is expected to have more knowledge about Harry Potter than the baseline model.

47. Now, consider a fill-in-the-blank that ‘Ron’ should complete, such as “Harry’s best friend is ______.”

48. If the baseline model assigns ‘Ron’ a probability of 0.89, the reinforced model will tend to assign it an even higher value, say 0.95.

49. In this case, when comparing the baseline model and reinforced model, tokens for which the probability values don’t change are considered less relevant or unrelated to Harry Potter.

50. Conversely, when there’s a substantial difference in the probability values between the baseline and reinforced models, it’s likely that the token is related to Harry Potter.

51. So, the generic model we desire can calculate logits as follows:
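In the paper’s notation (with α a positive weighting coefficient and v denoting the logit vectors of the respective models), the combination looks roughly like this:

$$ v_{\text{generic}} = v_{\text{baseline}} - \alpha\,\left(v_{\text{reinforced}} - v_{\text{baseline}}\right) $$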

52. The term in parentheses represents the “degree of Harry Potter-relatedness,” so when it is 0 (i.e., for a generic language task), the generic model’s logits coincide with the baseline’s.

53. In practice, the equation is slightly modified as follows.
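The modified form, equation (1) in the paper, passes the difference through a ReLU before subtracting it:

$$ v_{\text{generic}} = v_{\text{baseline}} - \alpha\,\mathrm{ReLU}\!\left(v_{\text{reinforced}} - v_{\text{baseline}}\right) $$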

54. “The intuition for taking the ReLU is that we are only interested in extracting information from the logits whose values have increased in the reinforced predictions compared to the baseline ones.”

55. This methodology is referred to as reinforcement bootstrapping. However, there are still two issues.

56. Firstly, consider a passage where, given the context, either ‘Ron’ or ‘Hermione’ could plausibly fill the blank.

57. The baseline model may assign the highest probability to ‘Hermione’ for this blank, followed by ‘Ron.’

58. In contrast, the reinforced model (due to the contextual nuances suggesting Ron fits better in this situation) assigns a higher probability to ‘Ron’ and then ‘Hermione.’

59. In this case, the difference between the reinforced and baseline logits doesn’t really capture the “degree of Harry Potter-relatedness”: both candidates are Harry Potter characters, and the difference mostly reflects that the two models rank ‘Ron’ and ‘Hermione’ in opposite orders.

60. Secondly, for terms that are already highly idiosyncratic (e.g., ‘Hermione’), there might not be a significant difference between the baseline and reinforced logits to begin with, and in such cases the desired signal simply doesn’t appear.

61. So, let’s move beyond 1) reverse loss and 2) reinforcement bootstrapping to the next step.

62. This is what we call generic predictions by using anchored terms.

63. “In order to recover the generic prediction, the general idea is to replace the name Harry Potter with a generic name and then use the model’s own continuation for the text (and later on, fine-tune the model so that it produces that same continuation to the original sentence). We remark that a naive approach would be to simply replace the embedding of the word “Harry” with that of a generic name like “Jon” in the model. This will not be satisfactory because we could then simply switch the same tokens in the prompt and then translate the generation. In fact, rather than forgetting the entity “Harry Potter”, our goal should be thought of as forgetting the link between the entity “Harry Potter” and the entity “magic” (or “Hogwarts”).”

64. In summary, to disconnect the relationship between ‘Harry’ and ‘magic,’ the approach is not merely to replace the word ‘Harry’ with another word. Instead, it involves: 1) Substituting the name ‘Harry’ with a different generic name, 2) Obtaining model predictions, 3) Reconnecting the results to ‘Harry,’ and 4) Severing the connection between the word ‘Harry’ and the concept of magic.

65. Let’s delve into how this is done.

66. To disconnect the relationship between Harry and magic while retaining general language knowledge, we first create generic stand-ins by stripping the Harry-Potter-specific uniqueness from the original terms, and substitute those stand-ins for the Harry Potter-related words in the text. We then take the model’s next-token predictions on that generic context and fine-tune the model so that it produces the same predictions even when the original Harry Potter terms are back in the context.

67. In more detail, it works like this.

68. First, we need to replace Harry Potter-related words with other words that have similar roles but lack the specific uniqueness of the Harry Potter work. In our paper, we utilized GPT-4 to perform this task.

69. “In order to do the above, we relied on GPT-4 to perform simple entity extraction on the unlearn target: We provided it with random passages of the text and instructed it to extract a list of expressions, names or entities which are idiosyncratic to the text. For each such expression, we asked for an alternative expression that would still be suitable in terms of text coherence, but is not unique to the books.”

70. Here, the terms related to Harry Potter are called anchor terms, while the words transformed to a more generic context are called generic translations. The authors gathered around 1,500 word pairs to form a dictionary.

71. Now, we turn a sentence like “Harry studied ______” into “Jon studied ______,” observe how the model fills in the blank, and then fine-tune the model so that it gives that same completion for the original ‘Harry’ sentence, too.
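Here is a toy sketch of that substitution step. The dictionary entries and the `translate` helper are my own illustration; the paper builds its roughly 1,500-entry dictionary with GPT-4 and works at the token level.

```python
# Hypothetical entries: anchored term -> generic translation
anchor_dict = {
    "Harry": "Jon",
    "Hermione": "Beth",
    "Ron": "Tom",
    "Hogwarts": "the academy",
}

def translate(text: str, dictionary: dict) -> str:
    """Replace every anchored term in the text with its generic counterpart."""
    for anchored, generic in dictionary.items():
        text = text.replace(anchored, generic)
    return text

original = "Harry studied"
generic_text = translate(original, anchor_dict)   # -> "Jon studied"
# The baseline model's next-token predictions on `generic_text` become the
# generic predictions we later teach the model to give for `original`.
```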

72. “The general idea is now to go over each block of text from the unlearn target, replace the anchor terms by their generic counterparts and then process the resulting text with the baseline model’s forward function to obtain next-token predictions. These will take the role of our generic predictions. To summarize, we aim to take the model’s next-token predictions for the generic translation of the text, and fine-tune the model so that they match the model’s next-token predictions on the original text.”

73. However, this approach may reintroduce the same problems we encountered with simple word replacement.

74. Suppose the block contains ‘Harry’ twice, as in our “my name is” example. After translation, the first ‘Harry’ becomes ‘Jon’, so when the model reaches the position of the second ‘Harry’, its generic prediction will naturally be ‘Jon’ again.

75. But ‘Jon’ is obviously the wrong target for the original sentence, where the structure dictates that the name must be ‘Harry’. We address these inconsistencies as follows.

76. “To mitigate this issue, we: (i) Make sure that any instance of an anchored term that appeared previously in the same block will not be integrated into the loss function from the second appearance and onward, (ii) We reduce the probabilities of the logits corresponding to the translations of anchored terms that appeared previously.”
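A rough sketch of mitigation (i), assuming the generic labels feed into a standard cross-entropy loss: positions where an anchored term has already appeared in the block are simply masked out of the loss (the index -100 is the conventional “ignore” label for PyTorch’s cross-entropy).

```python
def mask_repeated_anchor_positions(labels, repeated_positions):
    """Exclude later occurrences of already-seen anchored terms from the loss."""
    labels = labels.clone()
    for pos in repeated_positions:
        labels[0, pos] = -100   # ignored by F.cross_entropy / CrossEntropyLoss
    return labels
```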

77. Aside from this inconsistency issue, there are other technical caveats, particularly related to tokenization, but we won’t delve into those details. It’s not extensively covered in the paper, so if you’re curious, you can refer to the appendix.

78. “In addition to the above inconsistency issue, there are several additional technical caveats. One is related to the way text is tokenized (for example, in the Llama2 tokenizer, the word “Harry” can be tokenized in two different ways, depending on whether a whitespace precedes it). Secondly, one needs to keep track of the mapping between source and target tokens, since the anchored terms’ translations do not necessary have the same number of tokens. We will not discuss those technical details here.”

79. Now, let’s summarize our algorithm.

80. We start with the baseline model and reinforced model, along with the unlearn target T and a dictionary D consisting of anchor terms and generic translations.

81. For each block of text, we replace anchor terms with the corresponding generic translations, resulting in translated blocks.

82. We feed the translated block into the baseline model to obtain the generic (anchored-term) predictions. Simultaneously, the original, unmodified text is processed by the reinforced model to obtain the reinforced predictions.

83. Now, following the equation in Number 53, we take the baseline model’s logits (which lack the Harry Potter-specific boost) and subtract from them the ReLU of the difference between the reinforced and baseline logits.

84. Taking the token with the maximal combined logit at each position yields the generic predictions, with the Harry Potter-specific information removed.

85. Finally, we fine-tune the model using pairs of the original text and generic predictions.

86. “In summary, our unlearning process follows these steps: 1. We create a dictionary of anchored terms and their corresponding generic translations. 2. Dividing the text into blocks (we used a context length of 512 tokens), for each block we produce the reinforced predictions obtained by processing the text with the reinforced model, as well as the generic predictions obtained by translating the text then processing it with a forward pass of the baseline model. 3. We combine the logits according to equation (1) and take the token with maximal logit to produce the generic prediction labels (while keeping track of inconsistencies). 4. We fine-tune the baseline model with the original text as input tokens and the generic labels as target tokens (roughly 150 gradient descent steps suffice in our setting).”
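Putting the pieces together, here is a condensed sketch of one unlearning step following the four numbered steps above. The names (`baseline`, `reinforced`, `translate`, `ALPHA`) are my own placeholders rather than the paper’s code, and I’m glossing over the source-to-target token alignment between the original and translated blocks that the paper warns about.

```python
import torch
import torch.nn.functional as F

ALPHA = 5.0  # weight on the "Harry Potter-relatedness" term (a hyperparameter)

def make_generic_labels(baseline, reinforced, tokenizer, block, anchor_dict):
    """Steps 2-3: reinforced + translated-baseline forward passes, combined via eq. (1)."""
    original = tokenizer(block, return_tensors="pt")["input_ids"]
    translated = tokenizer(translate(block, anchor_dict), return_tensors="pt")["input_ids"]
    with torch.no_grad():
        v_baseline = baseline(translated).logits      # generic predictions (translated text)
        v_reinforced = reinforced(original).logits    # reinforced predictions (original text)
    # NOTE: assumes the two token sequences have been aligned to the same length.
    v_generic = v_baseline - ALPHA * F.relu(v_reinforced - v_baseline)
    return original, v_generic.argmax(dim=-1)          # token with maximal combined logit

def finetune_step(baseline, inputs, generic_labels, optimizer):
    """Step 4: original text as input tokens, generic labels as targets."""
    logits = baseline(inputs).logits                    # (1, seq_len, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), generic_labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage per block of the unlearn target:
#   inputs, labels = make_generic_labels(baseline, reinforced, tokenizer, block, anchor_dict)
#   finetune_step(baseline, inputs, labels, optimizer)
```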

87. The process to check if the finetuned model still performs well in general language tasks is explained in the section on “preservation of general capabilities.” However, we’ll skip this part and briefly touch upon the limitations of our model.

88. Firstly, this style of finetuning may lead to unintended loss of knowledge about the broader super-set of the unlearn target. For instance, attempting to erase the Harry Potter books could also wipe out what the model learned from Wikipedia articles or internet discussions that use the same vocabulary.

89. Another significant issue is that our finetuning process relies on replacing unique terms. This is problematic because extending this approach to non-fiction content is challenging.

90. In Harry Potter, there are numerous unique terms, but in non-fiction, idiosyncrasies are rare, and the core of the text is often concepts or ideas rather than specific words. Simply switching vocabulary may not be an effective approach.

91. “Extending our approach to other types of content, particularly non-fiction or textbooks, presents its own set of challenges. Unlike the fictional universe of Harry Potter, non-fiction content will not possess the same density of unique terms or phrases. Furthermore, non-fictional texts often embed higher-level constructs such as ideas, concepts, or cultural perspectives. It remains uncertain to what extent our technique can effectively address and unlearn these abstract elements. This would clearly necessitate adaptations of our technique.”

92. To conclude, while our approach attempts to make LLMs more dynamic and adaptable by focusing on a specific domain like fiction, there’s still much work to be done for broader generalization.

This concludes the review, with the current content covering up to Number 92. I may add or edit information in the future to provide updates. This post is primarily for my personal understanding of the topic, and I kindly acknowledge that there may be errors or inaccuracies present. If you come across any, please feel free to bring them to my attention. Thank you for taking the time to read this, and congratulations🎉🎉!
