🤯AI: Memory & Recall in Large Language Models

Zoiner Tejada
3 min read · May 4, 2023


Large language models are like “50 First Dates”

[This article is a part of the 🤯AI series]

Always forgetting, up to a point

I like to describe large language models as characters. In fact, I like to think of them as Lucy, the character Drew Barrymore plays in the movie “50 First Dates”. In that movie, Adam Sandler’s character, Henry Roth, finds in Lucy the girl of his dreams and discovers she suffers from a severe case of short-term memory loss. In fact, after a certain point in her life she stops forming new long-term memories altogether. Henry Roth, who marries her, sweetly creates a video to remind her of their life together to date. Every morning, when she wakes up not knowing who this person is, they watch the video together so she can get caught up on everything she forgot when she went to sleep.

Movie poster from 50 First Dates (https://www.imdb.com/title/tt0343660/)

Large language models are like Lucy.

Irrespective of how many billions of parameters a large language model has, it has two forms of “memory”: parametric memory and nonparametric memory.

Parametric memory is effectively the set of memories encoded in the model’s weights during training. This “knowledge” is only as current as the most recent data the model was trained against.

With GPT-4 available in Azure OpenAI Service (and from OpenAI), this cut-off on long-term memory is well documented.

So according to the documentation, the model was trained with data up until September 2021. This means the model doesn’t know who the current president is, the current price of MSFT stock, or any current global events. Just as for Henry Roth, this can present a problem!

Comically, the solution for catching a model up on recent events is almost exactly what Henry Roth did for Lucy: before the day starts, show her a summary of what she needs to know for the day.

With large language models, this becomes a prompt engineering technique called context stuffing: along with your query, you provide text that gives the model whatever context it needs to be “current”. This is the transient, nonparametric memory that you supply. You have to show this context to the model every time you prompt it.
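In its simplest form, context stuffing is just string assembly: the fresh context goes in first, followed by the user’s question. Here is a minimal sketch in Python (the template wording and delimiters are my own, not any fixed standard):

```python
# A minimal context-stuffing template. The wording and delimiters are
# illustrative; there is no required format.
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question:
{question}
"""

def build_prompt(context: str, question: str) -> str:
    # The stuffed context rides along with every prompt, acting as the
    # model's transient, nonparametric memory for this one exchange.
    return PROMPT_TEMPLATE.format(context=context, question=question)
```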

Here is an example of how you might structure such a prompt to answer questions about recent cryptocurrency news.

Source: author / Solliance

In the above example, I gave GPT-3.5 a paragraph from a recent news article as context, providing it with up-to-date information. Without this context, the model’s response would have stated that there were two such addresses, which was true according to its training data from back in September 2021.
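To show what that looks like in code, here is a hedged sketch of the same pattern against the chat completions API. It assumes the pre-1.0 openai Python package; the context and question strings are placeholders, not the actual article excerpt from my example:

```python
import openai

# Placeholders; substitute the real article excerpt and user question.
context = "<paragraph from a recent news article with the current figures>"
question = "<the user's question about the current state of affairs>"

# Assumes the pre-1.0 openai package and an OpenAI gpt-3.5-turbo model.
# For Azure OpenAI you would also set api_type, api_base, and api_version,
# and pass your deployment via engine instead of model.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # keep the answer grounded in the stuffed context
    messages=[
        {
            "role": "system",
            "content": "Answer using only the context provided by the user.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion:\n{question}",
        },
    ],
)

print(response.choices[0].message.content)
```

Notice that nothing about the model changes here; its weights, and the September 2021 cut-off, stay exactly as they were. The “memory” lives entirely in the prompt, which is why it has to be re-sent on every call.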

Obviously, exactly how you identify what context to stuff based on the user’s query is its own challenge, and a topic I will cover in a future post.

What about fine-tuning?

This is also a topic we will explore in more depth in a future post. The short answer is that, currently, these models either cannot be fine-tuned at all, or, for the versions of GPT that can be fine-tuned, you end up using older models that sacrifice the InstructGPT capabilities that make them so powerful with carefully crafted prompts. In my experience, simply put, they perform worse. A lot worse.

Kind of 🤯, right?
