An interview with a researcher behind MemGPT, the system enabling AI companions to have infinite memory

Chase Roberts
Vertex Ventures US
6 min read · Jan 17, 2024

Imagine having a best friend who could only remember what you told them during the last month. Any conversation beyond this month-long context window would be wiped from memory. This rolling context window describes how we work with Large Language Models (LLMs) today.

The current memory limitation of LLMs lies in the size of their context windows, which are capped at a designated number of tokens. LLMs recall past interactions by incorporating your previous conversations into the prompt for each new interaction until they eventually hit the context window limit. Despite the increasing size of these context windows in newer model generations, the challenge of limited memory still exists. A companion that only retains a few past interactions wouldn’t be much of a companion.

Source: Midjourney

On episode 7 of Neural Notes, we interview Charles Packer, co-author and creator of MemGPT: Towards LLMs as Operating Systems, which outlines a clever approach for addressing memory management. Check out this interview here:

MemGPT aims to solve memory management by mimicking how an operating system works. Operating systems enable us to work with files larger than the computer’s memory using memory hierarchies. This technique balances speed and size. The fastest memory types (registers, cache, main memory or RAM) are small, expensive, and used for immediate tasks. As we move down the hierarchy (storage drives, virtual memory), memory gets slower but increases in size and becomes less expensive, making it suitable for longer-term storage. The operating system manages these layers to ensure efficient data access and storage, providing the “illusion” that memory is larger than it physically is. You might have 50 tabs open in Google Chrome that, in aggregate, require more memory than your RAM can provide, but the operating system (and our friends working on Google Chrome) manages the different memory tiers behind the scenes so that your computer keeps working.

The creators of MemGPT observed token window limitations and wondered whether these same concepts could be applied to large language models. For example, while the GPT-4 model behind ChatGPT currently has a 128k-token context window, it wouldn’t be uncommon for a user to input a context that exceeds this window.

Memory hierarchies in LLMs are simpler than in operating systems: there are only two tiers, what fits into the context window and what doesn’t. MemGPT creates a “virtual context” window to hide the context window limitations from the user. The paper outlines two primary types of memory:

  1. Main Context: This is analogous to main memory or physical memory (RAM) in traditional computing systems. In the context of MemGPT, the main context represents the standard fixed-context window used in modern language models. Data within this main context is considered “in-context” and can be directly accessed by the LLM processor during inference.
  2. External Context: This type of memory is similar to disk memory or disk storage in traditional systems. External context refers to information stored outside the LLM’s fixed context window. For this data to be used by the LLM processor, it must be explicitly moved into the main context.
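The two tiers above can be sketched as a small Python class. This is a minimal illustration, not MemGPT’s actual implementation: the class name, the eviction policy, and the item-count budget (standing in for a real token budget) are all assumptions.

```python
from collections import deque

class VirtualContext:
    """Main context (in-context data) backed by external storage."""

    def __init__(self, max_main_items=2):
        self.max_main_items = max_main_items  # stand-in for a token budget
        self.main_context = deque()           # directly visible to the LLM
        self.external_context = []            # must be paged in before use

    def append(self, message):
        self.main_context.append(message)
        # When the main context overflows, evict the oldest message to
        # external storage -- analogous to paging RAM out to disk.
        while len(self.main_context) > self.max_main_items:
            self.external_context.append(self.main_context.popleft())

    def recall(self, keyword):
        # Search external storage and page a match back into the main
        # context so the LLM can use it during inference.
        for message in self.external_context:
            if keyword in message:
                self.append(message)
                return message
        return None

ctx = VirtualContext()
for msg in ["hi", "my favorite dessert is apple pie",
            "how are you?", "tell me a joke"]:
    ctx.append(msg)

print(len(ctx.main_context))  # 2 -- older messages were evicted
print(ctx.recall("dessert"))  # pages the dessert message back in
```

The key property mirrors the paper’s design: data in `main_context` is usable directly, while anything in `external_context` must be explicitly moved back in before the model can see it.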

To implement the memory hierarchy, MemGPT uses a “system prompt”: instructions that guide the LLM in how to interact with its memory systems. Specifically, this pre-prompt has two components:

  1. A detailed description of the memory hierarchy’s tiers and their respective utilities.
  2. A function schema, complete with natural language descriptions, that the system can call to access or modify its memory.
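A hedged sketch of what such a pre-prompt might contain, using the JSON-style function schemas common in LLM function calling. The function names and parameters here are illustrative assumptions inspired by the paper’s description, not MemGPT’s exact schema.

```python
import json

# Component 1: a natural-language description of the memory hierarchy.
MEMORY_HIERARCHY_DESCRIPTION = (
    "You have two tiers of memory. Main context is always visible to you "
    "but limited in size. External context is unlimited, but you must "
    "call functions to read from or write to it."
)

# Component 2: a function schema the LLM can call to modify its memory.
FUNCTION_SCHEMA = [
    {
        "name": "working_context_replace",  # hypothetical name
        "description": "Overwrite a stored fact in the working context.",
        "parameters": {
            "type": "object",
            "properties": {
                "old_fact": {"type": "string"},
                "new_fact": {"type": "string"},
            },
            "required": ["old_fact", "new_fact"],
        },
    },
    {
        "name": "recall_search",  # hypothetical name
        "description": "Search prior conversation turns in recall storage.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

# The pre-prompt the LLM sees combines both components.
pre_prompt = MEMORY_HIERARCHY_DESCRIPTION + "\n" + json.dumps(FUNCTION_SCHEMA, indent=2)
```

Because the schema includes natural-language descriptions, the model can choose the right memory function the same way it chooses any other tool.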

Let’s define “events” and “functions” in this context: events prompt the system to act, and functions are the actions it takes in response. Events consist of various types of interactions, such as user messages in chat applications, system messages like main context capacity warnings, user interactions (for example, alerts about a user logging in or finishing an upload), and timed events that run on a regular schedule. These events allow MemGPT to operate ‘unprompted’ without user intervention.
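The event-to-function relationship can be sketched as a tiny dispatch loop. The event names and handler responses below are illustrative assumptions, but the shape matches the description above: user messages, system warnings, and timed events all flow through the same loop.

```python
import queue

events = queue.Queue()

def handle(event):
    """Map an incoming event to the action (function) taken in response."""
    kind, payload = event
    if kind == "user_message":
        return f"replying to: {payload}"
    if kind == "memory_warning":
        return "flushing oldest messages to external context"
    if kind == "timer":
        return "reviewing main context"
    return "ignored"

# Note that only the first event comes from the user -- the system can
# enqueue its own events and act 'unprompted'.
events.put(("user_message", "hello"))
events.put(("memory_warning", "main context 90% full"))
events.put(("timer", "hourly review"))

log = []
while not events.empty():
    log.append(handle(events.get()))
```

The point of the sketch is that a user message is just one event type among several, which is what lets the system act without user intervention.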

For example, let’s say you indicated your favorite dessert was apple pie and then subsequently stated your favorite dessert was carrot cake. MemGPT would recognize this change and autonomously update its memory based on the working context. Even if a question is not answerable using in-context information, MemGPT can retrieve the answer by searching prior conversations in recall storage.
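The dessert example can be sketched as a self-directed memory edit. In the real system the LLM itself decides to emit the update as a function call; here a trivial string rule stands in for that decision, and all names are illustrative.

```python
# Working context holds current facts; recall storage keeps every message.
working_context = {"favorite_dessert": "apple pie"}
recall_storage = []

def on_user_message(message, facts):
    # A real system would have the LLM emit a memory-edit function call;
    # this keyword rule is a stand-in for that autonomous decision.
    recall_storage.append(message)
    if "favorite dessert" in message:
        new_value = message.rsplit(" is ", 1)[-1].rstrip(".")
        facts["favorite_dessert"] = new_value

on_user_message("my favorite dessert is apple pie", working_context)
on_user_message("actually, my favorite dessert is carrot cake", working_context)

print(working_context["favorite_dessert"])  # carrot cake
```

The working context now reflects the latest preference, while the full history survives in recall storage for later search.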

Another powerful feature of MemGPT is that it enables LLMs to move beyond user input as the only mechanism to generate an agent response. As Charles told us in our interview:

MemGPT introduces a higher level of abstraction where you have a continuous event loop that’s running, and a user message is one form of input that an LLM can see… you have other events like system messages and memory warnings.

The paper describes this background processing as follows:

MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor. Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context. For instance, it can decide when to move items between contexts (Figure 2) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (Figure 4).

Figure 2 | Source: “MemGPT: Towards LLMs as Operating Systems”
Figure 4 | Source: “MemGPT: Towards LLMs as Operating Systems”

You can also have timed events that, for example, instruct MemGPT to review the main context every hour and update the external context if necessary. MemGPT’s responses are also functions, meaning the LLM can review the output and decide if it wants to engage the user. For example, MemGPT could see timestamps and perceive that a user is asleep and wait to message them. To maintain consistency, the LLM can autonomously look for new information that might update its priors before re-engaging the user.
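Treating the response itself as a function makes the “wait until the user is awake” behavior easy to sketch. The sleep window below (23:00–07:00) and the function name are assumptions for illustration; the real system would infer this from timestamps in context.

```python
from datetime import datetime

def send_message(text, now, outbox, deferred):
    """Response-as-function: inspect context before actually messaging."""
    # Assumed sleep window: defer messages sent between 23:00 and 07:00.
    if now.hour >= 23 or now.hour < 7:
        deferred.append(text)
    else:
        outbox.append(text)

outbox, deferred = [], []
send_message("good news!", datetime(2024, 1, 17, 3, 0), outbox, deferred)   # deferred
send_message("good news!", datetime(2024, 1, 17, 10, 0), outbox, deferred)  # sent
```

Because the reply passes through a function rather than going straight to the user, the system gets a hook to reconsider, delay, or update its memory first.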

MemGPT is powerful when combined with other tools. For example, MemGPT has built-in query refinement. If MemGPT sifts through several pages of results without finding the relevant document (the ‘gold document’), it may stop the pagination process and ask the user to help refine or narrow down the query. This process improves search effectiveness, particularly when initial retrieval efforts do not yield the desired results. For instance, the performance of retrieval augmented generation (RAG) pipelines improves with query refinement and better retrievers. Because MemGPT bakes in query refinement, there is no need to manually iterate on queries to improve performance.
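A minimal sketch of paginated retrieval with a refinement fallback: if the gold document does not turn up within a page budget, give up on pagination and ask for a narrower query. The corpus, relevance check, and page budget are all illustrative assumptions.

```python
def paged_search(corpus, query, page_size=2, max_pages=2):
    """Page through matches; fall back to asking for query refinement."""
    hits = [doc for doc in corpus if query in doc]
    for page in range(max_pages):
        batch = hits[page * page_size:(page + 1) * page_size]
        for doc in batch:
            if "gold" in doc:  # stand-in for a real relevance check
                return ("found", doc)
        if not batch:
            break
    # Page budget exhausted without the gold document: stop paginating
    # and ask the user to narrow the query.
    return ("refine", f"no relevant result for '{query}', please narrow it")

corpus = ["alpha report", "alpha notes", "alpha memo", "alpha draft",
          "alpha gold summary"]
print(paged_search(corpus, "alpha"))       # too broad: asks for refinement
print(paged_search(corpus, "alpha gold"))  # narrower query finds it
```

The broad query burns through its page budget and triggers the refinement request; the narrowed query succeeds on the first page, which is the behavior the paper describes.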

You can imagine a future where LLMs outsource tasks beyond memory management like tool usage to other LLMs. Charles describes an example where a user might want to edit Notion documents alongside an LLM. This request would trigger a dispatch that identifies the best tools for editing Notion and loads them into working memory. Since it’s not feasible to load every tool for every task into a context window, this in-process delegation overcomes token window limitations and forecasts an infinite set of agent capabilities.
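The Notion example can be sketched as a dispatcher that loads only task-relevant tools into a limited working memory. The registry entries and the naive keyword scorer are assumptions; in practice the dispatcher might itself be an LLM call.

```python
# A registry far larger than what fits in one context window (in spirit).
TOOL_REGISTRY = {
    "notion_edit": "edit notion documents",
    "notion_search": "search notion pages",
    "calendar_read": "read calendar events",
    "email_send": "send emails",
}

def dispatch(task, max_tools=2):
    """Pick the few tools worth loading into working memory for this task."""
    # Score tools by naive keyword overlap with the task description.
    scored = sorted(
        TOOL_REGISTRY.items(),
        key=lambda kv: -sum(word in kv[1] for word in task.lower().split()),
    )
    # Load only the top matches, respecting the context budget.
    return [name for name, _ in scored[:max_tools]]

print(dispatch("help me edit Notion documents"))
```

Only the selected tool descriptions would be placed in context, which is how this delegation pattern sidesteps the impossibility of loading every tool for every task.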

Borrowing concepts from operating systems solves the memory management issue and opens the door to improved RAG pipelines and tool usage. Perhaps this is more evidence that interdisciplinary research is a reliable source of innovation. You can find MemGPT on GitHub and read the paper here. Follow Charles on Twitter/X at @charlespacker.