Paper Sharing | Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System

Tsung-Yi, Kao
Published in IM日記
5 min read · May 4, 2023

Introduction

Large Language Models (LLMs) offer numerous advantages, but they typically come with some limitations:

  1. Most LLMs are restricted by a maximum input length
  2. The computational complexity of self-attention in the pre-training stage is too high

Although some models can handle long inputs, they may still fail to capture important contextual information. Even ChatGPT's output is affected by the accumulation of historical noise, meaning that irrelevant or outdated information can hinder the model's comprehension.

Figure 1: Even ChatGPT's comprehension can be impaired by irrelevant or outdated historical information

To address this limitation, we present the Self-Controlled Memory (SCM) system, which enables Large Language Models (LLMs) to process text of infinite length without any modification or additional training.

SCM splits the input into segments and extends the LLM with a long-term memory (archived memory), a short-term memory (flash memory), and a memory controller.

The archived memory preserves historical information, the flash memory captures the information from the previous turn, and the memory controller decides when and how to bring the information preserved in archived memory into the current turn. This lets the LLM process ultra-long text efficiently without sacrificing any important information.

To evaluate the performance of our system, we integrate the SCM with non-dialogue-optimized LLMs and successfully simulate ChatGPT.

For summarization tasks, we generate a hierarchical summary of the entire archived memory until the summary length meets the user's specifications.

Furthermore, our work is still in progress, and we plan to release a comprehensive evaluation dataset designed for long-text tasks, along with standardized human evaluations to evaluate the effectiveness of different methods.

Methodology

Figure 2 shows the SCM workflow. As the figure illustrates, SCM has three components:

  1. Language model agent
  2. Memory stream
  3. Memory controller

The workflow consists of six steps, introduced one by one below (a minimal code sketch follows step 6):
1. Input Acquisition: receive an observation (input), i.e., an ultra-long document input or a user question.

2. Memory Activation: the memory controller decides whether it is necessary to activate memory for the current user input. If memory needs to be activated, the workflow goes through steps 3 and 4 to retrieve information; if not, it jumps directly to step 5.

3. Memory Retrieval: the current input is used as a query to retrieve related memories. Memories are ranked along two dimensions: relevance and recency. Relevance measures the similarity between the observation (input) and a memory; recency measures the time elapsed since the memory was last accessed. Finally, the top-K ranked memories are retained.

4. Memory Reorganization: the controller decides whether to use the original memory directly or a summarized memory. If summarized memory is chosen, the original memory is compressed (how the compression works is explained later). The system then structurally combines the retrieved memories to serve as background information for this turn's response generation.

5. Input Fusion: a suitable prompt is designed to fuse the restructured memory with the current observation (input), which then serves as the input to the generation model.

6. Response Generation: the generated response is also stored back into the memory stream.
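
To make the six steps concrete, here is a minimal Python sketch of a single SCM turn. All class and method names (`should_activate_memory`, `reorganize`, `fuse`, and so on) are hypothetical illustrations chosen to mirror the steps above, not the authors' actual implementation.

```python
# Minimal sketch of one SCM turn. Every helper name below is hypothetical,
# chosen to mirror the six workflow steps, not taken from the paper's code.

def scm_turn(observation: str, memory_stream, controller, llm) -> str:
    # Step 1: Input Acquisition -- `observation` is the user question
    # or the next segment of an ultra-long document.

    # Step 2: Memory Activation -- decide whether history is needed at all.
    if controller.should_activate_memory(observation):
        # Step 3: Memory Retrieval -- rank by relevance and recency, keep top K.
        retrieved = memory_stream.retrieve(query=observation, top_k=5)
        # Step 4: Memory Reorganization -- full text or compressed summary.
        use_summary = controller.summary_suffices(observation)
        background = controller.reorganize(retrieved, use_summary=use_summary)
    else:
        background = ""  # skip straight to step 5

    # Step 5: Input Fusion -- merge restructured memory with the observation.
    prompt = controller.fuse(background, memory_stream.flash, observation)

    # Step 6: Response Generation -- the new turn is archived for later use.
    response = llm.generate(prompt)
    memory_stream.add(observation, response)
    return response
```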

Memory Stream

The memory stream has a designated location called the archived memory center, which can easily be accessed at high speed through cache storage and access tools such as Redis or Pinecone.

Each memory item contains an interaction index, the observation, the system response, and an embedding representing the semantics of the interaction.

One region, the activation memory, stores the retrieved memory set; another region, the flash memory, stores the memory of the previous turn (turn T-1).
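
These structures translate naturally into code. A minimal sketch, with field names of my own choosing (the paper names the fields but not an implementation); `last_accessed_turn` is assumed bookkeeping for the recency score described later:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    # Fields described in the paper: interaction index, observation,
    # system response, and a semantic embedding of the interaction.
    index: int
    observation: str
    response: str
    embedding: list[float]
    last_accessed_turn: int = 0  # assumed bookkeeping for recency scoring

@dataclass
class MemoryStream:
    archived: list[MemoryItem] = field(default_factory=list)    # archived memory center
    activation: list[MemoryItem] = field(default_factory=list)  # retrieved set for this turn
    flash: MemoryItem | None = None                             # previous turn (T-1)
```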

Memory Controller

There are three reasons for using this controller:

  1. Not every user input requires historical memory; for example, "Tell me a joke" is an input that needs no historical information
  2. The memory is enormous, so the controller is needed to retrieve and filter memories
  3. The model's input length is limited; the controller can choose between using the full text of a memory and a summary of it, so the original text can exceed the model's maximum length limit

The next two subsections present the details of the controller’s workflow and state compression implementation, respectively.

Memory Controller Workflow

The core of the controller in terms of process control is to ask two questions of the agent:
1. Is it necessary to use memory to accurately answer when executing user commands?
2. Can user commands be executed normally using only the summary of memory?

The first question prompt is shown in Figure 4, while the prompt for the second question is shown in Figure 5.
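
A minimal sketch of how this yes/no gating might look in code. The question strings paraphrase the two questions above; they are not the exact prompts of Figures 4 and 5, and `llm.generate` is a hypothetical completion call:

```python
def ask_yes_no(llm, question: str, observation: str) -> bool:
    # The controller poses a binary question to the agent and parses the reply.
    prompt = f"{question}\nUser input: {observation}\nAnswer yes or no."
    return llm.generate(prompt).strip().lower().startswith("yes")

def needs_memory(llm, observation: str) -> bool:
    # Question 1 (cf. Figure 4): is historical memory necessary at all?
    return ask_yes_no(llm, "Is it necessary to use memory to accurately "
                           "answer when executing the user command?", observation)

def summary_suffices(llm, observation: str) -> bool:
    # Question 2 (cf. Figure 5): can a summary replace the full memory text?
    return ask_yes_no(llm, "Can the user command be executed normally using "
                           "only the summary of memory?", observation)
```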

If the controller decides to use historical memory, memory retrieval is activated.
When retrieving memories, the current observation (input) is used as the query, and recency and relevance are used to evaluate each memory's rank score.
Recency: how recently the memory's turn occurred.
Relevance: similarity, implemented with embeddings and computed against the current observation. An embedding is generated for the text description of every memory; the embeddings are produced by a language model, specifically the OpenAI embedding model text-embedding-ada-002.
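
A minimal scoring sketch follows. The post does not spell out how recency and relevance are combined, so the equal-weight sum and the exponential per-turn decay below are my assumptions; the embeddings are the text-embedding-ada-002 vectors stored on each memory item:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def rank_score(query_embedding: list[float], item, current_turn: int,
               decay: float = 0.99) -> float:
    # Relevance: similarity between the current observation and the memory.
    relevance = cosine_similarity(query_embedding, item.embedding)
    # Recency: decays with the turns elapsed since the memory was last used.
    # Both the decay form and the equal weighting are assumptions.
    recency = decay ** (current_turn - item.last_accessed_turn)
    return relevance + recency

def retrieve_top_k(query_embedding, archived, current_turn: int, k: int = 5):
    ranked = sorted(archived,
                    key=lambda m: rank_score(query_embedding, m, current_turn),
                    reverse=True)
    return ranked[:k]  # retain the top K-ranked memories
```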

Memory Summarization

Memory summarization is a crucial aspect in scenarios such as document summarization, where a single interaction or dialogue turn can have a token length exceeding 3,000; summarization is what makes handling greater lengths possible. It enables stacking multiple memories into an activated memory section.

Figure 6 shows the English prompt that is specifically designed for memory summarization in individual interactions (i.e., dialogue tasks).
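
For the document-summarization use case mentioned in the introduction (a hierarchical summary of the entire archived memory until it meets the user's length budget), here is a minimal recursive sketch; `llm.summarize` and `llm.count_tokens` are hypothetical stand-ins for a call with a prompt like Figure 6's and for a tokenizer:

```python
def hierarchical_summary(texts: list[str], llm, max_tokens: int,
                         group_size: int = 4) -> str:
    # First summarize every memory individually, then repeatedly summarize
    # groups of summaries until the combined result fits the length budget.
    summaries = [llm.summarize(t) for t in texts]
    while len(summaries) > 1 and llm.count_tokens(" ".join(summaries)) > max_tokens:
        groups = [" ".join(summaries[i:i + group_size])
                  for i in range(0, len(summaries), group_size)]
        summaries = [llm.summarize(g) for g in groups]
    return " ".join(summaries)
```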

Response Generation

SCM enables a non-dialogue-optimized LLM to approach ChatGPT: a prompt composed of the activated and flash memory is enough to generate the desired response. Figure 7 shows an English prompt intended for extremely long multi-interaction dialogues.
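
A minimal sketch of assembling that final prompt from the activated and flash memory; the real template is the prompt in Figure 7, so the layout below is only an illustration:

```python
def build_generation_prompt(activated: list[str], flash, observation: str) -> str:
    # Activated memory: retrieved (and possibly summarized) background items.
    background = "\n".join(f"- {m}" for m in activated)
    # Flash memory: the immediately preceding turn (T-1), kept verbatim.
    previous = (f"User: {flash.observation}\nAssistant: {flash.response}"
                if flash else "")
    return ("Background information from earlier turns:\n"
            f"{background}\n\n"
            f"Previous turn:\n{previous}\n\n"
            f"Current user input: {observation}\nAssistant:")
```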

Experiments

Our framework is preliminarily evaluated in two scenarios: ultra-long dialogues and ultra-long document summarization. We conduct experiments to answer three research questions (RQs):

• RQ1. Can SCM system compete with or even outperform ChatGPT within a specific token limit?

• RQ2. Can SCM system scale to provide accurate responses to users’ questions, which are related to historical contexts that date back hundreds or even thousands of turns?

• RQ3. Can SCM demonstrate generalization to other scenarios, including long document summarization?

The following experiment evaluates the performance of the text-davinci-003 model without dialogue optimization in comparison to the ChatGPT-Turbo model.

Qualitative Study

  • RQ1. Can SCM system compete with or even outperform ChatGPT within a specific token limit? → Figure 1
  • RQ2. Can SCM system scale to provide accurate responses to users’ questions, which are related to historical contexts that date back hundreds or even thousands of turns? → Figure 8
  • RQ3. Can SCM demonstrate generalization to other scenarios, including long document summarization? → Figure 9

The qualitative study verifies that all three research questions (RQs) can be answered affirmatively.

Limitations and Risks

Limitations

A lack of appropriate datasets for evaluating the handling of extremely lengthy texts has resulted in our model being validated solely through manual verification.

We aim to construct a specific test set that incorporates various key indicators essential for processing long texts in diverse settings.

We will assess the efficacy of our system on more open-source models that possess single-turn instruction comprehension capability.

Risks

Our system has the capability to attach to any LLMs, which may be prone to factual errors, delusions, toxic language, and malicious responses.

Consequently, we restrict the usage of our system to academic research purposes for now.

Future Work

Our future work will focus on releasing a comprehensive test set and its manual evaluation criteria, and testing our system on various open-source models currently available.


Tsung-Yi, Kao
IM日記

Graduated from the Institute of Information Management, National Taiwan University; now working in E.SUN Bank's Intelligent Finance Division. I share bits of knowledge here, and everyone is welcome to discuss questions with me~