How to Make an LLM Remember a Conversation with LangChain

Vinayak Deshpande
14 min read · Mar 10, 2024


Photo by Harli Marten on Unsplash

We all like to converse. Our conversations with fellow humans tend to be longer, usually exchanging at least a few dialogs around certain topics. We remember what was said a few dialogs earlier and respond to / interpret the discussion in the same context. LLMs, on the contrary, are stateless by default, which means they do not keep track of the messages exchanged earlier. For them, each message received or sent is a new message. The chat interfaces built on top of large language models create an illusion that the LLM responding to our messages actually remembers our earlier dialog exchanges. Let us see how this illusion of “memory” is created with LangChain and OpenAI in this post.

First, let us see how the LLM forgets the context set during the initial message exchange.

Let us import the necessary packages. You will need an OpenAI API key. I have saved it in a .env file under my codebase and set it as an environment variable with load_dotenv. Then I import the LangChain and OpenAI modules.

import os
from dotenv import load_dotenv
load_dotenv()
from langchain_openai import OpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

Let us create a model, a prompt and a chain to start with.

# Create a model
model = OpenAI(temperature=0)

# Create a chat prompt template
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant. Answer the question asked by the user in maximum 30 words.'),
    ('user', 'Question : {input}'),
])

# Create a chain
chain = prompt | model | StrOutputParser()

Let’s start the conversation by asking the model a question. Then we will ask a follow-up question.

# Invoke the chain with an input
response = chain.invoke({'input' : 'Which is the most popular Beethoven\'s symphony?'})
print(response)
System: The most popular Beethoven's symphony is Symphony No. 9 in D minor, also known as the "Choral" symphony, which features the famous "Ode to Joy" melody.

Now let us ask a follow-up question to find out which one was Beethoven’s last symphony.

# Now, let us ask a follow up question
response2 = chain.invoke({'input' : 'Which is the last one?'})
print(response2)
System: The last one is the final item in a sequence or list. It is the one that comes after all the others.

Here, we can see that the LLM has forgotten the context set by the first question. We will get the LLM to remember the earlier conversation very soon.

If you want the LLM to remember the things discussed earlier, you have to pass all the earlier message exchanges to the LLM with each follow-up call. The LLM still treats each follow-up call as an independent one, but it now has the context readily available to answer the query accordingly. This approach has its pros and cons, which we will discuss later. A minimal hand-rolled sketch of the idea follows below; after that, let us build a simple conversation chain which has “memory”.
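Before moving to LangChain’s built-in memory, here is a small sketch of what passing the history by hand could look like, reusing the model object created above. The history list and the ask helper are illustrative names of my own, not part of LangChain.

# A hand-rolled memory: we keep the transcript ourselves and resend it
# with every new question. `history` and `ask` are illustrative names.
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

history = []  # list of (question, answer) pairs we maintain manually

manual_prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant. Answer the question asked by the user in maximum 30 words.'),
    ('user', 'Conversation so far:\n{history}\n\nQuestion : {input}'),
])
manual_chain = manual_prompt | model | StrOutputParser()

def ask(question):
    # Flatten the stored exchanges into plain text and send them along.
    history_text = '\n'.join(f'Human: {q}\nAI: {a}' for q, a in history)
    answer = manual_chain.invoke({'history': history_text, 'input': question})
    history.append((question, answer))
    return answer

print(ask("Which is the most popular Beethoven's symphony?"))
print(ask('Which is the last one?'))  # the follow-up now has the needed context

LangChain’s memory classes automate exactly this bookkeeping, so let us use them instead.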

Let us import the conversation buffer memory and conversation chain.

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

Then create a memory object and conversation chain object.

# Create a memory object which will store the conversation history.
memory = ConversationBufferMemory()

# Create a chain with this memory object and the model object created earlier.
chain = ConversationChain(
    llm=model,
    memory=memory
)

Now, let us invoke this chain for the same question as earlier. Remember, this is a new chain object.

# Invoke the chain with an input
response = chain.invoke({'input' : 'Which is the most popular Beethoven\'s symphony?'})
print(response)

Here is the output:

{
'input': "Which is the most popular Beethoven's symphony?",
'history': '',
'response': ' According to recent surveys and data analysis, the most popular Beethoven\'s symphony is Symphony No. 9 in D minor, also known as the "Choral" symphony. This symphony is known for its powerful and emotional final movement, which features a choir and soloists singing Friedrich Schiller\'s "Ode to Joy." It is often considered one of the greatest musical works of all time and is frequently performed by orchestras around the world.'
}

In the above output, spot a key: history, with an empty string as its value. This stores the conversation that has happened so far. It is empty because this was the first question we asked. Now let us ask the follow-up question and see what we get in response.

# Now, let us ask a follow up question
response2 = chain.invoke({'input' : 'Which is the last one?'})
print(response2)

And, here is the response

{
'input': 'Which is the last one?',
'history': 'Human: Which is the most popular Beethoven\'s symphony?\nAI: According to recent surveys and data analysis, the most popular Beethoven\'s symphony is Symphony No. 9 in D minor, also known as the "Choral" symphony. This symphony is known for its powerful and emotional final movement, which features a choir and soloists singing Friedrich Schiller\'s "Ode to Joy." It is often considered one of the greatest musical works of all time and is frequently performed by orchestras around the world.',
'response': ' The last symphony composed by Beethoven was Symphony No. 9, which was completed in 1824. However, he did begin work on a tenth symphony, but unfortunately passed away before it was completed. Some fragments of this unfinished symphony have been reconstructed by other composers, but it is not considered an official Beethoven work.'
}

Now spot the ‘history’ key in the follow-up response above. It holds the full content of our earlier dialog exchange. The LLM uses this as context to answer the next question. If another question is asked, the history will contain these two message exchanges and the LLM will be able to answer the next question in the same context. This approach of sending the entire conversation works, but it has downsides as well:

1. Unless you are hosting an LLM on your own, using an LLM is a metered service. You are charged based on the number of tokens you exchange with the LLM. In this approach, you send all the earlier communication to the LLM with each follow-up question, so the number of tokens you send keeps accumulating with each interaction, incurring more cost (the sketch after this list shows how quickly the history grows).

2. LLMs have a finite token limit. An LLM can accept at most as many tokens in a single call as its token limit allows. If the message history keeps accumulating, you are bound to exceed the maximum token limit of the LLM you use.

3. The more tokens you pass, the longer the LLM takes to respond. Hence, with each follow-up question, the latency of the LLM’s response increases. This is similar to making periodic contact with a spacecraft speeding away from Earth: each subsequent message and response takes longer.

4. The conversation may drift away from the topic it started with. This makes the first few messages irrelevant in the latest context; they become noise in the message history, and we can expect some weird responses from the LLM due to this noise.

5. Lastly, the larger the context you pass to the LLM, the poorer the quality of its response. The LLM may hallucinate when presented with a large chunk of input.
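To see point 1 in action, we can count the tokens sitting in the buffer memory we created above. This is a small sketch using the memory, chain and model objects from the previous example; get_num_tokens gives an approximate count, not the exact amount the API will bill.

# Sketch: watch the history grow as the conversation continues.
# `memory` (ConversationBufferMemory), `chain` (ConversationChain) and
# `model` are the objects created in the example above.
print(model.get_num_tokens(memory.buffer), 'history tokens after 2 exchanges')

response3 = chain.invoke({'input': 'When was it first performed?'})
print(model.get_num_tokens(memory.buffer), 'history tokens after 3 exchanges')
# Every follow-up resends this ever-growing history along with the new question.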

There are techniques to deal with these challenges.

Technique 1 : Pass only the last “k” message exchanges to the LLM

What if we decide to pass only the last “k” message exchanges to the LLM during our conversation? This works as long as the actual context of your current conversation does not span more than “k” exchanges. This approach is useful when you are sure that the conversation will gracefully end within “k” message exchanges. Let’s see this in action.

We need a memory from LangChain with a window of “k”: ConversationBufferWindowMemory. Let’s import that memory from LangChain and create an object of it. For demonstration purposes, we have set k = 1, so it will remember only the last message exchange.

from langchain.memory import ConversationBufferWindowMemory

# Let us create a ConversationBufferWindowMemory with k=1, which remembers only the last message exchange
window_memory = ConversationBufferWindowMemory(k=1)

Let us create a conversation chain with this memory:

window_memory_chain = ConversationChain(
    llm=model,
    memory=window_memory
)

Let’s start a conversation with this chain now. Let me introduce myself to the chain to start with.

window_memory_chain.invoke({'input' : 'Hello, my name is Vinayak and I am 34 years old'})
{'input': 'Hello, my name is Vinayak and I am 34 years old',
'history': '',
'response': " Hello Vinayak, it's nice to meet you. I am an AI programmed to have conversations with humans. I do not have a name or age as I am a digital entity. How can I assist you today?"}

Again, spot the history in the response. This is the first interaction with the model, hence the history is blank. Let’s check if it remembers my age.

window_memory_chain.invoke({'input' : 'How old I am?'})
{'input': 'How old I am?',
'history': "Human: Hello, my name is Vinayak and I am 34 years old\nAI: Hello Vinayak, it's nice to meet you. I am an AI programmed to have conversations with humans. I do not have a name or age as I am a digital entity. How can I assist you today?",
'response': ' According to my calculations, you are 34 years old. Is that correct?'}

Good job! It has remembered my age. Check the history; it contains the previous message exchange. Now, let’s see if it still remembers my name. Please note, we have configured it to remember only the last message exchange.

window_memory_chain.invoke({'input' : 'What is my name?'})
{'input': 'What is my name?',
'history': 'Human: How old I am?\nAI: According to my calculations, you are 34 years old. Is that correct?',
'response': ' I do not have access to your personal information, so I am unable to answer that question. Is there something else you would like to know?'}

Now check the contents of the history and the response we received. There is no mention of my name in the history, hence the model does not know my name and responds accordingly.
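You can also inspect the window directly without invoking the chain. This is a small sketch using the window_memory object created above; load_memory_variables returns the history string that would be injected into the prompt.

# Sketch: peek at what the window memory would hand to the LLM right now.
print(window_memory.load_memory_variables({}))
# With k=1, only the last Human/AI exchange is returned, which is why the
# name given in the very first message has already been dropped.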

Technique 2 : Pass only the latest “n” tokens of conversation history to the LLM

We can also think of passing only the last “n” tokens to the LLM. This ensures that we pass only a bounded number of tokens with each call.

Let’s import the memory type from LangChain and create a conversation chain:

from langchain.memory import ConversationTokenBufferMemory

token_buffer_memory = ConversationTokenBufferMemory(llm=model, max_token_limit=100) # default max_token_limit is 2000

token_buffer_chain = ConversationChain(
    llm=model,
    memory=token_buffer_memory
)

Let’s start the conversation and observe the history component of the response.

response = token_buffer_chain.invoke({'input' : 'Hello there. I am Vinayak, I am 34 years old and I like swimming.'})
print(response)
{
'input': 'Hello there. I am Vinayak, I am 34 years old and I like swimming.',
'history': '',
'response': " Hello Vinayak, it's nice to meet you. I am an AI programmed to have conversations with humans. I do not have an age or hobbies like humans do, but I am constantly learning and improving my abilities. Do you have any questions for me?"
}

Follow up question:

response = token_buffer_chain.invoke({'input' : 'How can I explain butterfly swimming style to a 5 year old child?'})
print(response)
{
'input': 'How can I explain butterfly swimming style to a 5 year old child?',
'history': "Human: Hello there. I am Vinayak, I am 34 years old and I like swimming.\nAI: Hello Vinayak, it's nice to meet you. I am an AI programmed to have conversations with humans. I do not have an age or hobbies like humans do, but I am constantly learning and improving my abilities. Do you have any questions for me?",
'response': " The butterfly swimming style is a type of stroke used in competitive swimming. It involves a simultaneous movement of both arms and legs, with the arms moving in a circular motion and the legs kicking together. It can be compared to the movement of a butterfly's wings, hence the name. To explain it to a 5 year old, you could say that it's like pretending to be a butterfly in the water, with your arms and legs moving together like wings."
}

Observe the history. It has the previous message exchange. Now, the next question:

response = token_buffer_chain.invoke({'input' : 'How old I am?'})
print(response)
{
'input': 'How old I am?',
'history': "AI: The butterfly swimming style is a type of stroke used in competitive swimming. It involves a simultaneous movement of both arms and legs, with the arms moving in a circular motion and the legs kicking together. It can be compared to the movement of a butterfly's wings, hence the name. To explain it to a 5 year old, you could say that it's like pretending to be a butterfly in the water, with your arms and legs moving together like wings.",
'response': ' I do not have access to your personal information, so I cannot accurately answer that question. However, based on our conversation, I would say that you are at least old enough to understand the concept of butterfly swimming.'
}

As the exchange where my age was mentioned was trimmed out of the history due to our max_token_limit=100 configuration, the LLM did not know my age when asked.
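We can verify how much of the transcript survived the 100-token cap by asking the memory object directly. This is a small sketch using the objects created above; get_num_tokens is an approximate count.

# Sketch: inspect what the token buffer memory retained after pruning.
retained = token_buffer_memory.load_memory_variables({})['history']
print(retained)
print(model.get_num_tokens(retained), 'tokens retained')  # stays within max_token_limit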

Technique 3 : Summarize the historical conversation and pass the summary to the LLM

The historical conversation can be compressed into a summary with ConversationSummaryMemory. The summary is then passed as history to the LLM. This keeps the history bounded while retaining the highlights of the discussion so far, and it saves on the number of tokens passed to the LLM during inference. It is worth noting that the memory object itself uses an LLM to generate the summary of the historical messages. Accordingly, your system makes multiple LLM calls: one to summarize the history and one to answer the question in the context of the summarized history.

It is difficult to predict which parts of the historical conversation get more weight in the summary.

Let’s create the summary memory object and the chain:

from langchain.memory import ConversationSummaryMemory

summary_memory = ConversationSummaryMemory(llm=model)

summary_memory_chain = ConversationChain(
    llm=model,
    memory=summary_memory,
    verbose=True
)

Let’s chat with this chain:

response = summary_memory_chain.invoke({'input' : 'Hello there. I am Vinayak, I am 34 years old and I like swimming.'})
print(response)
{
'input': 'Hello there. I am Vinayak, I am 34 years old and I like swimming.',
'history': '',
'response': " Hello Vinayak, it's nice to meet you. I am an AI programmed to have conversations with humans. I do not have an age or hobbies like humans do, but I am constantly learning and improving my abilities. Do you have any questions for me?"
}

This returned a blank history. Let’s ask the same follow-up question as before.

response = summary_memory_chain.invoke({'input' : 'How can I explain butterfly swimming style to a 5 year old child?'})
print(response)
{
'input': 'How can I explain butterfly swimming style to a 5 year old child?',
'history': '\nThe human introduces themselves as Vinayak and shares their age and hobby of swimming. The AI responds by introducing itself and explaining its purpose as a conversational AI. It also mentions its lack of age and hobbies, but its constant learning and improvement. The AI then invites the human to ask any questions.',
'response': " That's a great question, Vinayak! The butterfly swimming style is a type of stroke used in competitive swimming. It involves a simultaneous movement of both arms and legs, resembling the wings of a butterfly. The arms move in a circular motion while the legs kick together in a dolphin-like motion. It may be helpful to show the child a video or demonstration of the butterfly stroke to help them understand."
}

Now, look at the history in the above output. It is a summary of our earlier conversation. As mentioned earlier, we cannot determine which parts of the conversation will be included in the summary. For example, my age does not appear clearly in the summary, so the model may not be able to answer when asked about my age. Let’s try asking and look at the history.

response = summary_memory_chain.invoke({'input' : 'How old I am?'})
print(response)
{
'input': 'How old I am?',
'history': '\nVinayak introduces himself and shares his age and hobby of swimming. The AI introduces itself as a conversational AI and explains its purpose, mentioning its constant learning and improvement. The AI invites Vinayak to ask any questions and he asks about explaining the butterfly swimming style to a 5 year old child. The AI responds by describing the stroke and suggesting a video or demonstration to help the child understand.',
'response': ' You mentioned earlier that you are a human, so I do not have access to your personal information. However, you did mention that you enjoy swimming as a hobby, so I can assume that you are at least old enough to swim. Would you like me to provide more specific information about the butterfly stroke or swimming in general?'
}

As expected, it lost the information about my age during summarization and cannot answer this question. The change in history between the two responses above is worth paying attention to: all the earlier conversation is summarized in the history, although some of the information is lost.
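The running summary itself can be inspected at any time. Here is a small sketch using the summary_memory object from above; the buffer attribute holds the current summary, and keep in mind that refreshing it costs an extra LLM call on every turn.

# Sketch: print the running summary that will be sent as history next time.
print(summary_memory.buffer)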

Technique 4 : Use a vector store as memory

What if the conversations you are dealing with are usually lengthy and shift from one topic to another frequently? The techniques mentioned earlier may not work efficiently in this scenario. Here, a memory backed by a vector store works better.

VectorStoreRetrieverMemory uses a vector store as a backend and finds the conversations from the history that are most relevant to the current discussion. Let us try Chroma as our vector store here, as an example.

from langchain.vectorstores import Chroma
from langchain.memory import VectorStoreRetrieverMemory
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(collection_name='history', embedding_function=OpenAIEmbeddings())

Create your retriever and memory object

# Create your retriever 
retriever = vectorstore.as_retriever()

# Create your VectorStoreRetrieverMemory
vectorstore_retriever_memory = VectorStoreRetrieverMemory(retriever=retriever)

Let’s follow a different route here to create a lengthy conversation history: preload the memory with historical conversations without actually having them. This is useful for setting a context in advance.

vectorstore_retriever_memory.save_context({'input':'My name is John'}, {'output':'Hello John! How can I help you?'})
vectorstore_retriever_memory.save_context({'input':'I am looking for some vegetarian restaurants nearby.'}, {'output':'There is a veg restaurant named `Veg Treat` at walking distance from here.'})
vectorstore_retriever_memory.save_context({'input':'Oh, great! Thanks for the recommendation. Is Veg Treat known for any particular dish? I\'m always on the lookout for delicious vegetarian options.'}, {'output':'Absolutely, John! Veg Treat is well-known for its signature dish, the "Mushroom Medley Delight." It\'s a savory combination of assorted mushrooms cooked with aromatic herbs and spices. Many patrons love it for its unique flavor. If you enjoy mushrooms, I\'d highly recommend trying it out!'})
vectorstore_retriever_memory.save_context({'input':'That sounds amazing! I\'m a big fan of mushrooms, so I\'ll definitely give the Mushroom Medley Delight a try. Is there anything else you\'d recommend at Veg Treat, perhaps a special dessert or a refreshing drink?'}, {'output':'Certainly, John! Their "Green Bliss Smoothie" is a popular choice among customers. It\'s a refreshing blend of spinach, kiwi, and mint, creating a delightful and healthy beverage. And for dessert, the "Chocolate Avocado Mousse" is a must-try - a rich, creamy treat that surprises many with its unique combination of flavors.'})

vectorstore_retriever_memory.save_context({'input': 'What is the weather like today?'}, {'output': 'The weather forecast predicts a sunny day with a high of 75 degrees Fahrenheit.'})
vectorstore_retriever_memory.save_context({'input': 'Do you have any jokes to tell?'}, {'output': 'Why did the scarecrow win an award? Because he was outstanding in his field! '})
vectorstore_retriever_memory.save_context({'input': 'Can you translate "hello" to Spanish?'}, {'output': 'Hola! "Hola" means "hello" in Spanish.'})
vectorstore_retriever_memory.save_context({'input': 'What are some interesting historical facts?'}, {'output': 'Did you know the population of the Earth is estimated to be over 8 billion people! Fascinating, right?'})

Create a conversation chain

vectordb_memory_chain = ConversationChain(
    llm=model,
    memory=vectorstore_retriever_memory,
    verbose=True
)

Now, we will ask a question in the context of the above conversation.

vectordb_memory_chain.invoke({'input' : 'what was the veg restaurant name suggested?'})

Here is the screenshot of the output

The vectorstore_retriever_memory took the new question, searched the earlier conversations, and selected only the four most relevant exchanges to pass as history to the LLM. The LLM then did a perfect job of answering the question.
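You can also query the memory object directly to see which past exchanges were retrieved. This is a small sketch using the objects above; the retriever returns only the exchanges most similar to the question, not the whole transcript.

# Sketch: ask the vector-store memory what it would pass as history
# for this question; only the most similar saved exchanges come back.
retrieved = vectorstore_retriever_memory.load_memory_variables(
    {'input': 'what was the veg restaurant name suggested?'}
)
print(retrieved['history'])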

Conclusion

This article discussed that LLM calls are stateless: LLMs do not remember the earlier conversational context by default. However, our prompts can be augmented with a “memory” of the earlier conversation. We discussed the basic conversation buffer memory and four other variations of memory in LangChain. LangChain supports other types of memory as well; refer to the LangChain documentation for more details.

Thanks a lot for reading up to here. I hope you liked this. Your comments on this post will mean a lot to me. Comments, likes, shares and criticism are welcome!
