Adding Memory to your Streamlit Chatbot App with Chat Elements and Snowflake Cortex

Large Language Models (LLMs) are neural networks that predict the next word given a piece of input text, which makes them great at generating text for a given context. Chatbots have become a very popular, conversational way to interact with your data. They make calls to LLMs (GPT-4, Mistral, Llama, etc.) to answer the questions they receive, and techniques like Retrieval Augmented Generation (RAG) are used to provide the right context for those answers.

The bad news is that each call made to an LLM is independent of the previous one. LLMs have no short-term memory and do not remember the previous question; they answer each question independently using only their long-term memory (the data they were trained on). Therefore, the application making the calls to the LLM is responsible for remembering the conversation, or in other words, for providing the memory. For example, the popular ChatGPT tool is what provides memory when making calls to OpenAI's LLMs.
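To make this concrete, here is a minimal sketch of what "the app provides the memory" means. The `complete()` function below is just a placeholder for whatever LLM call you use (for example Snowflake Cortex COMPLETE), not a specific API:

```python
# Minimal sketch: the app, not the model, keeps the conversation.
# `complete()` is a placeholder for your LLM call of choice.

chat_history = []  # the app's short-term memory

def ask(question: str) -> str:
    # Rebuild the full conversation on every call, because the LLM
    # only sees what is inside this single prompt.
    history_text = "\n".join(f"{role}: {text}" for role, text in chat_history)
    prompt = f"{history_text}\nUser: {question}\nAssistant:"
    answer = complete(prompt)  # stateless call; no memory on the LLM side
    chat_history.append(("User", question))
    chat_history.append(("Assistant", answer))
    return answer
```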

Snowflake Cortex provides access to LLMs from Mistral AI, Meta, and others within Snowflake, so data never has to leave your security perimeter. Streamlit in Snowflake lets you build an app in a few minutes, and its Chat Elements give you the toolkit to create chat interfaces very easily. Dash Desai and I have expanded the Cortex LLM quickstart to show an example of how to build a chatbot UI that remembers previous conversations.
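As a rough sketch of how Chat Elements and session state fit together (the `answer_question()` helper here is hypothetical, standing in for the RAG logic shown later in this post):

```python
import streamlit as st

# Chat history survives Streamlit reruns because it lives in session_state,
# not in the LLM.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if question := st.chat_input("Ask about our products"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    answer = answer_question(question)  # hypothetical helper that calls Cortex
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.markdown(answer)
```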

You can try that yourself, but let's review some key concepts here. The blog post How to Give Your Chatbot More Memory discusses several approaches to storing previous conversations. In general, we want to build something like this, which knows about our products so we can ask questions:

But what is unique about that chat? If you ask the question “What is the name of the ski boots?” of any LLM, you will get a very generic answer. But this chatbot uses our own documents to answer with data relevant to our business. We have created some fake documents here with very specific (and not real) information that is unique to us.

The chatbot is providing a very specific answer, and the follow-up questions stay related to the previous ones. This is what would happen if the chatbot app did not take the chat history into account:

Here the first answer is correct, as it finds the right document, which is provided as context to the LLM. But the next question does not use the previous chat history, so it just returns context from documents related to the question “Where has been tested?”, which we can see includes documents for both bikes and ski boots.

In our chatbot app, we need to remember the previous conversation but also find the right piece (chunk) of the document that will help answer the latest question. This is the architecture developed in the quickstart guide, and you can test it yourself.
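The chunk retrieval part can be done entirely inside Snowflake with vector functions. Here is a hedged sketch using Snowpark; the table and column names (`docs_chunks_table`, `chunk`, `chunk_vec`) are assumptions, so adapt them to the chunk table you build in the quickstart:

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

def find_similar_chunks(question: str, k: int = 3):
    # Embed the question and rank document chunks by cosine similarity.
    # Table/column names are assumptions; adjust to your own schema.
    df = session.sql(f"""
        SELECT chunk,
               VECTOR_COSINE_SIMILARITY(
                   chunk_vec,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)
               ) AS similarity
        FROM docs_chunks_table
        ORDER BY similarity DESC
        LIMIT {int(k)}
    """, params=[question])
    return [row["CHUNK"] for row in df.collect()]
```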

In this version of RAG, we are using a sliding window (keeping up to a limit of the previous conversation) for two purposes:

  • Create a prompt with that history plus the last question, and ask the LLM to summarize it into a new, standalone question that includes the latest one plus the previous context. We embed that output and use it to search for similar chunks.
  • Provide that chat history in the final prompt together with the relevant chunks (see the sketch after this list).
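A minimal sketch of this flow is below. It assumes the chat history lives in `st.session_state.messages` (as in the Streamlit snippet above), reuses the hypothetical `find_similar_chunks()` from the retrieval sketch, and uses the Snowflake Cortex Python API; the prompts, window size, and model name are illustrative choices, not the quickstart's exact code:

```python
import streamlit as st
from snowflake.cortex import Complete

SLIDE_WINDOW = 7  # how many previous messages to keep (the sliding window)

def get_chat_history():
    # Only the last N messages from the Streamlit session.
    start = max(0, len(st.session_state.messages) - SLIDE_WINDOW)
    return st.session_state.messages[start:]

def make_standalone_question(chat_history, question: str) -> str:
    # Ask the LLM to fold the history into the latest question so the
    # result can be embedded and used for chunk retrieval on its own.
    prompt = f"""
        Based on the chat history and the question below, generate a query that
        extends the question with the chat history. Answer with only the query.
        <chat_history>{chat_history}</chat_history>
        <question>{question}</question>
    """
    return Complete("mistral-large", prompt)

def answer_question(question: str) -> str:
    history = get_chat_history()
    standalone_question = make_standalone_question(history, question)
    context = "\n".join(find_similar_chunks(standalone_question))
    prompt = f"""
        Answer the question using the context and the chat history.
        <chat_history>{history}</chat_history>
        <context>{context}</context>
        <question>{question}</question>
    """
    return Complete("mistral-large", prompt)
```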

If we use the previous example, we get something like this:

Therefore, we are not vectorizing “Where have been tested?” but the output of the LLM call, “What is the place where the TDBootz Special ski boots have been tested?”. That retrieves a chunk of text from our documents related to the subject we are actually referring to in the conversation.

There are many more approaches to providing memory to your chatbot, and the right one will depend on your use case. Maybe you don't want to lose any context from the questions asked, but because you are limited by the token window size, you may want to summarize the whole chat history and provide that summary as context. That would be an architecture like this:
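A hedged sketch of that idea, using the Cortex SUMMARIZE and COMPLETE functions through the Python API (prompts and model name are illustrative):

```python
from snowflake.cortex import Complete, Summarize

def compress_history(full_history: list[dict]) -> str:
    # Collapse the entire conversation into a short summary so it fits
    # in the model's context window; the summary replaces the raw history.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in full_history)
    return Summarize(transcript)

def answer_with_summary(question: str, full_history: list[dict]) -> str:
    summary = compress_history(full_history)
    prompt = f"""
        This is a summary of the conversation so far:
        <summary>{summary}</summary>
        Answer the new question: <question>{question}</question>
    """
    return Complete("mistral-large", prompt)
```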

You can even leverage Snowflake governance capabilities to store all customer interactions and use them as long-term memory, while making sure conversations remain private to each customer. The chatbot can then vectorize previous questions and conversations so they can be retrieved for new questions.
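One possible shape for that long-term memory is a governed table keyed by customer, with an embedding per turn. The table name and columns below (`chat_memory`, `customer_id`, `content_vec`, etc.) are assumptions for illustration; row access policies on such a table would keep each customer's history private:

```python
def store_turn(session, customer_id: str, role: str, content: str):
    # Persist one conversation turn together with its embedding.
    session.sql("""
        INSERT INTO chat_memory (customer_id, said_at, role, content, content_vec)
        SELECT ?, CURRENT_TIMESTAMP(), ?, ?,
               SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)
    """, params=[customer_id, role, content, content]).collect()

def recall(session, customer_id: str, question: str, k: int = 5):
    # Retrieve this customer's most relevant past turns for the new question.
    return session.sql(f"""
        SELECT content
        FROM chat_memory
        WHERE customer_id = ?
        ORDER BY VECTOR_COSINE_SIMILARITY(
            content_vec, SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) DESC
        LIMIT {int(k)}
    """, params=[customer_id, question]).collect()
```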

There are endless possibilities for providing memory to your chatbot app. You may also want to use a combination of different agents, or advanced RAG techniques that combine keyword search with semantic search.
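As one very simple flavor of that keyword-plus-semantic combination, you could filter chunks by a keyword and then rank the survivors by vector similarity. Again, table and column names are assumptions, and a production setup might instead blend separate keyword and vector scores:

```python
def hybrid_search(session, question: str, keyword: str, k: int = 3):
    # Keyword pre-filter followed by semantic ranking (a naive hybrid search).
    return session.sql(f"""
        SELECT chunk
        FROM docs_chunks_table
        WHERE CONTAINS(chunk, ?)
        ORDER BY VECTOR_COSINE_SIMILARITY(
            chunk_vec, SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) DESC
        LIMIT {int(k)}
    """, params=[keyword, question]).collect()
```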

Another topic is whether all your employees should have access to all the documents available for RAG. Snowflake governance capabilities make it possible to establish boundaries around what data can be used depending on who is asking the question. My colleague Mats Stellwall has written this nice article about Handling sensitive data with LLMs.

It is time to get your hands dirty: create a Snowflake trial account, test this yourself, and experiment with different techniques. The Build a Retrieval Augmented Generation (RAG) based LLM assistant using Streamlit and Snowflake Cortex quickstart guide is a good place to start.

Enjoy!

Carlos.-

Note: Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.
