Building a memory layer for GPT using Function Calling

Simon Attard
7 min read · Jun 20, 2023

It is now easy to build a memory store using the new GPT function calling feature in conjunction with a vector store such as Chroma.

The OpenAI GPT 3.5 / 4 ‘0613’ (13th June) model updates shipped a very powerful new feature called function calling, which makes integration with your applications much easier and far more extensible.

First of all, the new models have been further fine-tuned to decide when to call external functions, which parameters to pass, and how to use the results returned.

The function calling feature allows you to pass a parameter describing your application’s function signatures to the LLM, which will then decide when and how to use your functions.

You can modify which functions the LLM can call at different times throughout your application’s interactions with the LLM.

A very basic example of this would be to define a function in Python which simply returns the current date and time based on a location parameter. You can then configure GPT to decide when to call this function if it requires the current time. For instance, the function would be called if you ask GPT to set a task due date and it needs today’s date to calculate it.
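As a reference, the description you would pass to the model for such a function could look something like the sketch below (the function name and parameter are purely illustrative, not part of any OpenAI API):

import datetime

# Illustrative helper: return the current date and time for a given location.
# (A real implementation would resolve the location to a proper time zone.)
def get_current_datetime(location):
    return datetime.datetime.now().isoformat()

# JSON-schema style description of the function, passed to the model
# via the functions parameter of the chat completion request.
get_current_datetime_spec = {
    "name": "get_current_datetime",
    "description": "Get the current date and time for a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city or region to get the current date and time for"
            }
        },
        "required": ["location"]
    }
}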

Flow of Function Calling

The diagram above shows the flow of building with function calling:

  1. The application sends a prompt to the LLM for completion.
  2. The message also includes a functions parameter, listing the available functions and how to use them.
  3. The message also indicates whether a specific function must be called, or whether the model can decide for itself (function_call set to auto).
  4. The model responds with a completion whose finish_reason is either stop or function_call:
    • stop: the completion is ready and the model does not need to call any of the functions.
    • function_call: the response contains the name of the function to call, along with its arguments (an abridged example of such a response is shown below).
  5. The application layer uses the function name and arguments to invoke the function and get the return object.
  6. The return object is serialised into JSON and added to a new message with a function role. This is sent once again to the model for completion.
  7. Steps 2, 3 and 4 are repeated until a stop completion is received from the model.
  8. The final completion is processed by the application layer and sent back to the user.
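For reference, an abridged function_call completion looks roughly like the sketch below (the memory text is just an illustration). Note that arguments arrives as a JSON-encoded string which the application needs to parse:

# Abridged shape of a completion whose finish_reason is "function_call".
example_response = {
    "choices": [
        {
            "finish_reason": "function_call",
            "message": {
                "role": "assistant",
                "content": None,
                "function_call": {
                    "name": "save_memory",
                    # arguments is a JSON-encoded string, not an object.
                    "arguments": "{\"memory\": \"The user's birthday is in June\"}"
                }
            }
        }
    ]
}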

Building the Memory Layer

In this article I will give a very basic example of how to give GPT 3.5 or GPT 4 memory which can span across user sessions. This example will use Chroma as a vector store for these memories and function calling to allow GPT to ‘decide’ when to store or retrieve memories. We will also allow the memories to be retrieved based upon semantic cosine similarity.

By using a vector database and cosine similarity, we could for example retrieve memories where food is mentioned when prompting GPT with a phrase such as ‘I am hungry’. The vector embedding of the phrase ‘I am hungry’ will be close to the embeddings of phrases that mention words like apple or lunch.

Important Note: This blog post only aims to give a high level overview of a technique and does not make any recommendations on libraries or code. Do your own research on safety, licensing and suitability before using any techniques, code snippets, libraries or APIs referenced here.

# We will be using the cosine similarity implementation provided by Chroma.
# The code below demonstrates how cosine similarity can be computed
# for vectors A and B. (This is not the Chroma implementation.)

import numpy as np
from numpy.linalg import norm

def cosine_similarity(A, B):
    cosine = np.dot(A, B) / (norm(A) * norm(B))
    return cosine
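As a quick sanity check of the formula: vectors pointing in the same direction score 1, while orthogonal (unrelated) vectors score 0.

a = np.array([1.0, 2.0, 3.0])
b = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(cosine_similarity(a, a))  # ~1.0 (identical direction)
print(cosine_similarity(a, b))  # 0.0 (orthogonal / unrelated)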

Basic Setup

  • Chroma DB for storing vector collections
  • OpenAI ada v2 model (text-embedding-ada-002) to get vector embeddings of strings
  • OpenAI GPT 3.5 Turbo / GPT 4 ‘0613’ model versions for function calling

Step 1 — Setup Chroma and create a vector collection

import uuid
import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/store/mem_store"
))

def get_or_create_collection(collection_name):
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    return collection

In the code block above we simply set up the Chroma client and add boilerplate code to get or create a vector collection. We also specify where to store the data on disk. The only important thing to note is that we are specifying cosine similarity as the metric for retrieving similar vectors.

Step 2 — Define a function to generate embeddings

import openai
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

def get_embedding(text, model="text-embedding-ada-002"):
    # We can use the tiktoken tokenizer to count tokens and ensure the model's
    # token limit is not exceeded. In this case we simply pass the text to the ada v2 model.
    tokens = tokenizer.encode(text)

    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

In this code block we simply call the OpenAI embeddings API with the ada v2 embedding model (which OpenAI now recommends over the obsolete v1 models for both search and similarity use cases).

The function simply takes a string and returns a vector embedding representing it. It is important that this function is used for all memory embeddings as well as search queries going forward.

(We could simply pass Chroma the strings and let it generate the embeddings automatically, but ada-002 should give us better results).
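If you did prefer to let Chroma generate the embeddings itself, it ships an OpenAI embedding function wrapper that can be attached to a collection. A rough sketch, assuming the chromadb version used here exposes embedding_functions.OpenAIEmbeddingFunction:

from chromadb.utils import embedding_functions

# Ask Chroma to call the ada v2 model itself whenever documents are added or queried.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="memories",
    metadata={"hnsw:space": "cosine"},
    embedding_function=openai_ef
)

# With an embedding function attached, documents can be added without
# pre-computed embeddings.
collection.add(documents=["I had an apple for lunch"], ids=[str(uuid.uuid4())])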

Step 3 — Define a function to store memories

def add_vector(collection, text, metadata):
    id = str(uuid.uuid4())
    embedding = get_embedding(text)

    collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata],
        ids=[id]
    )

def save_memory(memory):
    collection = get_or_create_collection("memories")
    add_vector(collection, memory, {})
    chroma_client.persist()

Above we simply get / create the collection, generate an embedding and then store the vector. The chroma client needs to be explicitly told to persist the data to disk.

A UUID is generated for each memory, but the metadata is omitted for simplicity. In practice you would add metadata key-value pairs to represent properties such as a memory timestamp, expiry date or context.
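As an illustration, a hypothetical variant of save_memory that attaches a timestamp and a context label might look like this:

from datetime import datetime, timezone

# Hypothetical variant of save_memory that stores metadata alongside the vector.
def save_memory_with_metadata(memory, context="general"):
    collection = get_or_create_collection("memories")
    metadata = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "context": context
    }
    add_vector(collection, memory, metadata)
    chroma_client.persist()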

Step 4 — Define a function to retrieve memories

def query_vectors(collection, query, n):
    query_embedding = get_embedding(query)

    return collection.query(
        query_embeddings=[query_embedding],
        n_results=n
    )

def retrieve_memories(query):
    collection = get_or_create_collection("memories")
    res = query_vectors(collection, query, 5)

    print(">>> retrieved memories: ")
    print(res["documents"])
    return res["documents"]

To retrieve memories, we simply pass the query, convert it into a vector embedding and search the vector store. The only parameter which would need to change here is n, which defines how many nearest neighbours (i.e. results) to return.
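For example, assuming a couple of memories have already been saved, a query such as the one below should surface the food-related memory first:

save_memory("I had an apple for lunch today")
save_memory("My project deadline is next Friday")

# 'I am hungry' is semantically close to 'apple' and 'lunch',
# so the food memory should be ranked first.
retrieve_memories("I am hungry")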

Step 5 — Setup OpenAI completion and helper methods

In the code blocks below, the first function, process_input, gets completions from the model and processes the output. It checks the finish_reason, and if it finds a function_call response it passes the returned arguments to the correct function and makes a subsequent completion request, repeating until the model returns stop.

The second function, get_completion, is standard OpenAI completion boilerplate, but it shows how the callable functions are defined and passed to the model.

import json

def process_input(user_input):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_input}
    ]
    response = get_completion(messages)

    # Keep calling functions until the model returns a normal (stop) completion.
    while response.choices[0]["finish_reason"] == "function_call":
        function_name = response.choices[0].message["function_call"]["name"]
        function_parameters = response.choices[0].message["function_call"]["arguments"]
        arguments = json.loads(function_parameters)  # arguments arrive as a JSON string
        function_result = ""

        if function_name == "save_memory":
            function_result = save_memory(arguments["memory"])
        elif function_name == "retrieve_memories":
            function_result = retrieve_memories(arguments["query"])

        messages.append(
            {
                "role": "assistant",
                "content": None,
                "function_call": {"name": function_name, "arguments": function_parameters},
            }
        )
        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": json.dumps({"result": str(function_result)})
            }
        )
        response = get_completion(messages)

    reply_message = response.choices[0].message["content"].strip()
    print(reply_message)
    return reply_message
def get_completion(messages):

    functions = [
        {
            "name": "save_memory",
            "description": """Use this function if I mention something which you think would be useful in the future and should be saved as a memory.
            Saved memories will allow you to retrieve snippets of past conversations when needed.""",
            "parameters": {
                "type": "object",
                "properties": {
                    "memory": {
                        "type": "string",
                        "description": "A short string describing the memory to be saved"
                    },
                },
                "required": ["memory"]
            }
        },
        {
            "name": "retrieve_memories",
            "description": """Use this function to query and retrieve memories of important conversation snippets that we had in the past.
            Use this function if the information you require is not in the current prompt or you need additional information to refresh your memory.""",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The query to be used to look up memories from a vector database"
                    },
                },
                "required": ["query"]
            }
        },
    ]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        # model="gpt-4-0613",
        messages=messages,
        functions=functions,
        function_call="auto",
        max_tokens=200,
        stop=None,
        temperature=0.5
    )

    return response

Conclusion

By running the code above, we can test the python script and see how and when the GPT model uses the functions defined to store and retrieve memories.

For example, if I now ask GPT to remember the day of my birthday, it will automatically choose to call the save_memory function and save that information for future use.

I can now start a new session, or point to a different model, and ask whether it remembers when my birthday is — and it will automatically choose to call the retrieve_memories function. It also passes a suitable search query related to birthdays.

Finally, if I ask a question for which it does not need to access the external memory store, it will choose not to call any functions and simply respond immediately.
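Putting it all together, a quick interactive test might look like the following (the prompts are illustrative):

# The model should decide to call save_memory here.
process_input("Please remember that my birthday is on <date>.")

# In a new session (or with a different model), the model should call
# retrieve_memories with a birthday-related query before answering.
process_input("Do you remember when my birthday is?")

# No external memory needed, so the model should reply directly.
process_input("What is the capital of France?")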


Simon Attard

Tech company co-founder, software developer and product manager. Writing about AI, LLMs and software development. www.linkedin.com/in/simonattard/