How to Create Your Own GPT Voice Assistant with Infinite Chat Memory in Python

Jordan Gibbs
Nov 14, 2023

What if you could securely chat with an AI that would never forget anything you said? Imagine keeping an audio journal every day for years, and having a chatbot friend to help you recall, understand, or just listen. Now, with OpenAI’s new cloud assistant feature, you can.

The new assistant API has auto-vectorization and smart context handling. This means that you can have near-infinite chat memory and recall, and you won’t be breaking the bank to do it. Let’s build one!

What is an Assistant Anyway?

OpenAI assistants are a way to interact with an LLM with persistent session data. This means that if you call a pre-defined thread of messages and an assistant ID, the assistant will “remember” your messages and its replies.

To use OpenAI’s terminology, assistants have “threads” which allow them to store and access message history. An assistant and a thread can be “run” together, which instructs the AI to work with a specific set of messages.
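Condensed into code, the whole lifecycle is just a handful of calls (a sketch assuming the openai v1 Python SDK and a client created with openai.OpenAI(), which we'll set up properly below):

# Create an assistant and a thread once, then reuse their IDs indefinitely.
assistant = client.beta.assistants.create(name="Friend", model="gpt-4-1106-preview")
thread = client.beta.threads.create()

# Each turn: append a message, run the assistant on the thread, read the reply.
client.beta.threads.messages.create(thread.id, role="user", content="Hello!")
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# ...poll run.status until it reads "completed", then:
replies = client.beta.threads.messages.list(thread_id=thread.id)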

You might be asking, “Why can’t I just use ChatGPT for this?” Currently, you can talk to ChatGPT vocally, and it can talk back. However, it’s not possible to have a coherent and persistent conversation with it over time. Unfortunately, it will inevitably run out of context window and forget critical conversational details. Here is a short list of the benefits the API provides over ChatGPT:

  • Secure chatting — OpenAI won’t train on any data inputted into the API, so your data is secure.
  • “Infinite” memory — Your messages are automatically vectorized and stored in the cloud. This means that your assistant will never run out of memory. You could chat with it every day, and it would still remember what you cooked for dinner 2 years prior. Vector databases and LLM retrieval are by no means perfect, but it’s the best solution we have until the advent of models with infinite context windows.
  • Local Storage — You can store your message history in plaintext locally on your computer, and you can retrieve these messages at any time from the cloud.
  • Flexibility — You’re using the API, so you can define your input and output interfaces. The possibilities here are endless: you could design a system that allows you to switch between text and audio input easily, or maybe you could replace the voice with one from ElevenLabs (this code uses the new OpenAI TTS).

Diving into the Code

Now that you know what an assistant is, let’s kick things off. In addition to Python and your IDE of choice, you’ll need a few things before we begin:

  • An OpenAI API key — If you don’t have access to GPT-4 yet, that’s quite alright; this system also works with GPT-3.5-turbo. This implementation is built entirely within the OpenAI ecosystem, including the voice-to-text and text-to-voice models.
  • An installation of ffmpeg — Getting ffmpeg and ffplay to play nice can be a bit tough, but here is a good guide for doing so.
  • Assorted Python packages — these include sounddevice and pynput (wave, which we also use, ships with the Python standard library).

First of all, let’s create an assistant. This is easily done via this simple function. We’ll also import all required packages here:

import openai
import json
from pynput import keyboard
import wave
import sounddevice as sd
import time
import os
import subprocess
import datetime as dt


# The client picks up your OPENAI_API_KEY environment variable automatically.
client = openai.OpenAI()


def setup_assistant(client, assistant_name):
    # This function creates a new assistant with the OpenAI Assistants API.
    assistant = client.beta.assistants.create(
        name=assistant_name,
        instructions=f"""
        You are a friend. Your name is {assistant_name}. You are having a
        vocal conversation with a user. You will never output any markdown
        or formatted text of any kind, and you will speak in a concise,
        highly conversational manner. You will adopt any persona that the
        user may ask of you.
        """,
        model="gpt-4-1106-preview",
    )
    # Create a thread to hold this assistant's message history.
    thread = client.beta.threads.create()
    return assistant.id, thread.id

Don’t be afraid to change the prompt here, but I’ve found this one optimal for keeping the voice replies short. Calling the assistant a “friend” actually reduces the likelihood that it will try to be overly helpful, which can come across as annoying and pushy.

Next, let’s define our message-sending and “run” functions. These let you run a specific assistant with a specific thread and append your messages to the thread at hand.

def send_message(client, thread_id, task):
    # This function sends your voice message into the thread object, which then gets passed to the AI.
    thread_message = client.beta.threads.messages.create(
        thread_id,
        role="user",
        content=task,
    )
    return thread_message


def run_assistant(client, assistant_id, thread_id):
    # Runs the assistant with the given thread and assistant IDs.
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
    )

    # Poll until the run leaves the queue and finishes processing.
    while run.status in ("queued", "in_progress"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id,
            run_id=run.id,
        )

    if run.status == "completed":
        return client.beta.threads.messages.list(
            thread_id=thread_id
        )
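One caveat: run_assistant returns None if the run ends in any terminal state other than completed (the Assistants API can also report failed, cancelled, expired, or requires_action). If you’d rather fail loudly than crash later on a None, a slightly more defensive variant might look like this (a sketch, using the same polling logic):

def run_assistant_safe(client, assistant_id, thread_id, poll_seconds=1):
    # Like run_assistant, but raises on any unsuccessful terminal state.
    run = client.beta.threads.runs.create(thread_id=thread_id, assistant_id=assistant_id)
    while run.status in ("queued", "in_progress"):
        time.sleep(poll_seconds)
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run.id)
    if run.status != "completed":
        # "requires_action" means the model wants to call a tool, which this
        # simple chatbot never registers, so it's treated as an error too.
        raise RuntimeError(f"Run ended with status: {run.status}")
    return client.beta.threads.messages.list(thread_id=thread_id)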

Now that we have created our assistant, thread, and message functions, we need a way to store our sessions. The solution I landed on is a JSON file in the working directory that stores your session and assistant IDs. This makes it easy to choose which assistant you want to talk to; you can also name them to help you remember who is who.

When you want to chat with one of your assistants, run the main loop code found at the end of this article, and you’ll be asked to choose an existing session from the set that is printed to the console or to make a new assistant. Here is the function that saves your assistant’s data locally:

def save_session(assistant_id, thread_id, user_name_input, assistant_voice, file_path='chat_sessions.json'):
    # This function saves your session data locally, so you can easily retrieve it from the JSON file at any time.
    if os.path.exists(file_path):
        with open(file_path, 'r') as file:
            data = json.load(file)
    else:
        data = {"sessions": {}}

    # Find the next session number
    next_session_number = str(len(data["sessions"]) + 1)

    # Add the new session
    data["sessions"][next_session_number] = {
        "Assistant ID": assistant_id,
        "Thread ID": thread_id,
        "User Name Input": user_name_input,
        "Assistant Voice": assistant_voice
    }

    # Save data back to file
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=4)
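After a couple of sessions, chat_sessions.json looks something like this (the IDs and voices here are placeholders):

{
    "sessions": {
        "1": {
            "Assistant ID": "asst_0000",
            "Thread ID": "thread_0000",
            "User Name Input": "William",
            "Assistant Voice": "onyx"
        },
        "2": {
            "Assistant ID": "asst_0001",
            "Thread ID": "thread_0001",
            "User Name Input": "Colby",
            "Assistant Voice": "nova"
        }
    }
}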

Two additional functions work alongside save_session, forming the session backup and recall system:

def display_sessions(file_path='chat_sessions.json'):
    # This function shows your available sessions when you request it.
    if not os.path.exists(file_path):
        print("No sessions available.")
        return

    with open(file_path, 'r') as file:
        data = json.load(file)

    print("Available Sessions:")
    for number, session in data["sessions"].items():
        print(f"Session {number}: {session['User Name Input']}")


def get_session_data(session_number, file_path='chat_sessions.json'):
    # This function retrieves the session that you choose.
    with open(file_path, 'r') as file:
        data = json.load(file)

    session = data["sessions"].get(session_number)
    if session:
        return session["Assistant ID"], session["Thread ID"], session["User Name Input"], session["Assistant Voice"]
    else:
        # Return four Nones so the caller's four-way unpacking doesn't raise a ValueError.
        print("Session not found.")
        return None, None, None, None

The above three functions help create the rudimentary “UI” that allows you to interact with your assistants and conversations in a straightforward manner.

This next function retrieves and saves your message history. Whenever you say “exit” during a chat, it is called automatically to scrape your messages from the cloud and save them into a text file:

def collect_message_history(thread_id, user_name_input):
    # This function downloads and writes your entire chat history to a text file, so you can keep your own records.
    messages = client.beta.threads.messages.list(thread_id)
    message_dict = json.loads(messages.model_dump_json())

    with open(f'{user_name_input}_message_log.txt', 'w') as message_log:
        for message in reversed(message_dict['data']):
            # Extract the text value from the message
            text_value = message['content'][0]['text']['value']

            # Add a prefix to distinguish between user and assistant messages
            if message['role'] == 'assistant':
                prefix = f"{user_name_input}: "
            else:  # Assume any other role is the user
                prefix = "You: "

            # Write the prefixed message to the log
            message_log.write(prefix + text_value + '\n')

    return f"Messages saved to {user_name_input}_message_log.txt"
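The resulting log is plain text, oldest message first, e.g.:

You: It is now 2023-11-14 21:05. Hello?
Colby: Hi again! How are you doing tonight?
You: I'm fine thanks.
Colby: That's really good to hear! What are you up to?

(The timestamp appears because the main loop, shown later, prepends the date and time to the first message of each session.)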

We’ll also need to set up our voice input and output functions. As I mentioned earlier, both stay within the OpenAI ecosystem. We’ll use Whisper for the voice-to-text:

def whisper():
    # This function uses OpenAI's Whisper speech-to-text model to convert your voice input to text.
    record_audio()
    with open("user_response.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

You’ll notice that this function calls record_audio, which starts and stops recording from your default microphone on a keystroke (in this case, the “Page Down” key). Here it is:

def record_audio(duration=None):
    # This function allows you to record your voice with a press of a button, right now set to 'page down'. You could
    # also bypass the keyboard input logic to consistently talk to the AI without pressing a button.
    CHUNK = 1024
    FORMAT = 'int16'
    CHANNELS = 1
    RATE = 10000
    WAVE_OUTPUT_FILENAME = "user_response.wav"

    frames = []
    stream = None
    is_recording = False
    recording_stopped = False

    def start_recording():
        nonlocal frames, stream
        frames = []

        stream = sd.InputStream(
            samplerate=RATE,
            channels=CHANNELS,
            dtype=FORMAT,
            blocksize=CHUNK,
            callback=callback
        )

        stream.start()

    def callback(indata, frame_count, time_info, status):
        # Collect audio blocks while the recording flag is set.
        if is_recording:
            frames.append(indata.copy())

    def stop_recording():
        nonlocal recording_stopped

        stream.stop()
        stream.close()

        wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(2)  # int16 samples are 2 bytes wide
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()
        recording_stopped = True

    def on_key(key):
        nonlocal is_recording

        if key == keyboard.Key.page_down:
            if not is_recording:
                start_recording()
                is_recording = True
            else:
                stop_recording()
                is_recording = False

    listener = keyboard.Listener(on_press=on_key)
    listener.start()

    start_time = time.time()
    while listener.running:
        if recording_stopped:
            listener.stop()
        elif duration and (time.time() - start_time) > duration:
            listener.stop()
        time.sleep(0.01)

If you wanted to, you could rework the function above so that no keystroke is needed, letting you talk to your AI seamlessly; a minimal fixed-duration variant is sketched below. For ease of use, though, I implemented it so that a key press both starts and stops the recording.
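As a sketch of that hands-free idea (keeping the same mono, 16-bit, 10 kHz setup as above), sounddevice’s blocking rec/wait API can record for a fixed duration with no keyboard listener at all. A truly seamless loop would need voice activity detection on top, but the recording mechanics are the same:

def record_audio_timed(seconds=5, filename="user_response.wav", rate=10000):
    # Record from the default microphone for a fixed number of seconds,
    # then write the samples to a WAV file that Whisper can transcribe.
    recording = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype='int16')
    sd.wait()  # block until the recording finishes

    wf = wave.open(filename, 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(2)  # int16 samples are 2 bytes wide
    wf.setframerate(rate)
    wf.writeframes(recording.tobytes())
    wf.close()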

Next up is the voice_stream function, which takes the AI’s reply, converts it to speech, and plays it back through ffplay:

def voice_stream(input_text, assistant_voice):
    # This function takes the AI's text output and your voice selection and converts it into audio played by ffplay.
    response = client.audio.speech.create(
        model="tts-1",
        voice=assistant_voice,
        input=input_text
    )

    # Ensure the ffplay command is set up to read from stdin
    ffplay_cmd = ['ffplay', '-nodisp', '-autoexit', '-']
    ffplay_proc = subprocess.Popen(ffplay_cmd, stdin=subprocess.PIPE, stdout=open(os.devnull, 'wb'), stderr=subprocess.STDOUT)
    binary_content = response.content

    # Pipe the audio to ffplay
    try:
        ffplay_proc.stdin.write(binary_content)
        ffplay_proc.stdin.flush()  # Ensure the audio is sent to ffplay
    except BrokenPipeError:
        # Handle the case where ffplay closes the pipe
        pass
    finally:
        ffplay_proc.stdin.close()
        ffplay_proc.wait()  # Wait for ffplay to finish playing the audio
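One latency note: speech.create downloads the entire audio clip before ffplay starts playing. Recent versions of the openai Python SDK expose a with_streaming_response helper that yields the audio in chunks as they arrive; if your installed version has it (treat this as an assumption to verify), you can start playback almost immediately:

def voice_stream_chunked(input_text, assistant_voice):
    # Sketch: pipe TTS audio to ffplay chunk by chunk instead of all at once.
    ffplay_proc = subprocess.Popen(
        ['ffplay', '-nodisp', '-autoexit', '-'],
        stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT
    )
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=assistant_voice,
        input=input_text
    ) as response:
        try:
            for chunk in response.iter_bytes(chunk_size=4096):
                ffplay_proc.stdin.write(chunk)
        except BrokenPipeError:
            pass  # ffplay closed the pipe early
        finally:
            ffplay_proc.stdin.close()
            ffplay_proc.wait()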

Finally, we can put all the puzzle pieces together. A shared chat_loop function handles the record-transcribe-send-speak cycle, and main_loop lets you pick a session from your stored JSON file or create a new one. To start, simply run the code and follow the instructions. To end a session properly, start recording your voice and say “exit.”

def chat_loop(assistant_id, thread_id, user_name_input, assistant_voice):
    # The shared conversation loop: record, transcribe, send, run, and speak the reply.
    first_iteration = True
    while True:
        if first_iteration:
            print("Press Page Down to start/stop recording your voice message:")
        user_message = whisper()
        print(f"You: {user_message}")
        if user_message.lower() in {'exit', 'exit.'}:
            print("Exiting the program.")
            print(collect_message_history(thread_id, user_name_input))
            break
        if first_iteration:
            # Prepend the date and time to the first message of each session,
            # so conversations spanning multiple sessions have temporal context.
            current_time = dt.datetime.now().strftime("%Y-%m-%d %H:%M")
            user_message = f"It is now {current_time}. {user_message}"
            first_iteration = False
        send_message(client, thread_id, user_message)
        messages = run_assistant(client, assistant_id, thread_id)
        message_dict = json.loads(messages.model_dump_json())
        most_recent_message = message_dict['data'][0]
        assistant_message = most_recent_message['content'][0]['text']['value']
        print(f"{user_name_input}: {assistant_message}")
        voice_stream(assistant_message, assistant_voice)


def main_loop():
    # This function ties all of the above together into one easy-to-use system.
    user_choice = input("Type 'n' to make a new assistant session. Press 'Enter' to choose an existing assistant session.")
    if user_choice == 'n':
        user_name_input = input("Please type a name for this chat session: ")
        voice_names = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
        print("Voice List:\n1. Alloy - Androgynous, Neutral\n2. Echo - Male, Neutral\n"
              "3. Fable - Male, British Accent\n4. Onyx - Male, Deep\n"
              "5. Nova - Female, Neutral\n6. Shimmer - Female, Deep")
        assistant_number = input("Please type the number of the voice you want: ")
        assistant_voice = voice_names[int(assistant_number) - 1]
        assistant_id, thread_id = setup_assistant(client, assistant_name=user_name_input)
        save_session(assistant_id, thread_id, user_name_input, assistant_voice)
        print(f"Created Session with {user_name_input}, Assistant ID: {assistant_id} and Thread ID: {thread_id}")
        chat_loop(assistant_id, thread_id, user_name_input, assistant_voice)
    else:
        display_sessions()
        chosen_session_number = input("Enter the session number to load: ")
        assistant_id, thread_id, user_name_input, assistant_voice = get_session_data(chosen_session_number)
        if assistant_id and thread_id:
            print(f"Loaded Session {chosen_session_number} with Assistant ID: {assistant_id} and Thread ID: {thread_id}")
            chat_loop(assistant_id, thread_id, user_name_input, assistant_voice)


if __name__ == "__main__":
    main_loop()

Functionality & Usage

Aside from what I mentioned above, here is a list of the extra functionality of this system:

  • Timestamping — Each time you start a new session with an existing agent and thread, the code passes the current date and time to the AI. This gives the AI more “context” to conversations that last for multiple sessions. This makes the conversation seem a bit more real when it covers a long time span.
  • Voice Selection — You can select the voice that you prefer when you set up your assistant. A list of voices with brief descriptions prints each time that you set up a new assistant.
  • Assistant Naming — You can name your assistants anything you like. If the name doesn’t quite stick, just remind them of it in conversation.

Here is an example of what a session with this code looks like:

Type 'n' to make a new assistant session. Press 'Enter' to choose an existing assistant session.
Available Sessions:
Session 1: William
Session 2: digdaw
Session 3: smegal
Session 4: Colby
Session 5: Aidan
Session 6: mallory
Session 7: Hunter
Session 8: Zarra
Enter the session number to load: 4
Loaded Session 4 with Assistant ID: asst_0000 and Thread ID: thread_0000
Press Page Down to start/stop recording your voice message:
You: Hello?
Colby: Hi again! How are you doing tonight?
You: I'm fine thanks.
Colby: That's really good to hear! What are you up to?
You: exit.
Exiting the program.
Messages saved to Colby_message_log.txt

Process finished with exit code 0

I chose an existing session from my list, had a brief conversation, and exited the function properly, which saved the entire chat history to a text file in my working directory. Here’s how it looks when you build a new assistant:

Type 'n' to make a new assistant session. Press 'Enter' to choose an existing assistant session.
n
Please type a name for this chat session: David Foster Wallace
Voice List:
1. Alloy - Androgynous, Neutral
2. Echo - Male, Neutral
3. Fable - Male, British Accent
4. Onyx - Male, Deep
5. Nova - Female, Neutral
6. Shimmer - Female, Deep
Please type the number of the voice you want: 2
Created Session with David Foster Wallace, Assistant ID: asst_0000 and Thread ID: thread_0000
Press Page Down to start/stop recording your voice message:
You: Hello, my name is Jordan.
David Foster Wallace: Hey there, Jordan! What's on your mind tonight?
You: exit
Exiting the program.
Messages saved to David Foster Wallace_message_log.txt

Process finished with exit code 0

That covers the functionality and usage of this code. Feel free to tweak the prompt, the inputs and outputs, and anything else you can imagine. The possibilities here are vast, and I hope I’ve jumpstarted your imagination.

Future Additions

There are several things that I would like to improve about this system, but OpenAI has not released support for them yet:

  • Answer streaming — The answers, especially when long, can take a while to come in. When OpenAI eventually releases streaming support, I’ll adapt this code for faster response time.
  • Intelligent voice model job splitting — Once assistants support streaming, we can also add a sentence-splitting function that sends the first sentence or two to the TTS model immediately, for low-latency responses. Right now, there is often a jarring delay between your recording and the AI’s response.
  • Image support — Currently, there is no support for image generation. When this releases, your assistant can create images for you at your request, or of its own volition to spice up conversations.

Lastly, I didn’t include function calling or code execution in this version, because I wanted to keep this a simple chatbot companion. If you wish, though, you could easily enable these capabilities within this system; a sketch follows.
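Enabling them is mostly a one-line change at assistant creation time (a sketch: the tools parameter is part of the Assistants API, but handling function calls also requires responding to a run’s requires_action status, which the simple polling loop above doesn’t do):

assistant = client.beta.assistants.create(
    name=assistant_name,
    instructions="...",  # same prompt as before
    model="gpt-4-1106-preview",
    tools=[
        {"type": "code_interpreter"},  # lets the assistant run Python in a sandbox
        # {"type": "function", "function": {...}},  # or register your own functions
    ],
)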

I will release more articles moving forward with updated code as I improve this system. Thanks so much for reading. I hope I sparked your creative juices, and I’m excited to hear what you adapt this into!

-Jordan
