Image created with MidJourney & DALL-E 2

How I built a chatbot on a series of podcasts using GPT-3.5

Jens
10 min read · Mar 6, 2023


Generative AI promises to create a new interface for interacting with content: rather than listening to endless podcasts or reading entire corpora of text, a chatbot can give you a specific answer that is tailored to your needs. I’ve built a chatbot called ChatJoseph to create a new interface for interacting with a series of podcasts by Joseph Goldstein (a Buddhist meditation teacher). No need to spend hours searching for the lecture that answers your question: you can ask your question directly, get clarifications on concepts from a lecture, or get a podcast recommendation on a specific topic.

TLDR; How does the chatbot technically work? After preparing the data, the flow of the chatbot is relatively simple:

  1. Whenever the user asks a question, an “agent” first determines what kind of question it is: is it a meditation-related question, a request for a podcast, or a harmful question?
  2. If it is a meditation-related question, the chatbot performs a search query over 25 transcribed lectures of Joseph Goldstein and returns the X most relevant pieces of information
  3. It then sends a prompt to OpenAI’s GPT-3.5 API endpoint, something along the lines of: “You are Joseph Goldstein, a meditation teacher, and Buddhist. Given the context (as created in step 2), can you answer the following question as truthfully as possible?”
  4. The answer is then given back to the user in a Next.js interface
  5. Other functions include a request for sources and a podcast recommendation based on the previous question

Stack

This article presents a tutorial on how to build a chatbot similar to ChatJoseph, using a series of lectures as a base. I’ve also written an article on the need for a new interface and the applications of generative AI for the future. I’ve created ChatJoseph using the following stack:

  - OpenAI Whisper for transcribing the podcasts
  - OpenAI text-embedding-ada-002 for embeddings and gpt-3.5-turbo for classification and text generation
  - Pinecone.io as the vector database
  - Next.js (React with Redux) for the interface, deployed on Vercel

In this article, I describe how to transcribe a series of podcasts and transform them into useful embedded chunks for our application. After that, I’ll talk about the need for an agent and how to implement one. Then, I’ll show how to create the final prompt that will be sent to OpenAI’s API endpoint. I’ll say a few words about the interface I’ve created. Finally, I showcase some limitations and provide some useful links.

Data preparation & embedding

We first have to prepare the data, which in my case was a series of Joseph Goldstein’s lectures from the Waking Up app. We transcribe the podcasts, transform them into usable chunks of text, and then embed the chunks and store them in a vector database.

I used OpenAI’s Whisper (which is open-source) to transcribe the lectures, which took about 13h on my local MacBook Pro. I found the base model to be more than accurate enough. Whisper returns the entire text as well as the transcription in ‘segments’ of about 15 seconds.
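
As a rough sketch, the transcription step with the open-source whisper package (pip install openai-whisper) looks something like this; the file name is a placeholder for your own audio:

import whisper

model = whisper.load_model("base")           # the base model was accurate enough for my purposes
result = model.transcribe("lecture_01.mp3")  # placeholder file name

print(result["text"][:200])              # the full transcript as a single string
for segment in result["segments"][:3]:   # segments of roughly 15 seconds with timestamps
    print(segment["start"], segment["end"], segment["text"])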

After running Whisper, you then transform the small segments into more useful chunks of text. We want these chunks to contain full sentences and to approximate a certain token length, which will be useful when creating the final prompt. We first merge segments into full sentences by checking whether a segment ends with a “.” or a “?”: Whisper’s segments are already reasonably well cut, but a segment may stop either in the middle of a sentence or at its end, so we concatenate segments until one closes a sentence. Then, we add these full sentences to a chunk until a maximum length of X tokens is reached (I used a maximum of 250 tokens per chunk). Tokens are a way of segmenting text; 1 token equals roughly 3/4 of a word (see more on tokens). I used HuggingFace’s GPT2TokenizerFast to count the number of tokens in a sentence. The final output for each chunk is the text, the podcast title, the starting time in the podcast, and the number of tokens.
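
A sketch of this chunking step, assuming the Whisper segments from the step above (the helper and variable names are my own), could look like this:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
MAX_TOKENS = 250  # maximum chunk length in tokens

def count_tokens(text):
    return len(tokenizer.encode(text))

def build_chunks(segments, podcast_title):
    # Merge Whisper segments into full sentences, then pack sentences into chunks.
    chunks = []
    chunk_text, chunk_start = "", None
    sentence, sentence_start = "", None

    for seg in segments:
        if sentence == "":
            sentence_start = seg["start"]
        sentence += seg["text"]
        # Only proceed once the accumulated text ends in sentence-final punctuation.
        if not sentence.rstrip().endswith((".", "?")):
            continue

        # Flush the current chunk if adding this sentence would exceed the budget.
        if chunk_text and count_tokens(chunk_text + sentence) > MAX_TOKENS:
            chunks.append({"text": chunk_text.strip(), "title": podcast_title,
                           "start": chunk_start, "tokens": count_tokens(chunk_text)})
            chunk_text, chunk_start = "", None

        if chunk_start is None:
            chunk_start = sentence_start
        chunk_text += sentence
        sentence = ""

    if chunk_text:
        chunks.append({"text": chunk_text.strip(), "title": podcast_title,
                       "start": chunk_start, "tokens": count_tokens(chunk_text)})
    return chunks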

Once the transcript is transformed into usable text chunks, we embed them. Why do we need to embed the chunks? When a user asks a question, we will also embed the question and compare it against the database of podcast chunks, so we can return the most relevant pieces of the podcasts. Embeddings are a way of transforming a text (or an image, or another piece of data) into a vector of floating-point numbers; because OpenAI’s embeddings are normalized, their cosine similarity can be computed with a simple dot product. For a wonderful explanation and introduction to embeddings, I would recommend this video, or look into the learning resources of Pinecone.

I’ve used OpenAI’s text-embedding-ada-002 model for embeddings. Once the embeddings are done, you can either save them to a CSV for local use or use a vector database: I used Pinecone.io, but other options are Weaviate.io, and even Supabase.com can now store vectors. Pinecone.io is free up to a certain usage level, and it took me just a few minutes to set up.
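
As a minimal sketch, assuming the openai (pre-1.0) and pinecone-client packages that were current at the time of writing (the keys and index name are placeholders):

import openai
import pinecone

openai.api_key = "YOUR_OPENAI_API_KEY"
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("podcast-chunks")  # an index created with dimension 1536

def embed(texts):
    # text-embedding-ada-002 returns one 1536-dimensional vector per input text
    response = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
    return [item["embedding"] for item in response["data"]]

# `chunks` is the output of the chunking step above
vectors = []
for i, chunk in enumerate(chunks):
    [embedding] = embed([chunk["text"]])
    vectors.append((f"chunk-{i}", embedding,
                    {"title": chunk["title"], "start": chunk["start"], "text": chunk["text"]}))

index.upsert(vectors=vectors)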

To see some actual code for this entire process, you can check out OpenAI’s cookbook examples, in particular this one on Olympics data.

Agent

For a simple Q&A bot, you could create a search query for the lectures and a simple prompt that turns those podcast chunks into an actual answer. Adding an agent would then not be necessary. However, after testing with some initial users, I realized that one configuration of settings would not always give a useful answer. There would be a lot of questions regarding the chatbot itself (e.g. Are you an AI chatbot?) or some questions that don’t directly get answered in the lectures (e.g. how do you feel about love?). For each of these categories of questions, you would want to change some of the settings: e.g. adding some context that is not featured in his lectures, such as a Wikipedia page, or increasing the temperature slightly. In addition, you want to prevent misuse, reverse prompt engineering, and any indications of harmful language. At first, I tried to run these functions in series; if the first configuration of settings does not give a useful reply, try the next one, then the next. But as a result, the query time would take incredibly long, and it started to get very hectic, having more than two functions. So, the solution: adding an “agent.”

The idea is simple: you first let an AI analyze the question and determine what kind of response, and therefore what kind of function, is required. You give the agent a set of options (e.g. 1: a question about meditation, 2: a question about the assistant or the creation of the application), and you add the conversation. You get back a classification of the user’s request, and you can then call a specific configuration for the next steps. You could also use an agent not just to classify a single action, but to look at a question or problem and propose the series of steps required to answer it. For more info, check out LangChain.

I use OpenAI’s gpt-3.5-turbo to classify the user’s last message, determining whether it’s a meditation-related question, a question about the creation of the application, a podcast recommendation, or one of a few other options. Based on the classification, I then route to different functions that adjust the settings of the final text-generation request. Examples of these settings are a high or low temperature, whether to include podcast chunks, and how much of the chat history to include.

const header = `Please classify the message of the user in the conversation below. You are given the following options:`;
// `actions` is the list of possible classifications, each with an id and a title
const options = actions.map(item => `${item.id}. ${item.title}`).join("\n");
const assignment = `Please use a numerical classification ONLY. The last message of the user in the conversation above is classified as:`;
const promptMessages = [
  {role: "system", content: header + "\n" + options},
  {role: "assistant", content: previousMessage},
  {role: "user", content: currentMessage},
  {role: "user", content: assignment}
];

Test your prompts in OpenAI’s Playground and see if the classification works. I previously used the text-davinci-003 completions model, but gpt-3.5-turbo seems to work much better.

Creating the prompt

Although the agent can classify various options, the main goal is to answer meditation-related questions using the transcribed podcasts of Joseph Goldstein. To do that, we embed the chat input, find the most relevant podcast chunks by taking the dot product between the chat-input embedding and the embeddings of the podcast chunks, prepare the API request, and, finally, use GPT-3.5’s chat model to answer the question.

We once again use OpenAI’s text-embedding-ada-002 to embed the chat input, which returns a single embedding vector (a list of floats). If you’ve saved the podcast embeddings locally as a CSV, you first load them as a list. We then compute the vector similarity using NumPy’s .dot() function. For more on vector similarity, read this Pinecone blog post. If you’re using Pinecone.io, you can use their API endpoint to perform the search, and it will return the top K results.
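
For the local CSV route, a sketch of the retrieval step could look like this; it reuses the embed() helper from the earlier sketch and relies on the fact that ada-002 vectors are normalized, so the dot product equals the cosine similarity:

import numpy as np

def top_k_chunks(question, chunks, chunk_embeddings, k=5):
    [question_embedding] = embed([question])
    # One dot product per stored chunk; higher means more similar.
    similarities = np.dot(np.array(chunk_embeddings), np.array(question_embedding))
    best = np.argsort(similarities)[::-1][:k]
    return [chunks[i] for i in best]

# With Pinecone, the query endpoint returns the top-k matches directly:
# index.query(vector=question_embedding, top_k=5, include_metadata=True)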

Using GPT’s Chat model, we can include a message with role: "system", giving instructions on how to respond. I’ve also included the podcast chunks as “context” in the system message. Aside from that, you can include a history of the chat by alternating between the role of user and assistant.

You can play around in OpenAI’s Playground to see how the system message affects the answer. I’ve tried to make it in a way that prevents (most of the) hallucinations. The header that I include in the system message for ChatJoseph is as follows:

const header = `You are Joseph Goldstein, an experienced mindfulness teacher and Buddhist. Answer as truthfully as possible using the provided context, in a friendly manner. Answer in a concise way. Also keep in mind the rest of the conversation. If the answer is not contained within the context below, say "I'm not sure how to respond."\n\nContext:\n`;

After the header, we add X chunks. I’ve created a function that adds the top podcast chunks until a maximum number of tokens is reached. You’ll have to make a trade-off between adding many podcast chunks and including a long chat history. Also, the longer your request, the slower and costlier it will be. I’m using the agent function to determine what kind of question it is and whether to include more or less history and chunks.
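
A sketch of such a function, with an illustrative token budget (the actual budget depends on how much chat history you want to keep):

MAX_CONTEXT_TOKENS = 1500  # illustrative budget for the podcast chunks

def build_context(ranked_chunks):
    # Add the most relevant chunks, in order, until the token budget is reached.
    context, used_tokens = "", 0
    for chunk in ranked_chunks:
        if used_tokens + chunk["tokens"] > MAX_CONTEXT_TOKENS:
            break
        context += chunk["text"] + "\n\n"
        used_tokens += chunk["tokens"]
    return context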

OpenAI expects a list of messages, and in the end, the promptMessages variable should look something like this:

// The system message contains the instructions plus the retrieved podcast chunks,
// followed by the (truncated) chat history and the user's latest question.
promptMessages = [
  {role: "system", content: header + chunks},
  ...chatHistory,
  {role: "user", content: query}
];

After creating the prompt, we can now call OpenAI’s chat API. Here you can set parameters such as temperature, max_tokens, and model. Temperature controls the randomness of the completion: for factual answers, opt for a low temperature. max_tokens caps how long the response can be. There are various other parameters; see OpenAI’s documentation.
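
In Python, again assuming the pre-1.0 openai package, the call could look roughly like this (the parameter values are illustrative, not the exact ones ChatJoseph uses):

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=prompt_messages,  # the system/assistant/user messages built above
    temperature=0.2,           # low temperature: less randomness, more factual answers
    max_tokens=400,            # upper bound on the length of the reply
)
answer = response["choices"][0]["message"]["content"]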

To see how the entirety would look in Python, you can look at this OpenAI example.

Interface

So we have a functional Python script that can answer a question given the context of a series of podcasts/lectures. The next step is to build an interface. For simplicity, I’ve used Next.js and deployed it on Vercel, so I don’t have to deploy a separate front-end and back-end. I’ve transformed all the prompt-creation Python code to JavaScript and created a React front-end. I store all the answers (including the sources) in Redux and send the new message and the previous response (including sources) to the API endpoint.

Limitations

It’s already incredible what you can do in a short period with the current text-generation models. You can create a well-functioning chatbot that gives not only human-like but genuinely useful answers. That said, it is important to acknowledge the current limitations.

I previously used GPT-3’s davinci model, and the gpt-3.5-turbo model is already a major improvement: it gets the classification right more frequently, and the answers are longer and, in general, more helpful. However, one side effect is that it replies more readily to non-meditation-related questions.

Screenshot of ChatJoseph AI

Of course, Joseph Goldstein never said anything about Taiwan in his lectures on Buddhism. On the other hand, this answer might be more useful than a plain “I’m not sure how to respond.” It raises the question of which answer is more desirable. By editing the prompts, one could very likely change the behavior so the bot does not reply to things Joseph Goldstein never talked about.

Another issue is that it is hard to tell which podcast chunks are actually relevant. Podcast passages often refer to broader concepts and approach them in different ways. Rather than simply copying the top-K chunks into the completion prompt, it could be useful to retrieve larger chunks and summarize them first before answering the user’s question.

Useful links

There are many examples of code using GPT-3 out there, as well as libraries that can help you set up a chatbot in minutes. Here are a few useful links that helped me build ChatJoseph.

Conclusion

Generative AI promises to create a new interface for interacting with content: no need to scour large libraries of content; the information you need is at your fingertips. This article presented a tutorial on how to build a chatbot similar to ChatJoseph, using a series of lectures as a base. By transcribing and vectorizing audio data, using an agent, and drafting a response, we created an interface in Next.js that can answer meditation-related questions using the transcribed podcasts of Joseph Goldstein.

Check out the article I made on why a chatbot can be a useful interface:

