Q&A Chatbot for YouTube with text-davinci-003 and text-embedding-ada-002

Greyboi
12 min read · Jan 8, 2023
[Image: a robot giving a lecture about a YouTube video, cartoon style, with the YouTube logo]

NOTE: All code in this article comes from my AI experiments repo:
https://github.com/emlynoregan/newaiexp

In Summarize Youtube with text-davinci-003, I presented a Python script that summarizes a YouTube video like this:

> python .\ytsummary.py https://www.youtube.com/watch?v=ZdrZf4lQTSg

Summary of section beginning at 0:00:00
---
In this section of the transcript, the speaker explains the concept of space-time and how it differs from two-dimensional space. They explain how the Lorentzian Distance is used to calculate distances in space-time, and how the time that passes for an observer moving on any curve in space-time is the length of that curve, which is called the “proper time”. They also discuss the concept of time dilation, and how absolute velocities have no physical meaning.

Summary of section beginning at 0:10:00
---
This section of the transcript discusses the resolution of the “twin paradox” and Newton’s bucket paradox. It explains that acceleration is absolute and can be measured with a spring, and that time dilation comes from acceleration, not from relative velocity. It also explains that gravity is not a force, but is caused by the curvature of space-time. Finally, it suggests that actively engaging with a topic is the best way to understand it.

Summary of section beginning at 0:20:00
---
This section of the transcript discusses the benefits of using Brilliant, an online learning platform. It explains that the courses come with interactive visualizations and questions to check understanding. It also mentions a course on quantum mechanics created in collaboration with Brilliant. Finally, it provides a link and a discount code for the first 200 subscribers.

Overall summary
---
This transcript discusses the concept of space-time and how it is different from two-dimensional space. It explains the Lorentzian Distance and time dilation, and how acceleration is absolute and can be measured with a spring. It also resolves the twin paradox and Newton’s bucket paradox, and suggests that actively engaging with a topic is the best way to understand it. Finally, it mentions the benefits of using Brilliant, an online learning platform, and provides a link and a discount code for the first 200 subscribers.

(this is the summary of https://www.youtube.com/watch?v=ZdrZf4lQTSg, “Special Relativity: This Is Why You Misunderstand It” by Sabine Hossenfelder)

This is exactly the sort of video I might summarize to get a quick overview of what it contains: is it really interesting? Do I want to know more?

So I look at this summary, and decide, sure, I do! So, what are my options? I guess I’m back to watching the video from beginning to end?

I didn’t like that answer, so I built an AI-driven Q&A bot for asking questions about a YouTube video. I run it like this:

> python ytqa.py https://www.youtube.com/watch?v=ZdrZf4lQTSg

and, after it downloads the transcript and does a bunch of processing, I can have an interactive session like this:

Hi, I’m a bot. Ask me a question about the video, and I’ll try to find the answer for you.

> the speaker says space-time is different from 2d space. What does she say in particular about this?

The speaker says: “If you have two dimensions of space, then all points at the same distance from the origin would be on a circle. Different distances correspond to circles with different radii. But if you do this in space-time, then all points at the same space-time distance from the origin are hyperbole. You can’t move on those lines because that’d require you to move faster than light. But you could move on one of those. That would require a constant acceleration.”

> what else does she say? anything?

The speaker also explains that the key to making sense of space-time is understanding that the time that passes for an observer moving on any curve in this space-time is the length of that curve. She also states that Einstein was the one to understand that if you combine space with time, then time becomes a coordinate, like space. Additionally, she explains that the reason time slows down is acceleration, and that Special Relativity does describe acceleration, but only in flat space-time. Finally, she explains that gravity is not a force which is why being at rest with the surface of a gravitating body requires an acceleration and that too slows down time.

> What is the lorentzian distance?

The Lorentzian Distance is a peculiar notion of distance in space-time, named after Hendrik Lorentz, that is used to calculate the length of a curve in space-time. It is used to calculate the time that passes for an observer moving on any curve in space-time, which is referred to as the “proper time”. The Lorentzian Distance is calculated by taking the square root of the difference between the square of delta t and the square of delta x, divided by c, where c is the speed of light.

> How does all this relate to time dilation?

The speaker explains that time dilation is a real effect that comes from acceleration, and is caused by the difference between the coordinate time and the proper time. This difference is calculated using the Lorentzian Distance, which is the square root of the difference between the square of delta t and the square of delta x, divided by c, where c is the speed of light. This is what causes time to slow down, and is why time passes differently for observers moving at different velocities or in different gravitational fields.

> What is newton’s bucket paradox?

Newton’s bucket paradox is an experiment proposed by Isaac Newton in which a bucket is filled half-way with water, and then spun. According to Newton, when the bucket is spun, the water will be pushed against the sides and form a dip in the middle, and it will also start spinning with the bucket. Newton argued that the reason for this was that the bucket with the water moves relative to the rest of the universe. However, the speaker explains that the real reason why the water behaves differently in both cases is that a change of direction is also an acceleration, and acceleration is absolute. This is why the time on Bob’s trip is always shorter than Alice’s, and why Newton’s bucket paradox is not a paradox.

> is that the same as the twin paradox?

No, the twin paradox and Newton’s bucket paradox are not the same. The twin paradox is a thought experiment proposed by Albert Einstein to explain the effects of time dilation due to relative motion. In the thought experiment, two identical twins, Alice and Bob, go on different journeys. Bob goes on a trip to Andromeda, takes a selfie as he passes by the supermassive black hole, and hurries back home to post it on twitter. When he comes back, who is older, Bob or Alice? The resolution of the twin paradox is usually to point out that actually the situation is not symmetric. Because for Bob to make a round trip he cannot move at constant velocity. He needs to accelerate to turn around. Newton’s bucket paradox, on the other hand, is an experiment proposed by Isaac Newton in which a bucket is filled half-way with water, and then spun. The speaker explains that the real reason why the water behaves differently in both cases is that a change of direction is also an acceleration, and acceleration is absolute.

> How does she say we should engage with this topic?

The speaker suggests that in order to really understand the topic, one needs to actively engage with it. She recommends checking out Brilliant’s interactive courses on a variety of topics in science and mathematics, including special relativity, as a way to do this. She believes that this fresh and new method of learning can help to understand the topic better.

This is a pretty good result! I don’t feel like I understand general relativity based on this session, but I do understand that I don’t understand, and that’s something.

Find the link at the top of the article to the github repo to try this out for yourself.

So how does it work?

Every time the user asks a question and the bot needs to respond, ytqa.py sends a completion request to OpenAI, using text-davinci-003 (the same model that powers ChatGPT, I think).

What prompt do I send? Well, it currently looks like this:

prompt = f"""
The transcript of a video includes the following sections that
might be relevant to the question:

{top_k_similar_utterances_str}

The user and the AI are having a conversation about the video.
Here's the most recent transcript of the conversation:

{most_recent_chat_utterances_str}

Using the information above, answer the following question.
Try to stick to information in the video transcript,
and include quotes where that's helpful (but not too many):
{user_input}
"""

You can see this prompt requires me to provide three things:

  • The most recent chat utterances: I just keep a history of the back and forth between the user and the bot.
  • The user input: that’s the latest input from the user. It’s actually also present in the recent chat utterances, so it’s redundant (note to self to refine this).
  • The top k similar utterances: this is a selection of utterances from the video transcript that might help GPT-3.5 answer the user’s question.
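For concreteness, here’s a sketch of how the prompt assembly and completion request might be wired up, assuming the legacy (pre-1.0) openai Python library that text-davinci-003 shipped with. The function names and parameter values are my own illustrations, not necessarily what ytqa.py uses:

```python
def build_prompt(top_k_similar_utterances_str,
                 most_recent_chat_utterances_str,
                 user_input):
    # Assemble the prompt template shown above from its three inputs.
    return f"""
The transcript of a video includes the following sections that
might be relevant to the question:

{top_k_similar_utterances_str}

The user and the AI are having a conversation about the video.
Here's the most recent transcript of the conversation:

{most_recent_chat_utterances_str}

Using the information above, answer the following question.
Try to stick to information in the video transcript,
and include quotes where that's helpful (but not too many):
{user_input}
"""

def answer_question(prompt):
    import openai  # legacy pre-1.0 client
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=500,    # leave room in the context window for the answer
        temperature=0.2,   # low temperature: stay close to the transcript
    )
    return response["choices"][0]["text"].strip()
```

The low temperature is a judgment call: for Q&A grounded in a transcript, you generally want the model to stick to the material rather than get creative.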

Of these, the top k similar utterances wants some explanation.

Ideally I’d just insert the entire video transcript into the prompt, to give GPT-3.5 the material to answer the question with. But in general it’s too long to fit. The transcript could be any length, while the prompt can only be roughly 2,000 words long at most (text-davinci-003’s context window is about 4,000 tokens, shared between the prompt and the completion).

So what I need is to somehow select only the most relevant parts of the transcript to put into the prompt for the question that needs to be answered.

To do that, I’m using a technology called Embeddings.

Embeddings

An embedding is a list of floating point numbers. For any given transformer model, the length of an embedding is fixed. I’m using text-embedding-ada-002 from OpenAI, whose embeddings are 1536 elements long, i.e. a list of 1536 numbers.

And what is an embedding? It comes from the internal representation of concepts inside a transformer’s neural net, but basically it represents the meaning of an utterance.

sentence (string) -> (text-embedding-ada-002) -> meaning (embedding)

The embedding itself is almost completely opaque to us; we don’t know what any of the numbers mean. However, the one operation we can perform is to compare the relative meaning of two embeddings.

If you treat each embedding as a point in space (in this case, a point in 1536-dimensional space), then the closer two points (embeddings) are to each other in that space, the more similar their meaning.

In machine learning, people tend to use the cosine similarity, which measures the angle between two points (treated as vectors); this achieves the same thing as distance (particularly with normalised embeddings like those from OpenAI, which sit on a unit hypersphere).

Here’s a ChatGPT explanation of cosine similarity:

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is defined as follows:

cosine_similarity = dot(A, B) / (norm(A) * norm(B))

where:

dot(A, B) is the dot product of vectors A and B

norm(A) is the Euclidean norm of vector A, which is the square root of the sum of the squares of the elements of A

norm(B) is the Euclidean norm of vector B, which is the square root of the sum of the squares of the elements of B

The dot product of two vectors is the sum of the products of the corresponding elements of the vectors, while the Euclidean norm of a vector is a measure of the length of the vector.

So, the cosine similarity between two vectors is a measure of the angle between the vectors, with a value between -1 and 1. A value of 1 indicates that the vectors are pointing in the same direction, a value of 0 indicates that the vectors are orthogonal (i.e., perpendicular), and a value of -1 indicates that the vectors are pointing in opposite directions.

In all the cases where I’ve worked with OpenAI embeddings, this calculation has come out between zero and one; in theory it can go as low as -1, but in practice the embeddings never seem to point in opposite directions.
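The definition above translates directly into code. Here’s a minimal pure-Python sketch (in practice you’d more likely use numpy):

```python
import math

def dot(a, b):
    # Sum of the products of corresponding elements.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean norm: square root of the sum of squares.
    return math.sqrt(sum(x * x for x in a))

def cosine_similarity(a, b):
    # cosine_similarity = dot(A, B) / (norm(A) * norm(B))
    return dot(a, b) / (norm(a) * norm(b))
```

Note that for normalised embeddings (norm of 1), this reduces to a plain dot product.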

How does this all help?

Say I have the user’s question Q as a string, and I have n utterances from the video transcript, U[0] … U[n-1].

I can calculate the embedding E(Q) for the question, and the embedding E(U[i]) for each utterance, by calling OpenAI’s embeddings API.

Then I can calculate the similarity of each utterance to the question, by computing cosine_similarity(E(Q), E(U[i])) for each of the n utterances.

Once I’ve done this, I can sort the utterances by cosine similarity from high to low and take, say, the first 20 from that sorted list; these will be the utterances that are about things similar to the question, and are the most likely to contain the material needed to answer it.
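The whole selection step can be sketched like this. The names are my own, and the toy vectors stand in for real 1536-element embeddings from text-embedding-ada-002:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def top_k_utterances(question_emb, utterances, utterance_embs, k=20):
    # Score every transcript utterance against the question,
    # then keep the k most similar.
    scored = [
        (cosine_similarity(question_emb, emb), utt)
        for utt, emb in zip(utterances, utterance_embs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [utt for _, utt in scored[:k]]
```

With toy 2-d embeddings, a question embedded near “relativity” pulls the relativity utterance to the front of the list, which is exactly the behaviour we want at 1536 dimensions.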

The script is a little more complex than this though; we have a couple of problems that still need addressing.

Problem 1: the utterances in the transcript are incomplete.

When I download the transcript of a YouTube video, it comes broken into little pieces, smaller than entire sentences. Here’s an example from Sabine’s video above:

0:06 — feel longer if it’s a job interview, but even the most awkward conversation doesn’t actually slow

0:11 — down time. And yet, Einstein said, time can slow down — it doesn’t always pass at the same rate.

0:19 — Just what did Einstein say about the passage of time? What’s the resolution of the twin paradox,

0:24 — what’s up with Newton’s bucket, and what does it all mean? That’s what we’ll talk about today.

0:34 — When I was a teenager, I was super interested in Einstein’s theories of space and time. And

0:39 — I read a lot of popular science books about them. I didn’t understand a thing. I then

0:45 — went on to do a PhD in physics. And today it’s my turn trying to explain it to you.

Those aren’t too bad, but you can see that to answer a question, we might need longer contiguous utterances that go together.

In the script I concatenate consecutive runs of 1, 3, 5, 7, and 9 utterances, synthesising new, longer utterances from them, and I calculate embeddings for all of these. Then, when I’m grabbing the most similar utterances, I use the start time and length to identify overlapping (redundant) utterances and throw out the less similar ones.

That’s a fairly primitive technique, but it seems to work. It’d probably be smart to include even longer concatenations. Another possibility would be, after choosing the top 20 most similar utterances, to lengthen them with some of the utterances that precede and follow each match, on the assumption that more context might be better.
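The windowing step might look something like this. The dict keys here are my own guess at the transcript format, not necessarily what ytqa.py uses:

```python
def synthesize_windows(utterances, sizes=(1, 3, 5, 7, 9)):
    """Concatenate runs of consecutive transcript utterances into
    longer synthetic utterances. Each utterance is assumed to be a
    dict with 'start' (seconds) and 'text'."""
    windows = []
    for size in sizes:
        # Slide a window of `size` utterances over the transcript.
        for i in range(len(utterances) - size + 1):
            run = utterances[i : i + size]
            windows.append({
                "start": run[0]["start"],  # used later to detect overlaps
                "length": size,
                "text": " ".join(u["text"] for u in run),
            })
    return windows
```

Every window gets its own embedding, and the start time plus length is enough to tell when two high-scoring windows cover the same stretch of the video.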

Problem 2: The user’s question might be too vague.

It’s all very well to use embeddings to match questions to utterances, but what about a question like this:

> what else does she say? anything?

That doesn’t contain any useful information about what the question is really asking (it’s relying on previous context), so the embedding calculated from it will likely be useless.

To fix this, I create the following prompt:

prompt = f"""
The user and the AI are having a conversation about the
video. Here's the most recent transcript of the conversation:

{most_recent_chat_utterances_str}

Rewrite the final user response, using the rest of the chat
as context, producing a sentence that would be more useful for
creating an embedding for semantic search:
"""

and I ask GPT-3.5 for the answer.

Here’s what it does for the example above (you can see these rewritten questions by passing the “--diagnostics” flag to ytqa.py):

> what else does she say? anything?
***User input for embedding:
What other points does the speaker make about space-time being different from two-dimensional space?

When calculating embeddings and cosine similarity for the question Q, I don’t actually use Q; instead, I use this rewritten sentence Q’.

You can see that Q’ contains all the contextual information we need to then calculate and compare an embedding.

I don’t use Q’ when actually asking GPT-3.5 to answer the question! In that case it gets the chat history and the original question, so if the rewriting loses some of the sense of the original question, that’s okay. But the semantically similar transcript utterances are retrieved using Q’.

Conclusion

So I’ve presented a Q&A bot for interactively answering questions about a YouTube video. It relies on embeddings to cut down how much of the transcript we need to put in the prompt for GPT-3.5, which is essential for working within its limitations.

You can use this embeddings technique for many other purposes. For example, if you want to create a chatbot with an unlimited-length conversation history, you could calculate embeddings for everything the human and the bot say to each other. Then, when you prompt for the bot’s next response, use cosine similarity to find the top K most relevant historical utterances (by bot and/or human) and put those into the prompt, rather than just the top K most recent ones.

Also, this Q&A approach could be used for any document: books, long text files, web pages, podcasts, whatever you find useful.

I find myself using ytsummary.py and ytqa.py in tandem to “experience” long YouTube videos now, without watching them, and it’s saving me a lot of time. I hope it saves you time too! Let me know if you use these scripts, I’d love feedback. Thanks for reading!
