Summarize YouTube with text-davinci-003

Greyboi
8 min read · Dec 18, 2022
“A tablet showing a youtube video being dropped into a sausage making machine. Stock photo, 8k, realism, detailed.” — yeah ok

Someone sends you a link to a video. Looks interesting! But you click it, and, oh man, it’s 3 hours long. Is any of this worth your time?

I have this problem daily. So I wrote a short Python script, ytsummary.py, which takes the video’s transcript, summarizes it in 10-minute chunks, and then produces an overall “summary of summaries”.

Here’s the code: https://github.com/emlynoregan/ytsummary/blob/master/README.md
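To run it yourself you’ll need the openai and youtube-transcript-api packages, plus an OpenAI API key. The script imports a little setcreds module to set up credentials; that file isn’t in the listing later in this post, but a minimal version of it looks something like this:

# setcreds.py
# A minimal sketch: importing this module sets the OpenAI API key.
# (Here it's read from an environment variable; use whatever suits you.)
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]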

For example, I work with serverless computing on AWS, and here’s a talk from AWS re:Invent 2022 about Lambda:

Might be interesting? Let’s have a look…

Oh, so many minutes :-(

Let’s use ytsummary to figure out what’s in there:

> python ytsummary.py https://www.youtube.com/watch?v=0_jfH6qijVY

Found 6 chunks

Summary of chunk 1:
In this section, Julian Wood introduces the talk and
explains how Lambda is used to build modern
applications with the lowest total cost of ownership.
He also talks about how Lambda has evolved over the
years, with features such as provision concurrency,
container images, and 10 gig functions. He then goes
on to discuss how customers are using Lambda for
various applications, such as IT automation, data
processing pipelines, microservices-based applications,
and machine learning applications. He also explains
the importance of security, durability, availability,
and features in the Lambda service, and how AWS takes
on more of the security in the cloud for serverless
applications. Finally, he talks about the open
source Firecracker technology and the two types of
invocation models for Lambda.

Summary of chunk 2:
This section of the transcript discusses how Lambda
works to process both synchronous and asynchronous
invokes. It explains how the frontend service handles
authentication and authorization, the assignment
service manages the execution environment, the
placement service optimizes the placement of the
environment, and the control plane service manages
the lifecycle of the assignment service nodes. It
also explains how the event invoke frontend service
and pollers handle asynchronous invokes, and how
event source mappings are used for other service
integrations.

Summary of chunk 3:
In this section, Julian and Chris discuss how Lambda
manages state to serve invokes. They explain that
Lambda uses an invoke frontend and poller fleet to
get invoke input to the correct machine at the
correct time, and that code is downloaded from S3. To
improve the Lambda service, they implemented
container packaging support, which required them to
rethink the mechanism by which code is delivered
into the execution environment. They realized that if
they could download and present only the necessary
bits of a container image to the environment, they
could get the environment started more quickly and
amortize code delivery time and cost. To do this,
they changed the way container images are persisted
in the Lambda storage subsystem and broke the binary
representation of the file system into chunks on the
block device.

Summary of chunk 4:
In this section, the speaker discusses how Lambda
uses container images and virtual machines to
optimize code delivery and state management. They
explain how they use chunking and convergent
encryption to deduplicate data while maintaining
security, and how they use Firecracker to reduce
overhead and improve efficiency of the VM fleet.

Summary of chunk 5:
In this section, Chris talks about how Lambda
leveraged lessons from storage services to improve
the performance, efficiency, and overall experience
of the Lambda service. He explains how Lambda uses
customer access patterns to influence how data is
laid out in the storage subsystem, how shared state
is important for utilization and performance, and
how storage services spend a lot of time meeting
their caller where they are to hide the complexities
inherent in a storage service from a customer. He
then talks about how Lambda solved Java cold starts
with the use of Firecracker and microVM snapshot and
restore, and how this improvement to the cold start
experience was made possible by turning a compute
problem into a storage problem. Finally, he mentions
the issue of state with the Worker Manager service
and how the control plane service manages Worker
Managers.

Summary of chunk 6:
In this section, the speaker discusses how the
Lambda service works to efficiently run workloads
and minimize idle time. They explain how the
Assignment Service works to ensure that execution
environments are not orphaned in the event of a
zonal failure, and how the service is written in
Rust for performance, latency, and memory safety.
The speaker also mentions how they use models and
machine learning to pack workloads optimally
together, and how they have a team of research
scientists dedicated to the placement problem.
Lastly, they encourage viewers to learn more about
serverless on AWS and to provide a five star
rating for the session survey.

Summary of summaries:
This video discusses how Lambda is used to build
modern applications with the lowest total cost of
ownership. It explains how Lambda works to process
both synchronous and asynchronous invokes, and how
it uses container images, virtual machines, and
Firecracker technology to optimize code delivery
and state management. It also talks about how
Lambda leveraged lessons from storage services to
improve the performance, efficiency, and overall
experience of the Lambda service, and how the
Assignment Service works to ensure that execution
environments are not orphaned in the event of a
zonal failure. Finally, it encourages viewers to
learn more about serverless on AWS and to provide
a five star rating for the session survey.

>

I’m pretty sure those summaries are better than anything I’d produce if I watched the video and tried to write them by hand.

And they’re extremely informative! Reading through this summary, I can see that while there’s a lot of cool detail about how Lambda is implemented at AWS, there’s next to nothing that would actually change how I’d use it, and no hints of future features. So I can probably skip this one.

How does it work? This is what text-davinci-003 thinks:

This program is a python script that is used to summarize the transcript of a YouTube video. It takes a YouTube video ID or URL as an argument and then retrieves the transcript of the video. The transcript is then split into chunks of 10 minutes each. For each chunk, the program calls an API from OpenAI to summarize the chunk into a single sentence. Once all of the chunks have been summarized, the program calls the OpenAI API again to summarize the summaries of the chunks. Finally, it prints out the summary of the summaries.

10/10, perfect. Great job.

Here’s the full technique: I pasted the program into OpenAI’s sandbox, using text-davinci-003, and asked it to describe how it works. For brevity, I cut out all the optional stuff in the code (about mentioning people):

Here is a program written in python.
---
# In this program we will read the transcript of a youtube video and summarize it
# Where the transcript is longer than 10 minutes, we will split it into 10 minute chunks

import setcreds
import openai
import sys
from youtube_transcript_api import YouTubeTranscriptApi

diagnostics = 0
include_mentions = 0

def get_video_id_from_video_id_or_url(video_id_or_url):
    # a youtube video id is 11 characters long
    # if the argument is longer than that, then it's a url
    if len(video_id_or_url) > 11:
        # it's a url; the video id is the last 11 characters
        return video_id_or_url[-11:]
    else:
        # it's a video id
        return video_id_or_url

def get_chunks_from_youtube(video_id):
    # this function will get the transcript of a youtube video
    # and return it as an array of chunks,
    # where each chunk is an array of lines

    # first get the transcript
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    chunks = []

    start_timestamp = 0.0
    current_timestamp_mins = 0.0

    current_chunk = []

    for entry in transcript:
        current_timestamp_mins = entry['start'] / 60.0

        # if the current timestamp is more than 10 minutes after the start timestamp
        # then we have a chunk
        if current_timestamp_mins - start_timestamp > 10:
            # add the current chunk to the list of chunks
            chunks.append(current_chunk)
            # reset the start timestamp
            start_timestamp = current_timestamp_mins
            # reset the current chunk
            current_chunk = []

        # add the line to the current chunk
        current_chunk.append(entry['text'])

    # add the last chunk
    if len(current_chunk) > 0:
        chunks.append(current_chunk)

    print(f"Found {len(chunks)} chunks")

    return chunks

def summarize_chunk(index, chunk):
    chunk_str = "\n".join(chunk)
    prompt = f"""The following is a section of the transcript of a youtube video. It is section #{index+1}:
{chunk_str}
Summarize this section of the transcript."""

    if diagnostics:
        # print each line of the prompt with a leading # so we can see it in the output
        for line in prompt.split('\n'):
            print(f"# {line}")

    completion = openai.Completion.create(
        engine="text-davinci-003",
        max_tokens=500,
        temperature=0.2,
        prompt=prompt,
        frequency_penalty=0
    )

    msg = completion.choices[0].text

    if diagnostics:
        print(f"# Response: {msg}")

    return msg

def summarize_the_summaries(summaries):
    summaries_str = ""
    for index, summary in enumerate(summaries):
        summaries_str += f"Summary of chunk {index+1}:\n{summary}\n\n"

    prompt = f"""The following are summaries of a youtube video in 10 minute chunks:
{summaries_str}
Summarize the summaries."""

    if diagnostics:
        # print each line of the prompt with a leading # so we can see it in the output
        for line in prompt.split('\n'):
            print(f"# {line}")

    completion = openai.Completion.create(
        engine="text-davinci-003",
        max_tokens=500,
        temperature=0.2,
        prompt=prompt,
        frequency_penalty=0
    )

    msg = completion.choices[0].text

    if diagnostics:
        print(f"# Response: {msg}")

    return msg

def main():
    if len(sys.argv) < 2:
        print("Usage: python3 ytsummary.py <video id or url>")
        sys.exit(1)

    video_id_or_url = sys.argv[1]

    # if the video id or url is a url, extract the video id
    video_id = get_video_id_from_video_id_or_url(video_id_or_url)

    if len(sys.argv) > 2:
        for arg in sys.argv[2:]:
            if arg == "--diagnostics":
                global diagnostics
                diagnostics = True

            if arg == "--mentions":
                global include_mentions
                include_mentions = True

    chunks = get_chunks_from_youtube(video_id)

    if len(chunks) == 0:
        print("No chunks found")
    elif len(chunks) == 1:
        summary = summarize_chunk(0, chunks[0])
        print(f"\nSummary: {summary}")
    else:
        # Now we have the chunks, we can summarize each one
        summaries = []
        for index, chunk in enumerate(chunks):
            summary = summarize_chunk(index, chunk)
            summaries.append(summary)
            print(f"\nSummary of chunk {index+1}: {summary}")

        # Now we have the summaries, we can summarize the summaries
        summary_of_summaries = summarize_the_summaries(summaries)

        print(f"\nSummary of summaries: {summary_of_summaries}")

if __name__ == "__main__":
    main()

---
Provide a description in a couple of paragraphs of how this program works.

This program is a python script that is used to summarize the transcript of a
YouTube video. It takes a YouTube video ID or URL as an argument and then
retrieves the transcript of the video. The transcript is then split into
chunks of 10 minutes each. For each chunk, the program calls an API from
OpenAI to summarize the chunk into a single sentence. Once all of the chunks
have been summarized, the program calls the OpenAI API again to summarize
the summaries of the chunks. Finally, it prints out the summary of the
summaries.
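
By the way, you don’t have to use the sandbox for this; the same trick works straight through the API. Here’s a minimal sketch using the same Completion call the script itself makes (describe_program is just a name I made up for this post, not something in ytsummary.py):

import openai

def describe_program(source_code):
    # wrap the source in the same prompt shape I used in the sandbox
    prompt = f"""Here is a program written in python.
---
{source_code}
---
Provide a description in a couple of paragraphs of how this program works."""

    completion = openai.Completion.create(
        engine="text-davinci-003",
        max_tokens=500,
        temperature=0.2,
        prompt=prompt
    )
    return completion.choices[0].text

with open("ytsummary.py") as f:
    print(describe_program(f.read()))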

It’s child’s play, really. I guess that’s what it means to have AGI (Artificial General Intelligence) available.

I’ve been working with AGI for the last year or so. I’ve been tinkering with the original davinci model, and I was on the beta test of GitHub Copilot (I currently pay for it, and man, if you’re a developer, it’s the best $10/month you could spend on anything). And of course I’ve been playing with the image generators — fun!

I *do* believe this technology is AGI, in the same sense that computers are General Purpose Machines. Computers are called General Purpose Machines, as distinct from Special Purpose Machines, because they are programmable. Likewise, traditional computer software is a form of Special Purpose Intelligence: it can only do the one intelligent thing it was written to do. The new large language models, though, are General Purpose Intelligence, or just General Intelligence, because they can be applied to a whole plethora of tasks that require intelligence, tasks they were never specifically programmed or designed to do.

End of rant. I hope you find this youtube summarizer useful!
