Summarize YouTube videos with LlamaIndex and GPT

Daniel Crouthamel
9 min read · Mar 17, 2023

--

Preface:

Update 08/2023 — A lot of links and code in this article are now dead. I have an updated article below, which makes use of llama-index version 0.8.0. It’s always changing! I can’t keep up!

Updated Article

At the time I wrote this article I was using a version of LlamaIndex < 0.5. If you use LlamaIndex 0.5 or greater, then the code pertaining to it below will not work. I’m currently making changes and will provide an update soon. There are significant API changes starting with version 0.5.

https://github.com/jerryjliu/llama_index/releases/tag/v0.5.0

Introduction

In my previous article, Summarize YouTube Videos with GPT-3, I discussed how to download transcripts of YouTube videos and then summarize and ask various questions of the content using GPT. That implementation was limited to short transcripts, since the combined input and response from the language model can’t exceed a certain number of tokens. In this new article, I’d like to present a more elaborate solution that makes use of LlamaIndex (GPT Index).

ChatGPT does a pretty good job with many NLP tasks for content it was trained on. What about other data, files I’ve accumulated over the years, or just about any large corpus of data or knowledge that you would like insights from? Enter LlamaIndex. Think of it as an interface between your external data and Large Language Models (LLMs).

YouTube Transcripts Download

But before we dive into LlamaIndex, let’s scrape some fun data to work with: Ancient Aliens. My wife and I, along with some friends, recently attended AlienCon in Pasadena since it was close by. It was a fun time! Now I wish I had summarized the videos beforehand, because then I could have gone to AlienCon as an Ancient Astronaut Theorist expert!

The playlist below, the Ancient Aliens: Official Series Playlist, contains 500+ videos. I thought this would be a fun corpus to work with. Much is happening in the AI space (OpenAI is now closed??), and there seems to be a lot happening in the UAP (Unidentified Aerial Phenomena) space as well. I have my own theories on that, and I do think there are interesting times ahead of us.

It’s pretty easy to grab the transcripts from the playlist and download them as text files. The function below takes a playlist ID and creates text files in the output directory defined. You’ll need a YouTube Data API key for this to work. Additionally, make sure your OPENAI_API_KEY is set as well, since we’ll need it later. I’m assuming familiarity with Python here.

import googleapiclient.discovery
import os

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor
from llama_index.indices.knowledge_graph.base import GPTKnowledgeGraphIndex
from langchain.chat_models import ChatOpenAI
from IPython.display import Markdown, display
from pyvis.network import Network
from youtube_transcript_api import YouTubeTranscriptApi

# Use this to set your key, if it's not already set.
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

def save_transcripts_to_files(api_key, playlist_id, output_dir):
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=api_key)

    videos = []
    next_page_token = None

    while True:
        request = youtube.playlistItems().list(
            part="contentDetails,snippet",
            playlistId=playlist_id,
            maxResults=50,  # YouTube only returns 50 at a time, so we need to loop
            pageToken=next_page_token
        )
        response = request.execute()

        for item in response["items"]:
            video_id = item["contentDetails"]["videoId"]
            video_title = item["snippet"]["title"]
            videos.append((video_id, video_title))

        next_page_token = response.get("nextPageToken")

        if not next_page_token:
            break

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for video_id, video_title in videos:
        try:
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            safe_title = "".join([c for c in video_title if c.isalnum() or c.isspace()]).rstrip()
            with open(os.path.join(output_dir, f"{safe_title}.txt"), "w") as file:
                for entry in transcript:
                    file.write(entry['text'] + ' ')
            print(f"Transcript saved to {safe_title}.txt")
        except Exception as e:
            print(f"Error fetching transcript for video ID {video_id} ({video_title}): {str(e)}")


api_key = os.getenv('YOUTUBE_API_KEY')
playlist_id = "PLob1mZcVWOaiVxrCiEyYXcAbmx7UY8ggW"
output_dir = "ancient-aliens-official"

save_transcripts_to_files(api_key, playlist_id, output_dir)

We now have a directory of text files to work with. Note that this could be any directory of files you have, including PDFs. LlamaIndex has data connectors for many different sources of data. In the example below I’ll use the SimpleDirectoryReader, which supports many file types, e.g., .pdf, .docx, .pptx, .md, etc.

LlamaIndex

The Primer to using LlamaIndex is a good place to start. LlamaIndex helps remove concerns over prompt size limitations by letting us query our data and pass only the relevant parts of it to whatever LLM task we wish to perform. This is done by creating an index of embeddings. The description below is taken from the GPTSimpleVectorIndex class; I thought it was a good explanation.

During index construction, the document texts are chunked up, converted to nodes with text; they are then encoded in document embeddings stored within the dict. During query time, the index uses the dict to query for the top k most similar nodes, and synthesizes an answer from the retrieved nodes.
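To make the retrieval step concrete, here is a minimal, library-free sketch of the idea behind the vector index: score the query embedding against the stored chunk embeddings with cosine similarity and keep the top k chunks to hand to the LLM. This is only an illustration of the concept, not LlamaIndex’s actual implementation.

import numpy as np

def top_k_chunks(query_embedding, chunk_embeddings, chunk_texts, k=3):
    """Return the k chunk texts most similar to the query embedding."""
    q = np.asarray(query_embedding)
    m = np.asarray(chunk_embeddings)
    # Cosine similarity between the query and every stored chunk embedding
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-10)
    top_idx = np.argsort(sims)[::-1][:k]
    return [chunk_texts[i] for i in top_idx]

# The retrieved chunks are then packed into the prompt sent to the LLM,
# which is how we stay under the model's token limit.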

We can break up the usage of LlamaIndex into 3 parts.

  • Load Documents
  • Create Index
  • Query Index

Let’s look at the first two parts in the code below. We first read in the transcripts we created with a single function call, and then create the index with another. In this case, I’m using the GPTSimpleVectorIndex. There are other types of indexes you can create for other use cases, so be sure to check those out. I’ll keep it simple for now, and then look at a Knowledge Graph index later.

# Read documents from disk
documents = SimpleDirectoryReader('ancient-aliens-official').load_data()

# Build Index, and save to disk
index = GPTSimpleVectorIndex(documents)
index.save_to_disk('index/index-ancient-aliens-official.json')

Update Since Publishing: Play with the chunk_size_limit and include_extra_info params when building the index. Smaller chunks may work better for your use case. For the include_extra_info param, passing the titles of the YouTube videos will provide extra metadata.
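As a rough sketch of what that tuning might look like with the pre-0.5 API I was using (the file_metadata callback and the exact keyword names are my assumptions here, so check the docs for your version):

# Sketch only: attach the video title (derived from the file name) as extra
# metadata, and use smaller chunks when building the index.
documents = SimpleDirectoryReader(
    'ancient-aliens-official',
    file_metadata=lambda path: {"title": os.path.splitext(os.path.basename(path))[0]},
).load_data()

index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)
index.save_to_disk('index/index-ancient-aliens-official.json')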

Update 2! If you are using llama-index version 0.5 or later, the API is different. You no longer work with Document objects directly, but rather with Nodes. Therefore, the code below will no longer work. Check out their Discord server if you need help. That’s where I go!

Release v0.5.0 · jerryjliu/llama_index (github.com)

Query Index

Now the fun part: let’s query! Below I load the index and then query it with the gpt-3.5-turbo model. GPT-4 was just announced, and I’m excited to see what support there will be for it in LlamaIndex. I’ll be watching closely.

# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index/index-ancient-aliens-official.json')

# LLM Predictor (gpt-3.5-turbo)
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))

# Query
response = index.query(
    "Summarize the Alien DNA Found in Octopus Genetics episode",
    llm_predictor=llm_predictor,
    similarity_top_k=3
)

# Display response
display(Markdown(f"<b>{response}</b>"))

We already know the answer to the question below. I’m not saying it’s aliens, but it might be. Let’s see what our query returns.

Query: “Summarize the Alien DNA Found in Octopus Genetics episode”

The episode explores the claim made by a team of 33 scientists in March 2018 that octopuses possess extraterrestrial DNA. The octopus genome is found to be more complex than the human genome, with around 50,000 genes compared to the human’s 25,000 genes. There is evidence to suggest that this complexity came from outside, as scientists study octopuses and find no connection to any ancestor of the octopus. It’s like these creatures were brought here in their entirety and put into our oceans. The episode speculates that the octopus may be related to a race of extraterrestrials that visited Earth thousands of years ago, as depicted in mythological accounts from across the ancient world that depict a creator God in the form of an octopus. The episode also explores the theory that the octopus’s unique genetics may have been influenced by organic material from meteorites that crashed into the ocean, which could have been sent by extraterrestrials to interact with life on Earth and accelerate its evolution.

Fascinating. Let’s try another query.

Query: “Tell me more about the Stone Spheres in Costa Rica”

The Stone Spheres in Costa Rica are a mystery, with some theories suggesting that they were used for ceremonial rituals, as star maps, or to concentrate thought and establish a bridge to another dimension. However, ancient astronaut theorists suggest that they could be misunderstood technology, similar to the Mayan stone giants found in Copan, Honduras. These stone giants were designed in such a way as to be able to digitally store valuable data, and it’s possible that the Stone Spheres in Costa Rica could have a similar purpose. In Sierra Leone, Africa, mysterious figures known as Nomoli have been discovered, which represent sky gods. When they fell from the heaven, part of the sky turned to stone and rained down with them. Curiously, strange blue stones called sky stones have been found buried with many of the Nomoli figurines. Tests conducted on the stones found with the Nomoli statues have so far produced conflicting and sometimes even confounding results, with some suggesting that they could be of extraterrestrial origin. The stones have been checked for the common blue minerals, but they cannot tell us whether these are natural or artificial, what’s causing the blue color, if this is ancient or modern.

Another pretty detailed response. I would have to watch the episode or look at the transcript to verify whether all of the above is actually mentioned. I suspect it is, but I do think ChatGPT will pull in information from outside our knowledge base when it can’t answer adequately from it.
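One way to nudge the model to stick to the retrieved transcript chunks is to pass a custom question-answer prompt at query time. Here is a minimal sketch following the pre-0.5 custom prompt pattern; the class name and the text_qa_template keyword are from my recollection of that API, so verify them against your version.

from llama_index import QuestionAnswerPrompt

# Instruct the model to answer strictly from the retrieved transcript chunks.
QA_PROMPT_TMPL = (
    "Context information from the transcripts is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only this context and no prior knowledge, answer the question: {query_str}\n"
)
QA_PROMPT = QuestionAnswerPrompt(QA_PROMPT_TMPL)

response = index.query(
    "Tell me more about the Stone Spheres in Costa Rica",
    text_qa_template=QA_PROMPT,
    llm_predictor=llm_predictor,
    similarity_top_k=3
)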

Update: I did finally get access to the GPT-4 API, and the results seem to be a bit richer. I also changed max_tokens to 512 (for longer output) when creating the LLMPredictor, and I’m using a custom refine template. Below is the response using GPT-4 for the same question above.

The Stone Spheres in Costa Rica, also known as Las Bolas, were discovered in the 1930s while clearing jungles. Over 300 stone spheres have been found at various archaeological sites across the region. These spheres are some of the most precise stone carvings in the ancient world, with some experts believing they could be 96% completely spherical. They vary in size, with some being as large as 10 feet in diameter, while others are much smaller, similar to the size of a basketball or car tire.

Creating a perfectly spherical round ball is difficult, especially in a primitive area like Costa Rica. The purpose of these stone spheres remains a mystery. Some smaller balls were found buried with jungle-dwelling local chiefs, but the larger ones could not fit in any grave. According to Costa Rican legends, the stone spheres were used as cannonballs by the god of thunder to drive away the god of wind and hurricanes. Some researchers suggest that these spheres might have been used as star maps or to establish a connection to celestial beings.

In a similar vein, ancient astronaut theorist Giorgio Tsoukalos has proposed that certain ancient stone structures, like the Mayan stone giants at Copan, Honduras, could have been designed to digitally store valuable data. While this theory has not been directly applied to the Stone Spheres of Costa Rica, it is an interesting perspective on the potential purpose and technological capabilities of ancient civilizations. Additionally, the discovery of mysterious blue stones called “sky stones” in Sierra Leone, which are believed to have an extraterrestrial origin, raises further questions about the possible connections between ancient artifacts and celestial beings.
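For reference, the GPT-4 setup described above looked roughly like this. The refine prompt wording is just an example, and the RefinePrompt import path and refine_template keyword reflect my understanding of the pre-0.5 query API, so treat this as a sketch rather than the exact code.

from llama_index.prompts.prompts import RefinePrompt

# GPT-4 predictor with a larger output budget (max_tokens=512)
llm_predictor_gpt4 = LLMPredictor(
    llm=ChatOpenAI(temperature=0, model_name="gpt-4", max_tokens=512)
)

# Example refine prompt: only update the existing answer if the new context adds detail.
REFINE_PROMPT_TMPL = (
    "The original question is: {query_str}\n"
    "We have an existing answer: {existing_answer}\n"
    "Refine the existing answer (only if needed) with the new context below.\n"
    "---------------------\n"
    "{context_msg}\n"
    "---------------------\n"
)
REFINE_PROMPT = RefinePrompt(REFINE_PROMPT_TMPL)

response = index.query(
    "Tell me more about the Stone Spheres in Costa Rica",
    llm_predictor=llm_predictor_gpt4,
    refine_template=REFINE_PROMPT,
    similarity_top_k=3
)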

Knowledge Graph Index

As mentioned earlier, there are other indexes we can create. I’d like to conclude by looking at the GPTKnowledgeGraphIndex and then creating a visual representation of the graph. The source code comments for the index are sparse: “Build a KG by extracting triplets, and leveraging the KG during query-time.” Let’s give it a try and see what happens. I’ll have to dig into the source code later to better understand what it’s doing.

The code below builds the KG index for the document set and the llm_predictor defined earlier, and then saves it to disk. Note, this took quite some time for me, about an hour, so be prepared. See the KGDemo notebook for more info.

# Build Index, and save to disk
# NOTE: can take a while!
index = GPTKnowledgeGraphIndex(
    documents,
    chunk_size_limit=512,
    max_triplets_per_chunk=2,
    llm_predictor=llm_predictor
)

index.save_to_disk('index/index-ancient-aliens-official-kg.json')

Once the index is built, we can create an HTML file to visualize the graph.

## create graph
index = GPTKnowledgeGraphIndex.load_from_disk('index/index-ancient-aliens-official-kg.json')

g = index.get_networkx_graph()
net = Network(notebook=False, cdn_resources="in_line", directed=True, height="1200px")
net.from_nx(g)

html = net.generate_html()
with open("ancient-aliens-official-graph.html", mode='w', encoding='utf-8') as fp:
    fp.write(html)

Open the HTML file in a browser. Note, this too will take some time to load, maybe 5 minutes. I probably chose a poor example, but I was curious to see what sort of connections might be made from all the content mentioned in the Ancient Aliens episodes. The graph is shown below.

Ancient Aliens Knowledge Graph

Wow, what’s going on in the middle there? Let’s zoom in; there’s much to see.

Interesting. I suppose you can navigate around the graph and get a sense of what topics are being discussed and how they are related, which is neat to see.
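Beyond the visualization, the KG index can also be queried directly, with the extracted triplets guiding retrieval. Here is a minimal sketch modeled on the old KGDemo notebook; the include_text and response_mode arguments are my assumptions about that version of the API.

# Query the knowledge graph index; the extracted triplets drive retrieval.
response = index.query(
    "What connections are made between octopuses and extraterrestrials?",
    include_text=False,
    response_mode="tree_summarize"
)
display(Markdown(f"<b>{response}</b>"))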

Thanks for reading!

I hope this gives you some idea of how one might connect external data to a large language model like GPT. There are many other types of indexes and functionality to explore with LlamaIndex, depending on your use case. I suspect that, in time, Microsoft will bake something like this into the OS. It would be nice to be able to right-click on a folder and ask questions about, or summarize, the files you have on your own computer.

--


Daniel Crouthamel

Epic Tapestry Consulting / ML Engineer, Graph DB and NLP Enthusiast / Quinquagenarian