Summarize YouTube Videos with LlamaIndex — Part 2

Introduction

Daniel Crouthamel
18 min read · Aug 13, 2023

A few months ago I published an article showcasing how one could use LlamaIndex to summarize YouTube videos based on their transcripts. LlamaIndex has changed a lot since then: many of the links in that previous article are now dead, and the code no longer runs. Things change so quickly today that it's very hard to keep up! Nevertheless, I thought it would be fun to create a new project that uses the updated API. Note, at the time of this writing I'm using llama-index version 0.8.0. The notebook for this article can be found on GitHub.

The areas that will be covered in this short article are the following:

  • YouTube transcript downloader
  • Loading transcripts and creating a VectorStore index
  • Persist index to disk and reload
  • Query index and examine nodes/chunks used by LLM
  • Using similarity_top_k to change the number of nodes/chunks sent to LLM
  • Using Llama-Index debugging to examine events
  • Refreshing an index with new documents
  • Using Jupyter-to-Medium to publish notebook to Medium

LlamaIndex has many different index structures that can be used, but this article will only use the VectorStore index. Future articles will take a deeper dive into the different index structures and how to use them. I also want to look at using local resources for embedding and response generation, as opposed to using OpenAI.

YouTube Transcript Downloader

Before moving into the details of LlamaIndex, I’ll post the code that I use to download transcripts from YouTube playlists. Once again, I’ll use the Ancient Aliens playlist for my corpus of data to play with. UAPs and extraterrestrial biologics are quite an interesting topic these days. But if that doesn’t suit you, then feel free to change the playlist ID and follow along with whatever topic/playlist that interests you!

As mentioned in the previous article, you will need to obtain a YouTube Data API key. The one major change I made to the script since last time was to only download transcripts for playlist videos that aren't already present in my local directory. I do this by sorting the videos by publish date, descending; once I encounter a transcript I already have, I stop.

import os

import googleapiclient.discovery
from youtube_transcript_api import YouTubeTranscriptApi


def save_transcripts_to_files(api_key, playlist_id, output_dir):
    # Build the YouTube API client using the provided API key
    youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=api_key)

    # Get all the videos in the playlist, sorted by date
    videos = []
    next_page_token = None
    while True:
        request = youtube.playlistItems().list(
            part="contentDetails,snippet",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page_token
        )
        response = request.execute()

        # Add each video to the list of videos
        for item in response["items"]:
            video_id = item["contentDetails"]["videoId"]
            video_title = item["snippet"]["title"]
            video_date = item["snippet"]["publishedAt"]
            videos.append((video_id, video_title, video_date))

        # Check if there are more videos to fetch
        next_page_token = response.get("nextPageToken")
        if not next_page_token:
            break

    # Sort the videos by date, descending. Once we reach a file that already exists, we can stop.
    # This allows us to run the script again later and only fetch new videos.
    videos.sort(key=lambda x: x[2], reverse=True)

    # Create the output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # For each video, get the transcript and save it to a file if it doesn't already exist
    for video_id, video_title, video_date in videos:
        try:
            # Remove any non-alphanumeric characters from the video title and use it as the filename
            safe_title = "".join([c for c in video_title if c.isalnum() or c.isspace()]).rstrip()
            filename = os.path.join(output_dir, f"{safe_title}.txt")
            if os.path.exists(filename):
                # If the file already exists, assume the rest are there too and stop
                break
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            with open(filename, "w") as file:
                # Write each transcript entry to the file
                for entry in transcript:
                    file.write(entry['text'] + ' ')
            print(f"Transcript saved to {safe_title}.txt")
        except Exception as e:
            print(f"Error fetching transcript for video ID {video_id} ({video_title}): {str(e)}")


api_key = os.getenv('YOUTUBE_API_KEY')
playlist_id = "PLob1mZcVWOaiVxrCiEyYXcAbmx7UY8ggW"
output_dir = "transcripts/ancient-aliens-official"

save_transcripts_to_files(api_key, playlist_id, output_dir)

LlamaIndex

The LlamaIndex documentation has also changed quite a bit since I last used the project back in March. I highly recommend reading through the information a bit and getting a feel for what it can do. A good place to start would be the Basic Usage Pattern tutorial. It’s pretty easy to get up and running quickly. You can read in the transcripts or whatever text you have, create an index, and then execute a query in just a few lines of code.

Another great resource for learning LlamaIndex is their Discord server. It's very active and there are many people there who can help you out. Additionally, they have a chat bot that will answer your questions about how to use the API. It's amazing how well it works. I've learned so much about the API by just playing with some code and asking the chatbot questions. It's a lot faster than reading through the documentation, though there will be times you'll want to go through the docs in more detail. The chatbot does a great job of pointing you in the right direction, and you can go from there.

Read Data and Create Index

Let's go through the code below. The process begins by reading in the transcript text files using SimpleDirectoryReader. While doing so, the filename gets added as metadata with a key of 'episode_title'. This helps bias the search later based on episode title. Additionally, a flag is passed to indicate that the filename should be used as the ID for the Document; this will come in handy later when refreshing the index with newly downloaded transcripts. Finally, each document in the collection is tagged so that the metadata isn't passed to the LLM, since the LLM doesn't need to read it when generating a response. Again, the metadata is only used to help find the chunks of data to pass to the LLM.

Next, an LLM object is created using the gpt-4 model. Note, if not specified, the default text generation model is gpt-3.5-turbo, and the default embedding model is text-embedding-ada-002. When creating the LLM object, max_tokens is set to 1024, which caps the length of the response we get back from the LLM. For gpt-4, the maximum number of tokens is 8192, and this limit covers both input and output. More on that later.

A ServiceContext object is then created, which is used during the indexing and querying stages of a LlamaIndex pipeline/application. chunk_size is set to 1024 tokens, which is the size of the chunks the transcript text is broken up into. Finally, a VectorStoreIndex is created and saved to disk. LlamaIndex has many types of indexes, and the right choice is dictated by the use case. Again, I hope to explore them more in future articles.

# pip install llama-index
# pip install ipywidgets
# pip install nltk - needed for version 0.7.24 and greater

import os
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO) # Change INFO to DEBUG if you want more extensive logging
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI

from IPython.display import Markdown, display

transcript_directory = "transcripts/ancient-aliens-official"
storage_directory = "storage/ancient-aliens-official/vector"

# Add filename as metadata to each chunk associated with a document/transcript
filename_fn = lambda filename: {'episode_title': filename} #Future - consider using the below instead of full path
#filename_fn = lambda filename: {'episode_title': os.path.splitext(os.path.basename(filename))[0]}
documents = SimpleDirectoryReader(transcript_directory, filename_as_id=True,
                                  file_metadata=filename_fn).load_data()

# Exclude metadata from the LLM, meaning it won't read it when generating a response.
# Future - consider looping over documents and setting the id_ to basename, instead of fullpath
for document in documents:
    document.excluded_llm_metadata_keys.append('episode_title')

# max tokens will impact the length of the output from the LLM, for OpenAI the default is 256 tokens
llm = OpenAI(temperature=0, max_tokens=1024, model="gpt-4")

# chunk_size - It defines the size of the chunks (or nodes) that documents are broken into when they are indexed by LlamaIndex
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)

# Build the index
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

# Persist the index to disk
index.storage_context.persist(persist_dir=storage_directory)

Reload Index

The index can be reloaded later so that one doesn’t have to keep rebuilding an index. The code below can be used to accomplish that.

# Now you can load the index from disk when needed, and not rebuild it each time.
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO) # Change INFO to DEBUG if you want more extensive logging
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index import StorageContext, load_index_from_storage
from llama_index.llms import OpenAI

from IPython.display import Markdown, display

transcript_directory = "transcripts/ancient-aliens-official"
storage_directory = "storage/ancient-aliens-official/vector"

llm = OpenAI(temperature=0, max_tokens=1024, model="gpt-4")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)
storage_context = StorageContext.from_defaults(persist_dir=storage_directory)
index = load_index_from_storage(storage_context, service_context=service_context)

Query Index

Now the fun part, let’s query the index and see what sort of response we get back. The question is similar to what I asked in the previous article.

query_engine = index.as_query_engine()
response = query_engine.query("Summarize the Alien DNA Found in Octopus Genetics episode in 3 paragraphs")
display(Markdown(f"<b>{response}</b>"))

In March 2018, a scientific paper by a team of 33 scientists, including Dr. Chandra Wickramasinghe, claimed that octopuses possess extraterrestrial DNA. The octopus genome was found to be incredibly complex, with around 50,000 genes compared to the human’s 25,000. The scientists were amazed by their findings, as they could not find a connection to any ancestor of the octopi, suggesting that these creatures might have been brought to Earth in their entirety.

The octopus is a fascinating organism with unique characteristics. Its brain is distributed, with parts in its arms as well as the central brain in its head. Octopuses are adept at camouflaging themselves by changing their color to match their surroundings almost instantaneously. They can also manipulate their bodies into different shapes to imitate other animals. Some scientists speculate that in the absence of humans, the octopus could evolve into the dominant species on the planet due to its ability to edit its own genetic code.

The octopus’s ability to edit its own genetic code is not fully understood, but it suggests that the octopus can rapidly adapt to its environment faster than other creatures. This ability is mediated by changes in the environment, usually temperature. There are also mythological accounts from across the ancient world that suggest the octopus could be related to a race of extraterrestrials that visited Earth thousands of years ago. The question remains: is the octopus truly alien to this planet, or were our ancestors encountering intelligent creatures that were the result of alien experimentation?

Examine Source Nodes

The above summary looks pretty good to me! So which nodes or chunks were used to produce this response? The code below can be used to see that.

# Print the number of source nodes
num_source_nodes = len(response.source_nodes)
print(f"Number of source nodes: {num_source_nodes}")

# Loop over source nodes and print meta data
for s in response.source_nodes:
    print(f"Node Score: {s.score}")
    print(s.node.metadata)
Number of source nodes: 2
Node Score: 0.8942623344163917
{'episode_title': 'transcripts\\ancient-aliens-official\\Alien DNA Found in Octopus Genetics Ancient Aliens.txt'}
Node Score: 0.8879728057031845
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens MindBlowing Proof of Alien DNA Found on Earth.txt'}

Above we see that two nodes/chunks were used: the node for the actual transcript we are interested in, and one other. It turns out the transcript of interest fits into a single node; it's relatively short. When making a vector store query, by default the top 2 nodes/chunks are retrieved from the index using cosine similarity between the embedding vector of the query and the embedding vectors in the index. The parameter that controls this is called similarity_top_k.
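To make that scoring concrete, here's a toy sketch of the cosine-similarity ranking a vector store retriever performs. This is my own illustration, not LlamaIndex internals; the 3-dimensional vectors are made up, whereas real text-embedding-ada-002 embeddings have 1536 dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = np.array([0.1, 0.7, 0.2])     # made-up query embedding
chunk_vecs = [np.array([0.2, 0.6, 0.1]),  # made-up chunk embeddings
              np.array([0.9, 0.1, 0.3]),
              np.array([0.1, 0.8, 0.3])]

# Score every chunk against the query, then keep the top k (similarity_top_k)
scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
scores.sort(key=lambda x: x[1], reverse=True)
print(scores[:2])  # the two best-matching chunks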

Let's try another query, focusing on the Crystal Skulls full episode. This transcript has about 5,500 words. The ratio between tokens and words is roughly 100 tokens to 75 words, so there are about 7,300 tokens in the transcript. Each chunk is 1024 tokens, which means a little over 7 nodes/chunks are needed to represent the data. Additionally, when creating a new chunk, some of the previous chunk's text is included; this is called chunk overlap, and the default is 20 tokens. With all that said, a similarity_top_k of 8 should be sufficient to capture all the chunks needed to send to the LLM.
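Here's that back-of-the-envelope math as a quick sanity check (my own sketch, using the word count and the defaults mentioned above):

import math

words = 5500                  # approximate word count of the transcript
tokens = words * 100 / 75     # ~100 tokens per 75 words -> ~7333 tokens
chunk_size = 1024             # tokens per chunk
chunk_overlap = 20            # tokens repeated from the previous chunk
new_tokens_per_chunk = chunk_size - chunk_overlap

num_chunks = math.ceil(tokens / new_tokens_per_chunk)
print(f"~{tokens:.0f} tokens -> {num_chunks} chunks")  # ~7333 tokens -> 8 chunks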

Let’s check if the document was indeed broken up into 8 chunks. We can use the code below to get the nodes associated with a document and count them.

from llama_index.node_parser import SimpleNodeParser

# Retrieve the document by id
document_id = 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'
document = [doc for doc in documents if doc.id_ == document_id] # this will return just 1 document

# Parse the document into nodes
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents([document[0]])

# Count the number of nodes
num_nodes = len(nodes)
print('Number of nodes: ', num_nodes)
Number of nodes:  8

Similarity Top K

So our back-of-the-envelope calculation does indeed work! Let's now try a query with a value of 8 for similarity_top_k and then examine the source nodes that contributed to the LLM's response.

query_engine = index.as_query_engine(similarity_top_k=8)
response = query_engine.query("Please provide a comprehensive summary of the Crystal Skulls full Episode in 3 detailed paragraphs.")
display(Markdown(f"<b>{response}</b>"))

The episode commences with an investigation into the enigma of the crystal skulls, artifacts shaped like human heads that are believed to hold immense knowledge and wisdom crucial to human survival. The origins of these artifacts are disputed, with some suggesting they are part of an intricate hoax, while others propose they disclose an alien plan. The episode delves into the belief held by millions worldwide that extraterrestrial beings have visited us in the past. The crystal skulls are seen as repositories of profound knowledge and wisdom, and legend has it that there are only 13, each possessing a formidable mystical power.

The narrative then delves into the history and unearthing of these skulls. In the late 1800s, museums in London and Paris exhibited crystal skulls as authentic Mesoamerican religious artifacts. However, contemporary scientists now argue that they are all recent creations. They also maintain that any analysis of the skulls is incomplete as quartz crystal cannot be carbon-dated. The episode also explores the notion that ancient societies tapped into other dimensions using crystal objects. There is a belief that crystal possesses a quality that transcends dimensions and can send and receive messages from other realms.

The episode wraps up by exploring the hypothesis that the crystal skulls could be components of an intricate ancient computer system, used for both storing and transmitting information. It discusses the concept that each crystal skull could be a type of computer chip, part of an extraterrestrial computer’s motherboard. The episode concludes with the belief that the true power of the skulls will be unveiled once all 13 legendary skulls are found. These skulls are thought to harness potent forces and disclose the truth about humanity’s purpose.

# Print the number of source nodes
num_source_nodes = len(response.source_nodes)
print(f"Number of source nodes: {num_source_nodes}")

# Loop over source nodes and print meta data
for s in response.source_nodes:
    print(f"Node Score: {s.score}")
    print(s.node.node_id)
    print(s.node.metadata)
Number of source nodes: 8
Node Score: 0.855703817064939
7117ff0f-68eb-4f47-9f42-c262c390901c
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8525321796514884
1f727660-b037-4f10-b5ed-739ff14e3d46
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8497725026760122
49d43079-a87b-4217-b0ad-34a37ae16eb3
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8476291464541016
c4318e88-29b4-4148-a027-6575e002cbdc
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8471013560251529
00b2565f-8435-4bf4-a752-4310f14838e8
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8428367040811128
37eb6c07-8b8e-4fb7-9ff2-26410b0a215f
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.838501569821164
dd786d50-1b60-4984-bdd6-0724dc669259
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}
Node Score: 0.8347476310216746
4dd79e3b-2664-4a9a-8a96-de92ceffc042
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens The Crystal Skulls S6 E2 Full Episode.txt'}

Now you might be thinking: how can we send 8 chunks of 1024 tokens each and get back a 1024-token response, when the maximum allowed number of tokens is 8192? LlamaIndex handles this situation by breaking the matching results into batches that fit into the prompt. This concept is called "refining" answers in LlamaIndex. After LlamaIndex gets an initial answer from the first API call, it sends the next chunk(s) to the API, along with the previous answer, and asks the model to refine that answer.
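As a rough mental model, the refine strategy looks something like the sketch below. This is a simplified illustration, not LlamaIndex's actual implementation: llm stands in for a call to the chat model, and fit_chunks is a hypothetical helper that packs as many chunks as the context budget allows.

def refine_answer(llm, question, chunks, context_budget):
    # First call: answer the question from as many chunks as fit in the prompt
    batch, remaining = fit_chunks(chunks, context_budget)  # hypothetical helper
    answer = llm(f"Context: {batch}\n\nAnswer the question: {question}")

    # Each later call hands the model the previous answer plus the next
    # batch of chunks and asks it to refine (or simply repeat) that answer
    while remaining:
        batch, remaining = fit_chunks(remaining, context_budget)
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {batch}\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer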

Llama Debug Handler

We can use the LlamaIndex debugging handler to explore this in more detail and take a look at the events that are happening. Note, according to the documentation, the below is a beta feature and so the API is subject to change.

The below code sets up the debug handlers. If you want to debug what happens during index construction, you can uncomment the appropriate line of code below. But I didn’t want to rebuild the index here, so I pass the updated ServiceContext to the query engine instead.

Based on the analysis above, how many LLM calls should we see? I think two! The 8 chunks alone come to roughly 8192 tokens, and adding the prompt text and the 1024-token response we want pushes past a single 8192-token context, so the work has to be split across two LLM calls.

from llama_index.callbacks import CallbackManager, LlamaDebugHandler, CBEventType

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager, llm=llm
)

# If you want to debug what happens when constructing the index, you can use the following code
#index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

query_engine = index.as_query_engine(similarity_top_k=8, service_context=service_context)
response = query_engine.query("Please provide a comprehensive summary of the Crystal Skulls full Episode in 3 detailed paragraphs.")
# Print info on the LLM calls during the list index query
print(llama_debug.get_event_time_info(CBEventType.LLM))
**********
Trace: query
|_CBEventType.QUERY -> 51.384615 seconds
|_CBEventType.RETRIEVE -> 0.44382 seconds
|_CBEventType.SYNTHESIZE -> 50.939792 seconds
|_CBEventType.LLM -> 27.173721 seconds
|_CBEventType.LLM -> 23.500291 seconds
**********
EventStats(total_secs=50.674012000000005, average_secs=25.337006000000002, total_count=2)

Above we do indeed see 2 LLM calls! Now let’s examine the inputs and outputs for each LLM call. Event pairs can be used to look at this in more detail. The input for the first LLM call will be represented by [0][0], and the output by [0][1]. And the input for the second LLM call will be represented by [1][0], and its output by [1][1].

Below we create the event pairs and then print out the output of each LLM call. Since the output includes the payload information that was sent to begin with, I didn’t bother printing out the input. Here we can see things like what model was used, token usage, etc.

You are an expert Q&A system that is trusted around the world. Always answer the question using the provided context information, and not prior knowledge. Some rules to follow: Never directly reference the given context in your answer. Avoid statements like ‘Based on the context, …’ or ‘The context information …’ or anything along those lines.

The above is called the text_qa_prompt and is used in the first query. The answer from that call and the next chunk(s) are then used in subsequent queries with a refine_template prompt, see below. Both of these can be customized at query time, like so:

query_engine = index.as_query_engine(
    text_qa_template=<custom_qa_prompt>,
    refine_template=<custom_refine_prompt>
)

event_pairs = llama_debug.get_llm_inputs_outputs()
print(event_pairs[0][1])
CBEvent(event_type=<CBEventType.LLM: 'llm'>, payload={<EventPayload.MESSAGES: 'messages'>: [ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="You are an expert Q&A system that is trusted around the world.\nAlways answer the question using the provided context information, and not prior knowledge.\nSome rules to follow:\n1. Never directly reference the given context in your answer.\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.", additional_kwargs={}), ChatMessage(role=<MessageRole.USER: 'user'>, content="Context information is below. <INPUT TEXT DELETED TO SAVE SPACE HERE>{
"id": "chatcmpl-7n9oiW1yiwfyQMh7fET743Cyi1qWm",
"object": "chat.completion",
"created": 1691950164,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The episode begins with an exploration of the mysterious crystal skulls, believed by some to be relics with the power to harness the secrets of the universe. These skulls, according to some theories, could be part of an elaborate hoax or reveal an extraterrestrial agenda. The skulls are believed to contain great knowledge and wisdom, vital to the survival of the human race. The episode delves into the belief that millions of people around the world think we have been visited in the past by extraterrestrial beings. The episode also explores the idea that these skulls might be proof of ancient aliens shaping our history.\n\nThe episode then delves into the history and discovery of these crystal skulls. In the late 19th century, museums in London and Paris displayed crystal skulls as genuine Mesoamerican religious relics. However, today, mainstream scientists argue they are all modern creations. The episode also explores the idea that these skulls might have been created using very sophisticated tools in ancient times. Some people believe that because there are tool markings on the teeth of some of the crystal skulls, they are relatively modern fakes. However, others point out that the ancient Maya had wheel carving technologies.\n\nThe episode concludes with the exploration of the theory that the crystal skulls might be part of an elaborate ancient computer system. The skulls could be data gathering devices used for both recording and transmitting information. The episode also explores the idea that each crystal skull might be some type of a computer chip and that they belong to some motherboard of some extraterrestrial computer. The episode ends with the idea that the true power of the skulls will remain a mystery until all 13 legendary skulls are discovered."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 7005,
"completion_tokens": 328,
"total_tokens": 7333

}
}, delta=None)}, time='08/13/2023, 11:09:52.485753', id_='3f143769-c0f2-4dcd-af92-b2f197b74bcb')
print(event_pairs[1][1])
CBEvent(event_type=<CBEventType.LLM: 'llm'>, payload={<EventPayload.MESSAGES: 'messages'>: [ChatMessage(role=<MessageRole.USER: 'user'>, content="You are an expert Q&A system that stricly operates in two modes when refining existing answers:\n1. **Rewrite** an original answer using the new context.\n2. **Repeat** the original answer if the new context isn't useful.\nNever reference the original answer or context directly in your answer.\nWhen in doubt, just repeat the original answer. <INPUT TEXT DELETED TO SAVE SPACE HERE>{
"id": "chatcmpl-7n9p9Q2G7pPBOqJKxlhvXQZZfd65V",
"object": "chat.completion",
"created": 1691950191,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The episode commences with an investigation into the enigmatic crystal skulls, which some individuals believe to be artifacts with the ability to unlock the universe's secrets. Theories suggest that these skulls could either be part of an intricate hoax or evidence of an alien agenda. The skulls are thought to hold immense knowledge and wisdom, crucial for human survival. The episode delves into the belief held by millions worldwide that extraterrestrial beings have visited us in the past. It also explores the notion that these skulls could be evidence of ancient aliens influencing our history.\n\nThe episode then delves into the history and discovery of these crystal skulls. In the late 19th century, museums in London and Paris displayed crystal skulls as genuine Mesoamerican religious relics. However, today, mainstream scientists argue they are all modern creations. The episode also explores the idea that these skulls might have been created using very sophisticated tools in ancient times. Some people believe that because there are tool markings on the teeth of some of the crystal skulls, they are relatively modern fakes. However, others point out that the ancient Maya had wheel carving technologies.\n\nThe episode concludes with the exploration of the theory that the crystal skulls might be part of an elaborate ancient computer system. The skulls could be data gathering devices used for both recording and transmitting information. The episode also explores the idea that each crystal skull might be some type of a computer chip and that they belong to some motherboard of some extraterrestrial computer. The episode ends with the idea that the true power of the skulls will remain a mystery until all 13 legendary skulls are discovered."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 1645,
"completion_tokens": 318,
"total_tokens": 1963

}
}, delta=None)}, time='08/13/2023, 11:10:15.989049', id_='456e2096-af56-4e2e-be20-c3b61c7acfd2')

Modifying the text QA and refine prompt templates can have a dramatic effect on the output. It is something you should play with and modify depending on your use case.
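For example, a custom QA template can be built from a plain template string. The sketch below follows the custom-prompt pattern from the LlamaIndex docs around this version; double-check the Prompt import against the version you have installed. Note that {context_str} and {query_str} are the variables LlamaIndex fills in.

from llama_index import Prompt

custom_qa_prompt = Prompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the question "
    "in the dramatic style of a documentary narrator: {query_str}\n"
)

query_engine = index.as_query_engine(text_qa_template=custom_qa_prompt)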

Refresh Index With New Documents

Refreshing an index with new documents is pretty easy. After adding a single new transcript to the local collection, the code below can be used to update the index. refreshed_docs is a list of True/False values, and there should be only one True value since only one new document was added. We verify that by printing out the number of refreshed documents and the path/title of each.

# Reload documents
filename_fn = lambda filename: {'episode_title': filename}
documents = SimpleDirectoryReader(transcript_directory, filename_as_id=True,
                                  file_metadata=filename_fn).load_data()

# Refresh the index
refreshed_docs = index.refresh_ref_docs(documents,
                                        update_kwargs={"delete_kwargs": {'delete_from_docstore': True}})

# refreshed_docs is a list of True/False values indicating whether the document was refreshed
# Print the number of refreshed documents by counting the True values
print(sum(refreshed_docs))
1
# Expected only a single document to be refreshed, so output the id_ or filename of the refreshed document.
print(documents[refreshed_docs.index(True)].id_)
transcripts\ancient-aliens-official\Ancient Aliens 4 BAFFLING UNSOLVED STONEHENGE MYSTERIES.txt

And finally, another query is made below. Notice, though, that the most recently inserted document did not come back with the highest similarity score! Interesting :)

query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What baffling unsolved mysteries surround Stonehenge?")
display(Markdown(f"<b>{response}</b>"))
**********
Trace: query
|_CBEventType.QUERY -> 14.964499 seconds
|_CBEventType.RETRIEVE -> 0.467517 seconds
|_CBEventType.SYNTHESIZE -> 14.496982 seconds
|_CBEventType.LLM -> 14.489968 seconds
**********

There are several baffling unsolved mysteries surrounding Stonehenge. One of them is the purpose of its construction. While many mainstream scholars suggest that Stonehenge was simply a place of worship and a burial ground, others question why the ancient people of Britain would have spent more than 1,000 years to build what amounts to a church and a cemetery. Another mystery is the identity of the people who built Stonehenge. It is not known who these people were or where they came from. Additionally, it is now known that Stonehenge was part of a much larger superstructure, but the full extent and purpose of this superstructure are not yet understood. There are also questions about the complex and spectacular crop circle designs that have appeared in fields opposite Stonehenge, such as the so-called Julia set. Some speculate that these could be messages from extraterrestrial visitors.

# Print the number of source nodes
num_source_nodes = len(response.source_nodes)
print(f"Number of source nodes: {num_source_nodes}")

# Loop over source nodes and print meta data
for s in response.source_nodes:
    print(f"Node Score: {s.score}")
    print(s.node.metadata)
Number of source nodes: 2
Node Score: 0.874489549704079
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens Stonehenge Revealed as UFO Hotspot.txt'}
Node Score: 0.8709928897632402
{'episode_title': 'transcripts\\ancient-aliens-official\\Ancient Aliens 4 BAFFLING UNSOLVED STONEHENGE MYSTERIES.txt'}

Conclusion

Jupyter-to-Medium

This article was originally written as a Jupyter notebook in VSCode. I then used the jupyter-to-medium library to push the notebook directly to Medium. There is some setup involved, e.g., obtaining a Medium integration token (see link). But once you have that, you can then push the notebook to Medium directly from the command line in a terminal. Note, the below command line syntax didn't work for me.

jupyter_to_medium --pub-name="Dunder Data" --tags="python, data science" "My Awesome Blog Post.ipynb"

All you really have to do is the below, and it should work just fine. From there you can go to Medium and tweak the article as you see fit. I did so with some of the code output to make it look nicer.

jupyter_to_medium “My Awesome Blog Post.ipynb”

Hope you enjoyed this article and/or found it useful! Feel free to ask any questions you might have. Thanks!!

