BulletBrief — Make videos brief with bullet points

Darakarn Limkool
12 min read · Jun 13, 2023

Before we delve deeper into BulletBrief, here is a link to access it:

Demo: https://huggingface.co/spaces/Darakarn/BulletBrief-t5base-modified

Hugging Face model: https://huggingface.co/Darakarn/BulletBrief-t5base

GitHub: https://github.com/Teetydrk/BulletBrief-t5base

Here is why BulletBrief was created:

Photo by Aatik Tasneem on Unsplash

I’m a TED fan, BUT there were many times when I finished a 15-minute video and forgot what the speaker was mainly talking about. My previous solution was to make bullet-point notes from a video, but sometimes I just wanted a quick recap of it. What should I do? After searching Hugging Face for a bullet-point summarizer that handles conversational language, I found that none met my expectations. Hmm… let’s make my own!

Getting Started with Me:

In brief, my goal is to create a model called “BulletBrief” that can generate bullet-point summaries from YouTube videos.

Here are the steps we will go through:

  1. Data Collection and Data Cleaning: Life is HARD.
  2. Exploratory Data Analysis: See what I found.
  3. Metrics: A-level exam for the model.
  4. Model Training: Praying for GPU.
  5. Model Validation: Is the model truly accurate?
  6. Error Analysis: What did I do!
  7. Deployment: Last but not least.

Data collection and data cleaning: Life is HARD

For me, I could say that the most challenging part of this project was creating the dataset. Unfortunately, there were no publicly available subtitle-and-summary datasets for me to use, so I had to create one myself. I selected around 4,138 YouTube videos, containing more than 3,095,358 words, mostly from TED-Ed and Big Think. These videos are similar to podcasts, which made the cleaning process much easier. To handle the large number of files, I utilized the Python os module to simplify the file-system operations. However, after I had finished creating my dataset, I stumbled upon an easier way to convert YouTube links into this format. Here’s the simpler process for converting YouTube links:

  1. Get all the links of the targeted videos.
Example of link collection

2. Convert the links to subtitles: I recommend using YouTubeTranscriptApi, a Python library that allows you to retrieve and work with transcripts from YouTube videos.

from youtube_transcript_api import YouTubeTranscriptApi

def get_subtitle(link):
    try:
        # Extract the video ID from a standard watch URL.
        video_id = link.split("v=")[1]
        # Fetch the English transcript as a list of {'text', 'start', 'duration'} dicts.
        subtitle = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        # Join the text segments into one long string.
        subtitle_text = " ".join([s['text'] for s in subtitle])
        return subtitle_text
    except Exception:
        return "Subtitle not found"
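For example (the link below is a placeholder, not a real video):

print(get_subtitle("https://www.youtube.com/watch?v=VIDEO_ID"))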

3. Use GPT-3.5-turbo to generate bullet summaries: Since I couldn’t manually summarize thousands of subtitles on my own, my advisor suggested making requests to the OpenAI API and receiving responses from the GPT-3.5-turbo model to generate the summaries. A minimal request sketch is shown below.

Thank you image from https://platform.openai.com/docs/models/gpt-3-5
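Here is a minimal sketch of such a request, written against the openai Python library as it worked when this post was published; the prompt below is illustrative, not my exact prompt:

import openai

openai.api_key = "YOUR_API_KEY"  # assumption: replace with your own key

def generate_bullet_summary(subtitle_text):
    # Ask gpt-3.5-turbo for a bullet-point summary of one transcript.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Summarize this transcript as bullet points:\n" + subtitle_text,
        }],
    )
    return response["choices"][0]["message"]["content"]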

4. Data cleaning

  • Subtitles: Remove unnecessary brackets and boilerplate text such as “Transcriber:”, “Translator:”, and “Reviewer:”, and convert the subtitles into a single, lengthy string.

import re

# Strip HTML-style tags and transcript credit lines.
cleaned_contents = re.sub(r'<.*?>', '', cleaned_contents)
cleaned_contents = re.sub(r'Transcriber: .*', '', cleaned_contents)
cleaned_contents = re.sub(r'Translator: .*', '', cleaned_contents)
cleaned_contents = re.sub(r'Reviewer: .*', '', cleaned_contents)
# Drop parenthesized cues such as “(Laughter)” or “(Music)”.
cleaned_contents = re.sub(r'\([^()]*\)', '', cleaned_contents)
# Collapse newlines and repeated whitespace into single spaces.
cleaned_contents = cleaned_contents.replace('\n', ' ')
cleaned_contents = ' '.join(cleaned_contents.split())
  • Summary: Check for and remove any empty files.

import os

folder_path = "summary"

# Remove zero-byte (empty) summary files.
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    if os.path.getsize(file_path) == 0:
        os.remove(file_path)
        print(f"Deleted {file_path}")

5. Format the subtitle and summary in the following structure:

{
  "sub": "a very long string of video subtitles a very long string of video subtitles a very long string of video subtitles ... (the full transcript as one long string)",
  "summary": "- bullet point summary - bullet point summary - bullet point summary - bullet point summary"
}

Now we have the dataset ready in this desired format.

Exploratory Data Analysis: See what I found

By conducting this analysis, I was able to gain deeper insights and a better understanding of the data. And here is what I found:

  • Word Cloud: It visually displays the most common words in the dataset, providing a quick overview of how the vocabulary changes from subtitles to summaries.
  • Top 10 most common words: It provides the same overview in a clearer, more quantifiable way by presenting the frequency of each word (see the sketch after this list).
  • Correlation between the number of words and videos: It helps me understand whether there is any connection between the duration of the videos and the volume of textual content they produce.
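The plots themselves appeared as images in the original post; here is a minimal sketch of how the top-10 word counts can be computed, assuming subtitles and summaries are plain Python lists of strings:

from collections import Counter
import re

def top_words(texts, n=10):
    # Tokenize crudely on letters/apostrophes and count across all documents.
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(words).most_common(n)

print(top_words(subtitles))   # assumed: list of subtitle strings
print(top_words(summaries))   # assumed: list of summary strings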

Metrics: A-level exam for the model

  1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a set of metrics used for evaluating automatic summarization in natural language processing (NLP). It consists of four ROUGE variants that provide different insights into the quality of summaries (a quick usage sketch follows this list):
  • "rouge1": This scoring metric looks at individual words (unigrams) to evaluate the similarity between two pieces of text. It focuses on the overlap of words between the reference and summary texts.
  • "rouge2": This metric considers pairs of consecutive words (bigrams) to measure the similarity between the reference and summary texts. It looks at how well the bigrams match or overlap.
  • "rougeL": This scoring method is based on finding the longest common subsequence between the reference and summary texts. It measures how much of the reference text is captured in the summary.
  • "rougeLSum": Similar to "rougeL," this metric splits the text into individual lines using the newline character ("\n") before calculating the longest common subsequence. It is useful for evaluating summaries presented in a bullet-point format or with distinct lines.

Important note: My model performs abstractive summarization, generating new sentences that may not be present in the original text. As a result, ROUGE metrics, which primarily compare n-grams or common subsequences, may not accurately assess the semantic quality or linguistic fluency of abstractive summaries. Therefore, human evaluation should be implemented in this project.

2. Human Evaluation

  • Google form: The goal of the form is to compare BulletBrief (model A) with another summarization model (model B) in terms of efficiency and practicality. The form provides 10 video subtitles and asks respondents to select the better summary in a blind test.
  • English expert: I asked my English teacher to complete the same form and provide comprehensive comments on both bullet-point summaries.

Model Training: Praying for GPU.

This project was my first time creating an NLP model, so I followed a guide from the Hugging Face tutorial. Personally, I consider Hugging Face the GitHub of NLP. Initially, I decided to use t5-small as the pre-trained model for my first prototype. However, after trying out several pre-trained models on Colab Pro and SageMaker, I ended up training t5-base on Vast.ai. But what is T5?

T5 (Text-to-Text Transfer Transformer)

  • T5 is based on the transformer architecture, which consists of a stack of encoders and decoders. It utilizes a standard encoder-decoder framework where the encoder processes the input sequence, and the decoder generates the output sequence.
  • T5 converts all NLP tasks into a text input-output format (see the sketch after this list).
  • T5 has several pre-trained variants that vary in model size and training data: t5-small, t5-base, t5-large, t5-3b, and t5-11b. Generally, larger models exhibit better performance, but they require more computational resources for training. After experimenting with various T5 variants, I decided to use t5-base, the largest variant that can work effectively with my available GPU.
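As a minimal sketch of the text-to-text idea (the input sentence is made up; the task prefix tells the model what to do):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Every task is plain text in, plain text out; "summarize: " selects the task.
inputs = tokenizer("summarize: " + "Anxiety can tense the gut, speed the heart, and ...",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))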

Let’s begin together, step by step.

  1. Installing Required Packages
!pip install -q transformers 
!pip install -q datasets
!pip install -q evaluate
!pip install -q tokenizers
!pip install -q --upgrade accelerate

2. Importing Libraries:

import pandas as pd
import numpy as np
import torch

3. Importing data: I had two CSV files, so for easier management, I decided to combine them into a single DataFrame.

data_mix1 = pd.read_csv('complete_data_mix1.csv')
data_mix2 = pd.read_csv('complete_data_mix2.csv')
# ignore_index rebuilds a clean 0..n-1 index, so Dataset.from_pandas()
# will not carry the old index along as an extra column.
data = pd.concat([data_mix1, data_mix2], axis=0, ignore_index=True)

4. Loading data: By using the from_pandas() method, I could easily convert data from a pandas DataFrame into a format that can be utilized with the datasets library. The datasets library provides standardized utilities for working with datasets.

from datasets import Dataset
custom_dataset = Dataset.from_pandas(data)

Let’s see custom_dataset.

5. Loading tokenizer

from transformers import AutoTokenizer
checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

6. Preprocessing Function: This function prepares the input data for the summarization task by adding a prefix that tells the model which task to perform, tokenizing the inputs and labels, and returning them in the proper format for training and evaluation. With truncation=True, the tokenizer cuts off any text that exceeds the maximum length set.

def preprocess_function(examples):
    prefix = "summarize: "
    # Prepend the task prefix so T5 knows this is a summarization task.
    inputs = [prefix + doc for doc in examples["sub"]]
    model_inputs = tokenizer(inputs, max_length=4096, truncation=True)
    # Tokenize the reference summaries as the training labels.
    labels = tokenizer(examples["summary"], max_length=2048, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Let’s view the object returned by preprocess_function and the keys it contains.
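The screenshot of the output appeared here in the original post; as a quick sketch, the keys can be inspected directly in code (the two-example slice is arbitrary):

sample = preprocess_function(custom_dataset[:2])
print(sample.keys())  # expected: input_ids, attention_mask, labels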

7. Tokenizing the Dataset: The preprocess_function is applied to the custom_dataset to tokenize and preprocess the data. Following that, the dataset is split into two subsets using the train_test_split() function. This split results in an 80% training set and a 20% test set.

tokenized_custom_dataset = custom_dataset.map(preprocess_function, batched=True)
tokenized_custom_dataset = tokenized_custom_dataset.train_test_split(test_size=0.2)

8. Loading Evaluation Metrics

import evaluate
rouge = evaluate.load("rouge")

9. Defining the Compute Metrics Function: The compute_metrics() function takes predicted and true labels, decodes them with the tokenizer, computes ROUGE scores, calculates the average generation length, and returns a dictionary containing the computed metrics.

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 sentinel used for ignored label positions before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average length (in tokens) of the generated summaries.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

10. Loading the Seq2Seq Model: The Seq2Seq model is used to transform one sequence of words into another sequence. The model consists of an encoder, which processes the input sequence, and a decoder, which generates the output sequence.

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

11. Creating the Data Collator: The data_collator dynamically pads and formats both the input data and the corresponding labels within each batch.

from transformers import DataCollatorForSeq2Seq
# Passing the loaded model (rather than the checkpoint name) lets the collator
# prepare decoder_input_ids from the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

12. Defining Training Arguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./BulletBriefT5",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

13. Creating the Trainer and train: All the necessary components are packed into the trainer, and the training process could begin!

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_custom_dataset["train"],
    eval_dataset=tokenized_custom_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

Note: This code was modified from the Hugging Face tutorial.

Model Validation: Is the model truly accurate?

  1. ROUGE:

Hmm… Yes, it is quite weird. Normally, ROUGE scores should improve and training loss should decrease from epoch to epoch, but in this case they seem unstable. After exploring some possibilities, I narrowed it down to the following factors:

  1. Overfitting: This occurs when the model becomes overly specialized in the training data and struggles to generalize to unseen data.
  2. Data quality and diversity: If the data contains inconsistencies or biases, the model may struggle to generalize effectively, leading to unstable ROUGE scores.

Personally, I think the root problem lies in the nature of my dataset. The videos cover a wide range of random topics, from the environment to self-love, which can lead to inconsistencies. Additionally, the reference summaries were produced by a generative AI rather than humans, so they can vary in format even when based on the same prompt. These variations may exceed my ability to thoroughly clean and standardize them.

As I mentioned, my model is an abstractive model, so I didn't expect that much TT

2. Human Evaluation: After I launched a Google Form on various Discord servers and other social media platforms, I received numerous responses from kind people, including both native and non-native English users. Here are the statistics:

Proportion of better model selection
  • 57.5% of all responses selected BulletBrief (model A) as it generates better summaries.

And here is a comment from my English teacher:

Model A can help readers understand the text easily as it is clear and concise while Model B is too short and misses some points so the readers may miss some important ideas. Additionally, Model A summarizes what has been talked but Model B mostly copies the texts from the passage.

All of these stats serve as evidence that my model can be helpful, even though its effectiveness may not be 100% guaranteed. However, I am already proud of the feedback I have received.

Error Analysis: What did I do?

After generating summaries from random YouTube videos using BulletBrief, here’s what I found:

  • Repeating bullet points.
  • Some summaries don’t have any bullet points (“-”), only sentences.
  • Many bullet points convey the same meaning but with different phrasing.

Examples of errors:

- and a video about anxiety
- Anxiety causes you to have gas problems and frequent urination
- Anxiety causes you to have an overactive gut and a tense stomach
- Anxiety causes you to have a tense gut and nervousness runs through your body
- Anxiety causes you to have a tense gut and nervousness runs through your body
- Anxiety causes you to have gas problems and frequent urination
Psych2Go is a community of Psychology experts who provide quality content tailored specifically for you!
- Compatibility is not a guarantee of longevity, but it is directly related to the quality and satisfaction of our relationships
- Compatibility is a key factor in relationships, but it can be difficult to distinguish normal from incompatibility
- Compatibility is a key factor in relationships, but it can be difficult to tell between normal and incompatibility
- Compatibility is a key factor in relationships, but it can be difficult to tell between normal and incompatibility
- Compatibility is a key factor in relationships, but it can be difficult to tell between incompatibility and compatibility
- Compatibility is a key factor in relationships, but it can be difficult to tell between incompatibility and incompatibility.

By analyzing these errors, I can identify areas for improvement and make necessary adjustments to enhance the quality and accuracy of the bullet point summaries.

Deployment: Last but not least.

To share this demo, I utilized Gradio and deployed it on Hugging Face Spaces. The design is user-friendly, requiring only a YouTube link as input; BulletBrief will then generate a bullet summary for you!
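Here is a minimal sketch of such an app, assuming a hypothetical generate_summary() helper that wraps the fine-tuned model (the actual Space code may differ):

import gradio as gr

def summarize_video(link):
    subtitle = get_subtitle(link)       # from the data-collection section
    return generate_summary(subtitle)   # hypothetical helper around the model

demo = gr.Interface(fn=summarize_video, inputs="text", outputs="text", title="BulletBrief")
demo.launch()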

And here is the final result:

You can check it out right here: https://huggingface.co/spaces/Darakarn/BulletBrief-t5base-modified

To enhance the quality and accuracy of the bullet point summaries, I performed the following cleaning steps during the deployment:

  1. Removed duplicates from the list of summaries.
  2. Checked whether each bullet point starts with a hyphen (“-”) and added one if it doesn’t.
  3. Used the TfidfVectorizer and cosine_similarity functions to convert the raw text into a matrix representation and filter out similar bullet points based on their similarity scores (see the sketch after this list).
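As a minimal sketch of step 3, assuming a similarity threshold of 0.8 (the threshold actually used in the app is not stated here):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_similar(bullets, threshold=0.8):  # threshold is an assumption
    # Vectorize all bullets, then keep a bullet only if it is not too
    # similar to any bullet that has already been kept.
    tfidf = TfidfVectorizer().fit_transform(bullets)
    sim = cosine_similarity(tfidf)
    kept = []
    for i in range(len(bullets)):
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
    return [bullets[i] for i in kept]

Any bullet whose cosine similarity to an already-kept bullet exceeds the threshold is dropped, which removes the near-duplicate phrasings seen in the error analysis.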

Before I go:

This is my first blog and even my first NLP model. I hope it will be helpful, even if just a little bit. If I have made any mistakes or if anything I mentioned is incorrect or improper, I apologize in advance. Most importantly, I would love to give a big thank you to my advisors, teachers, and especially AI Builders. Thank you to all the amazing people who made this project a reality instead of just a dream. And now, here is my journey ;)
