Collecting & Cleaning YouTube Transcripts

Kristen Davis
Social Impact Analytics
4 min read · Apr 22, 2021

With over 500 hours of video uploaded to YouTube every minute worldwide, video accounts for a huge portion of the data available to analysts. This guide will walk you through how to collect a video's transcript and clean the text so that it is ready for you to analyze.

Step 1: Save the YouTube Video

Creators on YouTube are constantly updating, changing, and deleting their content. This is important for anyone working with video data sources to recognize, as the video you analyze today may not be available to view tomorrow. For analysis integrity, it is important to always preserve your raw data (in this case, the video) as it exists at the time of processing. So before beginning any transcript collection, take some time to download and store the videos in your analysis. There are many resources out there that will help you do this, like the YouTube Downloader.

If storage space on your computer is a concern, these videos can be uploaded to a folder in Google Drive or another cloud storage platform.

Step 2: Download Video Transcripts

Once you've created a backup of all the videos you wish to analyze, youtube_dl is a fantastic package that allows you to download the subtitles YouTube has for a video.

pip install youtube_dl

After installing the package you can use the below function to download the subtitles. This function takes in a YouTube video’s url and a language specification and returns a downloaded transcript in a VTT file with the URL as the file name.

Download video subtitles
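The function from the screenshot is not reproduced here, but a minimal sketch of it might look like the following. The function names and the option values are assumptions; youtube_dl's YoutubeDL class does accept option dictionaries of this shape.

```python
def subtitle_options(url, lang):
    """youtube_dl options for grabbing only a video's subtitles."""
    return {
        "skip_download": True,      # subtitles only, no video file
        "writesubtitles": True,     # creator-uploaded subtitles, if any
        "writeautomaticsub": True,  # fall back to auto-generated captions
        "subtitleslangs": [lang],
        "subtitlesformat": "vtt",
        "outtmpl": url,             # use the URL as the file name
    }

def download_subtitles(url, lang="en"):
    """Save a VTT transcript named after the URL in the working directory."""
    import youtube_dl  # pip install youtube_dl
    with youtube_dl.YoutubeDL(subtitle_options(url, lang)) as ydl:
        ydl.download([url])
```

Note that a raw URL contains characters (slashes, colons) that are awkward in file names, so in practice you may prefer to sanitize the URL or use youtube_dl's %(id)s output template instead.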

Once downloaded, the video's subtitles will be saved in a VTT file in this format:

VTT file example
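For reference, an auto-generated VTT transcript has roughly this shape (the cue text below is illustrative, not from a real video):

```
WEBVTT
Kind: captions
Language: en

00:00:01.260 --> 00:00:03.450
with over 500 hours of video uploaded

00:00:03.450 --> 00:00:05.310
to youtube every minute worldwide
```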

This format is easy to read and reference, but it contains some language errors, and it is not structured in a way that is conducive to natural language processing.

Step 3: Build a Data Dictionary

In order to work with the transcripts once the videos have been downloaded, it is useful to read them into your working notebook and store them in a data dictionary. The function below takes in a list of file names stored as VTT files and returns a data dictionary that contains each file name (video URL) as a key and the file contents (subtitles) as a value.

Function that reads in VTT files and returns a data dictionary
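A sketch of such a function might look like this (the name read_vtt_files is an assumption, not the name from the original screenshot):

```python
def read_vtt_files(file_names):
    """Map each VTT file name (the video URL) to its raw subtitle text."""
    transcripts = {}
    for name in file_names:
        with open(name, "r", encoding="utf-8") as f:
            transcripts[name] = f.read()
    return transcripts
```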

With that function you can create a dictionary of each video file and the corresponding subtitle transcript:

Process VTT files using custom function
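For example, if every downloaded transcript sits in the working directory, the dictionary can be built in one pass (this is a self-contained variant of the step above; the names and the glob pattern are illustrative):

```python
import glob
from pathlib import Path

def collect_transcripts(pattern="*.vtt"):
    """Data dictionary: file name (video URL) -> raw subtitle text."""
    return {name: Path(name).read_text(encoding="utf-8")
            for name in glob.glob(pattern)}

transcripts = collect_transcripts()
```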

Step 4: Build a Data Frame to Store Unprocessed & Cleaned Text

Finally, you can build a data frame that contains the file name (URL), the original text detected by youtube_dl, and a column containing the cleaned text.

Build a dataframe with url, text and clean text columns
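As a sketch, the cleaning step and data frame construction could look like the following. The column names match the article, but the regular expressions are assumptions about what "basic cleaning" covers (VTT header, cue timings, inline tags, extra whitespace):

```python
import re
import pandas as pd

def clean_text(text):
    """Strip the VTT header, cue timings, inline tags, and extra whitespace."""
    text = re.sub(r"WEBVTT.*?\n\n", "", text, flags=re.DOTALL)  # header block
    text = re.sub(r"\d{2}:\d{2}:\d{2}\.\d{3} -->.*", "", text)  # cue timings
    text = re.sub(r"<[^>]+>", "", text)                         # inline tags
    return re.sub(r"\s+", " ", text).strip()                    # whitespace

def build_transcript_frame(transcripts):
    """Data frame with the URL, raw text, and cleaned text for each video."""
    df = pd.DataFrame(list(transcripts.items()), columns=["url", "text"])
    df["clean_text"] = df["text"].apply(lambda t: clean_text(t))
    return df
```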

Note: the red line indicates that the code string is longer than the 79 characters recommended by PEP 8. As a coder, it is always important to consider readability when programming and whether a longer line or multiple lines will lead to better readability. In this case, a single longer line of code was deemed better from a readability standpoint.

The above code applies basic cleaning to the text; additional cleaning to remove other escape codes or text-specific errors may be needed. To add those in, follow the format outlined above, using the lambda function to apply changes to each row's text within the data frame.
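For instance, HTML escape codes such as &amp;amp; sometimes survive in auto-generated captions. Following the same lambda pattern, they could be removed with the standard library's html module (a hypothetical extra step, with a stand-in data frame):

```python
import html
import pandas as pd

# Stand-in for the transcript frame built above
df = pd.DataFrame({"clean_text": ["rock &amp; roll"]})

# Apply one more cleaning step to each row's text
df["clean_text"] = df["clean_text"].apply(lambda t: html.unescape(t))
```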

Step 5: Saving Your Data Frame to a CSV

Finally, you can save your data frame to a CSV to be examined in Sheets/Excel or referenced later in a different working notebook.

Save dataframe to CSV file
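The save itself is one line of pandas; the file name below is just an example, and the small frame stands in for the one built in step 4:

```python
import pandas as pd

# Stand-in for the transcript frame built in step 4
df = pd.DataFrame({"url": ["https://youtu.be/x"],
                   "text": ["raw text"],
                   "clean_text": ["clean text"]})

# index=False keeps pandas' row index out of the CSV
df.to_csv("video_transcripts.csv", index=False)
```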

And there you have it! A working data frame of video transcripts ready for Natural Language Processing analysis!
