Collecting & Cleaning YouTube Transcripts
With over 500 hours of video uploaded to YouTube every minute, these videos make up a huge share of the data available in the world. This guide will walk you through how to collect a video’s transcript and clean the text so that it is ready for analysis.
Step 1: Save the YouTube Video
Creators on YouTube are constantly updating, changing, and deleting their content. Anyone working with video data sources should recognize this: the video you analyze today may not be available to view tomorrow. For the sake of analysis integrity, always preserve your raw data (in this case, the video) as it exists at the time of processing. So before beginning any transcript collection, take some time to download and store the videos in your analysis. There are many resources that can help you do this, like the YouTube Downloader.
If memory on your computer is a concern, these videos can be uploaded to a folder in Google Drive or another cloud storage platform.
Step 2: Download Video Transcripts
Once you’ve created a backup of all the videos you wish to analyze, youtube_dl is a fantastic package that allows you to download the subtitles YouTube has for each video.
After installing the package, you can use the function below to download the subtitles. It takes in a YouTube video’s URL and a language specification and saves the transcript as a VTT file named after the URL.
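A minimal sketch of such a function using youtube_dl’s Python API (the option keys are real youtube_dl options; the function name and file-naming choice are illustrative — the file is named after the video id, the unique part of the URL):

```python
def subtitle_options(lang):
    """Build a youtube_dl options dict that fetches only the subtitles."""
    return {
        "skip_download": True,       # keep only the subtitle file, not the video
        "writesubtitles": True,      # manually created subtitles, if present
        "writeautomaticsub": True,   # fall back to auto-generated captions
        "subtitleslangs": [lang],    # e.g. "en"
        "subtitlesformat": "vtt",
        "outtmpl": "%(id)s",         # file name = video id (the unique part of the URL)
    }

def download_subtitles(url, lang="en"):
    """Download one video's subtitles as a .vtt file."""
    import youtube_dl  # imported lazily so the module loads without the package installed
    with youtube_dl.YoutubeDL(subtitle_options(lang)) as ydl:
        ydl.download([url])
```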
Once downloaded, the video’s subtitles will be saved in a VTT file in this format:
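For reference, an auto-generated VTT file looks roughly like this (timestamps and caption text are illustrative):

```
WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:03.200
welcome back to the channel today we're

00:00:03.200 --> 00:00:06.800
going to talk about cleaning text data
```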
This format is easy to read and reference, but it contains some language errors and is not in a form conducive to natural language processing.
Step 3: Build a Data Dictionary
To work with the transcripts once the videos have been downloaded, it is useful to read them into your working notebook and store them in a data dictionary. The function below takes in a list of VTT file names and returns a data dictionary that contains each file name (video URL) as a key and the file contents (subtitles) as a value.
With that function you can create a dictionary of each video file and the corresponding subtitle transcript:
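One way to sketch that helper (the function and variable names are illustrative):

```python
import os

def build_transcript_dict(file_names):
    """Map each VTT file name (the video URL/id) to its raw subtitle text."""
    transcripts = {}
    for name in file_names:
        with open(name, "r", encoding="utf-8") as f:
            transcripts[name] = f.read()
    return transcripts

# Usage: gather every .vtt file in the working directory.
vtt_files = [f for f in os.listdir(".") if f.endswith(".vtt")]
transcript_dict = build_transcript_dict(vtt_files)
```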
Step 4: Build a Data Frame to Store Unprocessed & Cleaned Text
Next, you can build a data frame that contains the file name (URL), the original text detected by youtube_dl, and a column for the cleaned text.
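A sketch of that step, assuming a `transcript_dict` like the one from Step 3 (a tiny inline stand-in is used here, and the cleaning rules are a starting point rather than an exhaustive list):

```python
import re
import pandas as pd

def clean_vtt_text(raw):
    """Strip VTT headers, timestamp lines, and inline tags, leaving plain dialogue."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, the WEBVTT header block, and timestamp lines.
        if not line or line.startswith(("WEBVTT", "Kind:", "Language:")) or "-->" in line:
            continue
        # Remove inline tags such as <c> or <00:00:01.000>.
        kept.append(re.sub(r"<[^>]+>", "", line))
    return " ".join(kept)

# Illustrative stand-in for the dictionary built in Step 3.
transcript_dict = {
    "example.vtt": "WEBVTT\n\n00:00:00.000 --> 00:00:02.000\nhello <c>world</c>",
}

df = pd.DataFrame(
    {"url": list(transcript_dict.keys()),
     "raw_text": list(transcript_dict.values())}
)
df["clean_text"] = df["raw_text"].apply(lambda t: clean_vtt_text(t))
```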
Note: the red line in the editor indicates that the line of code is longer than the 79 characters recommended by the PEP 8 style guide. As a coder it’s always important to weigh readability when deciding between one longer line and multiple shorter lines. In this case, a single longer line was deemed more readable.
The above code applies basic cleaning to the text; additional cleaning may be needed to remove other escape codes or text-specific errors. To add those, follow the format outlined above, using a lambda function to apply changes to each row’s text within the data frame.
Step 5: Saving Your Data Frame to a CSV
Finally, you can save your data frame to a CSV so it can be examined in Sheets/Excel or referenced later in a different working notebook.
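In pandas this is a one-liner via `DataFrame.to_csv` (the file name and the tiny data frame below are illustrative; in practice you would use the data frame built in Step 4):

```python
import pandas as pd

# Stand-in for the data frame built in Step 4.
df = pd.DataFrame(
    {"url": ["example.vtt"],
     "raw_text": ["WEBVTT\nhello world"],
     "clean_text": ["hello world"]}
)

# index=False keeps pandas' row index out of the file so it reloads cleanly.
df.to_csv("video_transcripts.csv", index=False)
```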
And there you have it! A working data frame of video transcripts ready for Natural Language Processing analysis!