YouTube Data Extraction: Easily Scrape All Videos and Comments from a Channel

Luc Pimentel - AI Insights
Published in AI Creators · 3 min read · Jul 1, 2024

YouTube is one of the biggest platforms for content creation and consumption.

And whether you’re a data scientist, market researcher, or content creator, the ability to extract comprehensive data from a YouTube channel can provide invaluable insights.

This tutorial introduces a powerful yet straightforward method to scrape all videos and comments from any YouTube channel.

Using Python and the YouTube Data API v3, we’ll build a robust scraper that can:

1. Fetch all videos from a specified channel

2. Extract detailed information for each video

3. Collect all comments and replies for every video

4. Store the data in an easily analyzable format

Our approach prioritizes efficiency and simplicity, making it accessible for programmers of all skill levels. By the end of this guide, you’ll have a versatile tool at your disposal, capable of gathering extensive YouTube data with minimal effort.

Let’s dive into the world of YouTube data extraction and unlock the potential of channel-wide content analysis!

For this tutorial, all you’ll need is a YouTube Data API v3 key and the api_crawler package, which you can install with `pip install api_crawler`.

PS: Visit this link for the code notebook.

Setting Up

First, let’s import the necessary libraries and set up our API key:

from api_crawler import YoutubeAPI
import time
import pandas as pd


# OpenAI's YouTube channel ID
channels_to_monitor = ['UCXZCJLdBC09xxGZ6gcdrc6A']

YOUTUBE_V3_API_KEY = 'YOUR_YOUTUBE_DATA_V3_API_KEY'

youtube_api = YoutubeAPI(api_key=YOUTUBE_V3_API_KEY)

Make sure to replace `YOUR_YOUTUBE_DATA_V3_API_KEY` with your actual API key.
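If you'd rather not hardcode the key in a notebook, a minimal alternative is to read it from an environment variable. This is a sketch, assuming you've exported YOUTUBE_V3_API_KEY in your shell beforehand:

import os
from api_crawler import YoutubeAPI

# Read the key from the environment instead of pasting it into the notebook
# (assumes: export YOUTUBE_V3_API_KEY=... was run beforehand)
YOUTUBE_V3_API_KEY = os.environ['YOUTUBE_V3_API_KEY']
youtube_api = YoutubeAPI(api_key=YOUTUBE_V3_API_KEY)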

Fetching Videos from the Channel

Now that we’ve set up our environment, let’s start by fetching all the videos from our target channel:

videos = youtube_api.get_videos_from_channel(channels_to_monitor[0], get_full_info=True)

This code retrieves all videos from the specified channel, including detailed information for each video.
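A quick sanity check confirms the fetch worked. The field names here are assumptions, inferred from the columns we flatten later in this tutorial:

# Rough sanity check (the 'title' and 'publishDate' keys are assumed from
# the flattening step later in this post)
print(f'Fetched {len(videos)} videos')
print(videos[0].get('title'), '|', videos[0].get('publishDate'))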

Collecting Comments for Each Video

Next, we’ll iterate through each video and collect all comments and replies:

for i, video in enumerate(videos):
    comments = youtube_api.get_all_comments(video['id'], include_metadata=True, include_replies=False)

    # Only comment threads that actually have replies need a follow-up call
    comment_threads = [comment['id'] for comment in comments
                       if comment.get('snippet', {}).get('totalReplyCount', 0) > 0]

    for comment_thread in comment_threads:
        youtube_api.get_all_comment_replies(comment_thread)

    # Pause periodically to stay under the API quota
    if i % 25 == 0:
        print(f'Processed {i} videos')
        time.sleep(60 * 3)

This loop collects all top-level comments for each video, then fetches replies for the threads that have them. We pause for three minutes every 25 videos to stay under the API’s rate limits, which are easy to hit at this request volume.
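If you'd rather not rely on a fixed pause, below is a hedged sketch of an exponential-backoff wrapper. The api_crawler package's exact exception types aren't documented here, so it catches a generic Exception; treat it as a starting point, not the package's own retry mechanism:

import random
import time

def with_backoff(fn, *args, max_retries=5, **kwargs):
    """Retry an API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:  # the package's quota/rate-limit exceptions are unknown here
            wait = (2 ** attempt) * 10 + random.uniform(0, 5)
            print(f'Attempt {attempt + 1} failed, retrying in {wait:.0f}s')
            time.sleep(wait)
    raise RuntimeError('Max retries exceeded')

# Usage, e.g.:
# comments = with_backoff(youtube_api.get_all_comments, video['id'], include_metadata=True)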

Processing the Collected Data

After collecting the data, we’ll process it into pandas DataFrames for easy analysis.

Notice that, in the code above, I am not saving the output to a list. That’s because the `api_crawler` package already saves every response to a local JSON file, which serves as a data lake.

You could save the output to a list, and that would not be wrong, but all of the context and metadata would be lost as soon as you leave the notebook. That’s why I prefer the local JSON data lake.
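If you want to see what lands in the lake, you can peek at one raw entry first. Each entry is assumed to hold the call's result under an 'output' key, which is what the processing code below relies on:

from api_crawler.data_lake import read_log

# Inspect one raw log entry (assumes each entry stores the API response
# under an 'output' key, as the flattening code below expects)
first_entry = next(iter(read_log('lake/json_lakes/YoutubeAPI_get_videos_from_channel.json')))
print(first_entry.keys())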

In the code below, we read the JSON data lake and process the data into a pandas DataFrame:

from api_crawler.data_lake import read_log

videos_log = [video
              for output in read_log('lake/json_lakes/YoutubeAPI_get_videos_from_channel.json')
              if output.get('output') is not None
              for video in output['output']]

videos_df = pd.json_normalize(videos_log)[['id', 'channel.id', 'channel.name', 'title',
                                           'publishDate', 'duration.secondsText', 'viewCount.text']]
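Most of these fields arrive as strings, so a bit of type conversion makes the table easier to sort and filter. The exact formats of `publishDate` and `viewCount.text` aren't documented, hence the defensive `errors='coerce'`:

# Convert string fields to proper types; coerce anything unparseable to NaN/NaT
videos_df['publishDate'] = pd.to_datetime(videos_df['publishDate'], errors='coerce')
videos_df['viewCount.text'] = pd.to_numeric(
    videos_df['viewCount.text'].astype(str).str.replace(r'[^0-9]', '', regex=True),
    errors='coerce')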

Next, we process the comment data the same way:

comments_log = [comment
                for output in read_log('lake/json_lakes/YoutubeAPI_get_all_comments.json')
                if output.get('output') is not None
                for comment in output['output']]

comments_df = pd.json_normalize(comments_log)[['id', 'snippet.videoId',
                                               'snippet.topLevelComment.snippet.authorDisplayName',
                                               'snippet.topLevelComment.snippet.textDisplay',
                                               'snippet.topLevelComment.snippet.publishedAt',
                                               'snippet.topLevelComment.snippet.likeCount',
                                               'snippet.totalReplyCount']]
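With the comments flattened, quick summaries come cheap. For example, the most active top-level commenters (column names as selected above):

# Ten most frequent top-level commenters on the channel
top_commenters = (comments_df['snippet.topLevelComment.snippet.authorDisplayName']
                  .value_counts()
                  .head(10))
print(top_commenters)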

Then, we process the comment replies:

flattened_replies = [reply
                     for comment in read_log('lake/json_lakes/YoutubeAPI_get_all_comment_replies.json')
                     if comment.get('output')  # skip entries whose call returned no output
                     for reply in comment['output']]

comment_thread_replies_df = pd.json_normalize(flattened_replies)[['id', 'snippet.parentId',
                                                                  'snippet.authorDisplayName',
                                                                  'snippet.textOriginal',
                                                                  'snippet.publishedAt',
                                                                  'snippet.likeCount']]
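If you want each reply alongside the comment it answers, a merge on the parent ID does the trick. This sketch assumes `snippet.parentId` matches the top-level comment `id`, which is how the YouTube Data API threads comments:

# Attach the parent comment's text to each reply via the thread's parent ID
replies_with_parents = comment_thread_replies_df.merge(
    comments_df[['id', 'snippet.topLevelComment.snippet.textDisplay']],
    left_on='snippet.parentId', right_on='id', suffixes=('', '_parent'))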

Saving the Results

Finally, we’ll save our processed data to Excel files for further analysis:

videos_df.to_excel('openai_yt_videos.xlsx')
comments_df.to_excel('openai_yt_comments.xlsx')
comment_thread_replies_df.to_excel('openai_yt_comment_thread_replies.xlsx')
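One note: pandas’ `to_excel` needs an Excel engine such as `openpyxl` installed (`pip install openpyxl`); if you’d rather avoid the extra dependency, `to_csv` works out of the box.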

And there you have it!

You’ve successfully scraped all videos and comments from a YouTube channel. The data is now ready for your analysis, whether you’re looking at content trends, engagement metrics, or performing sentiment analysis on comments.
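As a taste of the engagement angle, here is a minimal sketch that counts comments per video and joins the counts onto the video table (it assumes the column names produced above):

# Count top-level comments per video and attach them to the video table
comments_per_video = (comments_df.groupby('snippet.videoId')
                      .size()
                      .rename('comment_count'))

engagement_df = videos_df.merge(comments_per_video, left_on='id',
                                right_index=True, how='left')
print(engagement_df[['title', 'viewCount.text', 'comment_count']].head())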

To scrape a different channel, just swap its channel ID into the `channels_to_monitor` list (or loop over the list if you add several) and run the code again.
