How to Scrape YouTube Videos (Easy)
This post shows how you can easily get data from any YouTube channel.
Before you follow this guide, make sure you adhere to YouTube's terms of service and respect the copyright of the content creators.
Agenda:
- Obstacles
- Step-by-Step Guide
- Complete Solution
- Saving to Database (optional)
1. Obstacles
As with any scraping project, there are a few things that make it harder than it needs to be. In the case of YouTube, these are the obstacles:
- YouTube Cookies Modal
- Dynamic Loading of Items
- No way to scroll straight to the bottom (the 'End' key alone doesn't work)
When we first access a YouTube URL, we are greeted by a cookie consent modal.
To get past it, we can use the browser automation library Playwright.
With Playwright we can select and click the 'Reject all' button, after which we land on our desired page.
However, that still leaves the dynamic loading issue: we cannot select all videos, because they only load as the user scrolls down.
Luckily, we can scroll with keyboard shortcuts. Once we reach the bottom, every video on the page is accessible.
But how do we scroll down?
Normally a single press of the End key (fn + right arrow on macOS) would do it. That doesn't work with YouTube's infinite scroll, though, so I wrote a loop that keeps pressing End until no new videos load.
2. Step-by-Step Guide
I. Importing
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json
- rich: for pretty printing
- playwright: for browser automation (install with pip install playwright rich, then run playwright install chromium once to download the browser binary)
- time: to add delays
- json: to encode the results as JSON
II. Preparing Variables
url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []
- url: (required) the YouTube channel you want to scrape, with /videos at the end
- custom_headers: (optional) makes our requests look like they come from a regular browser
- video_tresor: the list where we collect the video data
III. Launching Browser
def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
- with Playwright, we open a browser window with the specified width, height, and custom headers
- then we navigate to our URL
- and wait for the page to finish loading
All snippets in the following steps live inside this with block.
IV. Cookies Modal
title = page.title()
if title == "Before you continue to YouTube":
    button = page.locator('button[aria-label="Reject all"]').first
    button.click()
- we check whether the cookie modal opened by looking at the page title
- if it did, we locate the 'Reject all' button and click it
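One caveat: the title check only matches English locales. A hedged alternative is to look for the reject button directly and click it only if it exists (this still assumes the aria-label reads "Reject all" in your locale):
reject = page.locator('button[aria-label="Reject all"]')
if reject.count() > 0:  # the modal may not appear at all, e.g. outside the EU
    reject.first.click()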
V. Scrolling down
page.wait_for_load_state('networkidle')
page.focus("body")
more_to_load = True
while more_to_load:
    videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
    page.keyboard.press('End')
    page.keyboard.press('End')
    page.keyboard.press('End')
    sleep(1.5)
    videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
    if videos_before == videos_after:
        more_to_load = False
- we wait for the page to load and focus the body so our key presses reach it
- we set a variable called more_to_load to True
- as long as it is True, we count the videos, press End a few times, wait briefly, and count again
- if both counts are the same, it means we have reached the end and we stop
- if not, we keep scrolling (a capped variant is sketched below)
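On channels with thousands of videos this loop can run for a long time, and a single slow network response can end it early. Here is a sketch of a more defensive variant with a hard upper bound; MAX_ROUNDS is my own placeholder, not part of the original script:
MAX_ROUNDS = 500  # placeholder cap; tune for the channel size you expect
selector = '#content.style-scope.ytd-rich-item-renderer'
for _ in range(MAX_ROUNDS):
    videos_before = page.locator(selector).count()
    page.keyboard.press('End')
    sleep(1.5)  # give the lazy loader time to fetch the next batch
    if page.locator(selector).count() == videos_before:
        break  # no new videos appeared, so we are at the bottom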
VI. Selecting all Videos and their Data
videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
for idx in range(videos.count()):
    video = videos.nth(idx)
    # thumbnail
    thumbnail_image = video.locator('#thumbnail img').first
    thumbnail = thumbnail_image.get_attribute('src')
    # title
    title = video.locator('h3 #video-title').text_content()
    # views
    views = video.locator('#metadata #metadata-line span').first.text_content()
    views = views.replace('views', '').strip().lower()
    if 'k' in views:
        views = float(views.replace('k', '').strip()) * 1000
    elif 'm' in views:
        views = float(views.replace('m', '').strip()) * 1000000
    else:
        views = float(views.strip())
    # upload date
    upload = video.locator('#metadata #metadata-line span').last.text_content()
    # duration
    duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
    duration = video.locator(duration_selector).first.text_content()
    duration = duration.replace('\n', '').strip()
    video_data = {
        'thumbnail': thumbnail,
        'title': title,
        'views': views,
        'upload': upload,
        'duration': duration
    }
    video_tresor.append(video_data)
- first, we select all video elements
- then we loop over them and extract the data we need; the comments mark which field each block grabs
- the view count comes as text like '1.2M views', so we strip the suffix and convert it to a number
- we collect everything in a dictionary called video_data
- and append it to video_tresor, which holds all the video information
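The inline k/m handling works for typical labels, but it will crash on videos that show 'No views' and it ignores a possible 'B' suffix. If you hit those cases, a defensive helper along these lines can replace the if/elif block; the exact label formats are an assumption and vary by locale:
def parse_views(label: str) -> float:
    """Parse view labels such as '1.2M views', '850K views' or 'No views'."""
    label = label.replace('views', '').strip().lower()
    if not label or label == 'no':
        return 0.0
    multipliers = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}
    if label[-1] in multipliers:
        return float(label[:-1]) * multipliers[label[-1]]
    return float(label.replace(',', ''))

# usage inside the loop:
# views = parse_views(video.locator('#metadata #metadata-line span').first.text_content())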
VII. Closing Browser
browser.close()
- once everything is in video_tresor, we can close the browser
VIII. Saving to File
with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)
- this saves our video_tresor to a JSON file (a CSV variant is sketched below)
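If you would rather open the results in a spreadsheet, a minimal CSV variant using only the standard library looks like this:
import csv

with open('videos.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['thumbnail', 'title', 'views', 'upload', 'duration'])
    writer.writeheader()  # column names in the first row
    writer.writerows(video_tresor)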
3. Complete Solution
Before you copy and run this code, make sure the url variable points to the YouTube channel you want, with /videos at the end, and that all dependencies are installed.
I added some print and sleep calls here, so the run is easier to follow and less likely to fail on slow loads.
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []

def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()
        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True
        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')
        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)
            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")
            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")
            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")
            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")
            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")
            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)
        sleep(10)
        print(video_tresor)
        browser.close()

run_spider()

with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)
For a few minutes you will see the scraped video data printed in your terminal; that means the script is working.
When it finishes, a file called videos.json appears with all of the scraped data inside.
4. Saving to Database (optional)
With a few easy steps, you can save the data into your database instead.
I use MongoDB, since it is a document database and extremely easy to use.
1. Make an account and create a database (MongoDB's own documentation walks you through this).
2. Create a file called .env and add the MongoDB connection string from the first step:
MONGO_STRING=mongodb+srv://user:password@clustername.mongodb.net/?retryWrites=true&w=majority
Note that this is just an example string.
3. Install and import the following at the top of your file (pip install pymongo python-dotenv):
import pymongo
import os
import dotenv
4. Load and print the MONGO_STRING to check that it imports correctly:
dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')
print(database_url)
5. Instead of step VIII use this code:
def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['database_name']
        collection = db['collection_name']
        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()
- database_name and collection_name are placeholders; replace them with your real database and collection names
to_database()
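To double-check that the inserts landed, you can count the documents afterwards (same placeholder names as above):
client = pymongo.MongoClient(database_url)
try:
    count = client['database_name']['collection_name'].count_documents({})
    print(f'{count} documents in collection, {len(video_tresor)} scraped')
finally:
    client.close()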
6. Complete Code
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import pymongo
import os
import dotenv

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []

dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')
print(database_url)

def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()
        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True
        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')
        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)
            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")
            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")
            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")
            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")
            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")
            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)
        sleep(10)
        print(video_tresor)
        browser.close()

def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['database_name']  # replace with your database name
        collection = db['collection_name']  # replace with your collection name
        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()

run_spider()
to_database()
Now we can easily slurp up all the data we want and have it saved automatically to our database.
Woohoo!