How to Scrape YouTube Videos (easy)

Jack Paczos
Published in Get that Data! · 8 min read · Feb 14, 2024

This post shows how you can easily get data from any YouTube channel.

Before you continue with this guide, make sure you adhere to YouTube’s terms of service and do not violate the content creators’ copyrights.

Agenda:

  1. Obstacles
  2. Step-by-Step Guide
  3. Complete Solution
  4. Saving to Database (optional)

1. Obstacles

As with any scraping project, there are some things that make it harder than it needs to be. In the case of scraping YouTube, these are the obstacles:

  • YouTube’s cookies modal
  • dynamic loading of items
  • no way to scroll down with the ‘End’ key

When we first access a YouTube URL, we are greeted by this cookies modal.

Cookies Modal

To solve this, we can use the browser automation library Playwright.

With Playwright, we can select and click the “Reject all” button, after which we are directed to our desired page.

However, there is still the dynamic loading issue: we cannot select all videos, because they only load after a user scrolls down.

Dynamic Loading

Luckily, we can use keyboard shortcuts to scroll down. Once we arrive at the bottom, we can access all the videos on a given page.

But how to scroll down?

Usually, a single press of the End key (fn + right arrow on macOS) would jump straight to the bottom. However, this does not work with YouTube’s infinite scrolling, because new videos keep loading in. So I coded a script that scrolls down step by step, pressing the key repeatedly, until we actually arrive at the bottom.

2. Step-by-Step Guide

I. Importing

from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json
  • rich: for pretty printing
  • playwright: for browser automation
  • time: to add delays
  • json: for saving the data as JSON
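
If you haven’t installed these dependencies yet, the standard commands (assuming pip and a default Python setup) are:

pip install rich playwright
playwright install chromium

The second command downloads the Chromium browser that Playwright drives.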

II. Preparing Variables

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []
  • url: (required) the YouTube channel you want to scrape
  • custom_headers: (optional) makes the requests look like they come from a regular browser
  • video_tresor: the list where we collect the video data
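
If you plan to reuse the script for different channels, a small sketch like this (hypothetical, using sys.argv) builds the url from a channel handle passed on the command line:

import sys

# hypothetical: run as `python scraper.py @TomBilyeu`
handle = sys.argv[1] if len(sys.argv) > 1 else '@TomBilyeu'
url = f'https://www.youtube.com/{handle}/videos'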

III. Launching Browser

def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)

        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
  • with Playwright, we create a browser window with a specified width, height, and custom headers
  • then we navigate to our URL
  • and we wait for the page to load

IV. Cookies Modal

        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()
  • we check if the cookies modal opened
  • if so, we locate the “Reject all” button and click it
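
Keep in mind that the button label depends on the consent page’s language. A hedged variant, assuming an English consent page and a recent Playwright version, uses the role-based locator and only clicks if the button exists:

# sketch: role-based locator; the 'Reject all' label assumes an English consent page
reject = page.get_by_role("button", name="Reject all")
if reject.count() > 0:
    reject.first.click()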

V. Scrolling down

        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True

        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)  # give the next batch of videos time to load
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()

            if videos_before == videos_after:
                more_to_load = False
  • we wait for the page to load
  • we set a variable called more_to_load to true
  • as long as it is true, we get the number of videos, scroll down, and get the number again
  • if both numbers are the same, we have reached the end
  • if not, we continue scrolling (an alternative scrolling approach is sketched below)
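
If the End key ever gives you trouble, Playwright’s mouse wheel works as an alternative scrolling mechanism (a sketch; the surrounding loop structure stays the same):

# alternative: scroll with the mouse wheel instead of the End key
page.mouse.wheel(0, 5000)   # scroll down by 5000 pixels
sleep(1.5)                  # give YouTube time to load the next batch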

VI. Selecting all Videos and their Data

        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)

            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')

            # title
            title = video.locator('h3 #video-title').text_content()

            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())

            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()

            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()

            video_data = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }

            video_tresor.append(video_data)
  • first, we select all video elements
  • then we loop over them
  • and select whatever data we need; the comments mark which piece of data each block extracts
  • we save everything into a dictionary called video_data
  • and append it to video_tresor, which holds all the video information (a more defensive views parser is sketched below)
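
One caveat: the views parsing above assumes the string always contains a number. A slightly more defensive helper (a sketch, assuming an English UI where strings look like “1.2K views” or “No views”) could look like this:

def parse_views(text):
    # sketch: convert YouTube view strings like '1.2K views' into numbers
    text = text.replace('views', '').strip().lower()
    if not text or text == 'no':  # 'No views' becomes 'no' after stripping
        return 0.0
    multipliers = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}
    if text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)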

VII. Closing Browser

        browser.close()
  • once we have everything in video_tresor, we can close the browser

VIII. Saving to File

with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)
  • this saves our video_tresor to a JSON file
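
If you prefer a spreadsheet-friendly format, the same data can be written as CSV instead (a sketch using Python’s built-in csv module; it assumes every dictionary has the same keys):

import csv

with open('videos.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['thumbnail', 'title', 'views', 'upload', 'duration'])
    writer.writeheader()
    writer.writerows(video_tresor)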


3. Complete Solution

Before you copy and run this code, make sure the url variable points to the correct YouTube channel (with /videos at the end) and that you have all dependencies installed.

I added some print and sleep calls here, so there is a smaller likelihood of errors.


from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []


def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)

        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)

        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()

        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True

        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()

            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')

        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)

            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")

            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")

            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")

            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")

            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")

            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)

        sleep(10)
        print(video_tresor)
        browser.close()


run_spider()

with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)

For a few minutes, you will see the scraped video data printed in your terminal; that means the script is working.

Afterwards, a file called videos.json should appear with all of the scraped data inside.
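
To quickly verify the output, you can load the file back in (a minimal check using the json module):

import json

with open('videos.json') as file:
    videos = json.load(file)
print(f"{len(videos)} videos scraped")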

4. Saving to Database (optional)

With a few easy steps, you can save it into your database.

I use MongoDB since it is a document database and extremely easy to use.

  1. Make an account and create a database (here is a guide).
  2. Create a file called .env and add the MongoDB connection string you created in the first step:

MONGO_STRING=mongodb+srv://user:password@clustername.mongodb.net/?retryWrites=true&w=majority

Note that this is just an example string.

3. Install pymongo and python-dotenv (pip install pymongo python-dotenv) and import the following at the top of your file

import pymongo
import os
import dotenv

4. Load and print the MONGO_STRING to check that it loads correctly

dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')
print(database_url)

5. Instead of step VIII use this code:

def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['database_name']
        collection = db['collection_name']

        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()


to_database()

  • database_name and collection_name are just placeholders for your real database and collection names
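
Inserting one document at a time is simple, but slow for large channels. pymongo’s insert_many can batch all the writes into far fewer round trips (a sketch with the same placeholder names):

def to_database():
    client = pymongo.MongoClient(database_url)
    try:
        collection = client['database_name']['collection_name']
        collection.insert_many(video_tresor)  # one batched call instead of one per video
    finally:
        client.close()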

6. Complete Code



from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import pymongo
import os
import dotenv

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []
dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')

print(database_url)


def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)

        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)

        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()

        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True

        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()

            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')

        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)

            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")

            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")

            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")

            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")

            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")

            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)

        sleep(10)
        print(video_tresor)
        browser.close()


def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['hotbooks']
        collection = db['videos']

        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()


run_spider()
to_database()
Database insertions

Now we can easily slurp up all the data we want, and we have it automatically saved into our db.

Woohoo!
