How to Scrape YouTube Videos (Easy)
This post shows how you can easily get data from any YouTube channel.
Before you follow this guide, make sure you adhere to YouTube's terms of service and respect the copyright of the content creators.
Agenda:
- Obstacles
- Step-by-Step Guide
- Complete Solution
- Saving to Database (optional)
1. Obstacles
As with any scraping project, there are a few things that make it harder than it needs to be. In the case of YouTube, these are the obstacles:
- YouTube Cookies Modal
- Dynamic Loading of Items
- No way to scroll straight to the bottom (the 'End' key alone doesn't work)
When we first access a YouTube URL, we are greeted by a cookie consent modal.
To get past it, we can use the browser automation library Playwright.
With Playwright we can select and click the 'Reject all' button, after which we land on our desired page.
However, that still leaves the dynamic loading issue: we cannot select all videos, because they only load as the user scrolls down.
Luckily, we can scroll with keyboard shortcuts. Once we reach the bottom, every video on the page is accessible.
But how do we scroll down?
Normally a single press of the End key (fn + right arrow on macOS) would do it. That doesn't work with YouTube's infinite scroll, though, so I wrote a loop that keeps pressing End until no new videos load.
2. Step-by-Step Guide
I. Importing
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json
- rich: for pretty printing
- playwright: for browser automation (install with pip install playwright rich, then run playwright install chromium once to download the browser binary)
- time: to add delays
- json: to encode the results as JSON
II. Preparing Variables
url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []
- url: (required) the YouTube channel you want to scrape, with /videos at the end
- custom_headers: (optional) makes our requests look like they come from a regular browser
- video_tresor: the list where we collect the video data
III. Launching Browser
def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
- with Playwright, we open a browser window with the specified width, height, and custom headers
- then we navigate to our URL
- and wait for the page to finish loading
All snippets in the following steps live inside this with block.
IV. Cookies Modal
title = page.title()
if title == "Before you continue to YouTube":
    button = page.locator('button[aria-label="Reject all"]').first
    button.click()
- we check whether the cookie modal opened by looking at the page title
- if it did, we locate the 'Reject all' button and click it
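One caveat: the title check only matches English locales. A hedged alternative is to look for the reject button directly and click it only if it exists (this still assumes the aria-label reads "Reject all" in your locale):
reject = page.locator('button[aria-label="Reject all"]')
if reject.count() > 0:  # the modal may not appear at all, e.g. outside the EU
    reject.first.click()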
V. Scrolling down
page.wait_for_load_state('networkidle')
page.focus("body")
more_to_load = True
while more_to_load:
    videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
    page.keyboard.press('End')
    page.keyboard.press('End')
    page.keyboard.press('End')
    sleep(1.5)
    videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
    if videos_before == videos_after:
        more_to_load = False
- we wait for the page to load and focus the body so our key presses reach it
- we set a variable called more_to_load to True
- as long as it is True, we count the videos, press End a few times, wait briefly, and count again
- if both counts are the same, it means we have reached the end and we stop
- if not, we keep scrolling (a capped variant is sketched below)
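On channels with thousands of videos this loop can run for a long time, and a single slow network response can end it early. Here is a sketch of a more defensive variant with a hard upper bound; MAX_ROUNDS is my own placeholder, not part of the original script:
MAX_ROUNDS = 500  # placeholder cap; tune for the channel size you expect
selector = '#content.style-scope.ytd-rich-item-renderer'
for _ in range(MAX_ROUNDS):
    videos_before = page.locator(selector).count()
    page.keyboard.press('End')
    sleep(1.5)  # give the lazy loader time to fetch the next batch
    if page.locator(selector).count() == videos_before:
        break  # no new videos appeared, so we are at the bottom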
VI. Selecting all Videos and their Data
videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
for idx in range(videos.count()):
    video = videos.nth(idx)
    # thumbnail
    thumbnail_image = video.locator('#thumbnail img').first
    thumbnail = thumbnail_image.get_attribute('src')
    # title
    title = video.locator('h3 #video-title').text_content()
    # views
    views = video.locator('#metadata #metadata-line span').first.text_content()
    views = views.replace('views', '').strip().lower()
    if 'k' in views:
        views = float(views.replace('k', '').strip()) * 1000
    elif 'm' in views:
        views = float(views.replace('m', '').strip()) * 1000000
    else:
        views = float(views.strip())
    # upload date
    upload = video.locator('#metadata #metadata-line span').last.text_content()
    # duration
    duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
    duration = video.locator(duration_selector).first.text_content()
    duration = duration.replace('\n', '').strip()
    video_data = {
        'thumbnail': thumbnail,
        'title': title,
        'views': views,
        'upload': upload,
        'duration': duration
    }
    video_tresor.append(video_data)
- first, we select all video elements
- then we loop over them and extract the data we need; the comments mark which field each block grabs
- the view count comes as text like '1.2M views', so we strip the suffix and convert it to a number
- we collect everything in a dictionary called video_data
- and append it to video_tresor, which holds all the video information
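The inline k/m handling works for typical labels, but it will crash on videos that show 'No views' and it ignores a possible 'B' suffix. If you hit those cases, a defensive helper along these lines can replace the if/elif block; the exact label formats are an assumption and vary by locale:
def parse_views(label: str) -> float:
    """Parse view labels such as '1.2M views', '850K views' or 'No views'."""
    label = label.replace('views', '').strip().lower()
    if not label or label == 'no':
        return 0.0
    multipliers = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}
    if label[-1] in multipliers:
        return float(label[:-1]) * multipliers[label[-1]]
    return float(label.replace(',', ''))

# usage inside the loop:
# views = parse_views(video.locator('#metadata #metadata-line span').first.text_content())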
VII. Closing Browser
browser.close()
- once everything is in video_tresor, we can close the browser
VIII. Saving to File
with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)
- this saves our video_tresor to a JSON file (a CSV variant is sketched below)
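If you would rather open the results in a spreadsheet, a minimal CSV variant using only the standard library looks like this:
import csv

with open('videos.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['thumbnail', 'title', 'views', 'upload', 'duration'])
    writer.writeheader()  # column names in the first row
    writer.writerows(video_tresor)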
3. Complete Solution
Before you copy and run this code, make sure the url variable points to the YouTube channel you want, with /videos at the end, and that all dependencies are installed.
I added some print and sleep calls here, so the run is easier to follow and less likely to fail on slow loads.
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import json

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []

def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()
        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True
        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')
        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)
            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")
            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")
            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")
            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")
            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")
            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)
        sleep(10)
        print(video_tresor)
        browser.close()

run_spider()

with open('videos.json', 'w') as file:
    json.dump(video_tresor, file, indent=4)
For a few minutes you will see the scraped video data printed in your terminal; that means the script is working.
When it finishes, a file called videos.json appears with all of the scraped data inside.
4. Saving to Database (optional)
With a few easy steps, you can save the data into your database instead.
I use MongoDB, since it is a document database and extremely easy to use.
1. Make an account and create a database (MongoDB's own documentation walks you through this).
2. Create a file called .env and add the MongoDB connection string from the first step:
MONGO_STRING=mongodb+srv://user:password@clustername.mongodb.net/?retryWrites=true&w=majority
Note that this is just an example string.
3. Install and import the following at the top of your file (pip install pymongo python-dotenv):
import pymongo
import os
import dotenv
4. Load and print the MONGO_STRING to check that it imports correctly:
dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')
print(database_url)
5. Instead of step VIII use this code:
def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['database_name']
        collection = db['collection_name']
        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()
- database_name and collection_name are placeholders; replace them with your real database and collection names
to_database()
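To double-check that the inserts landed, you can count the documents afterwards (same placeholder names as above):
client = pymongo.MongoClient(database_url)
try:
    count = client['database_name']['collection_name'].count_documents({})
    print(f'{count} documents in collection, {len(video_tresor)} scraped')
finally:
    client.close()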
6. Complete Code
from rich import print
from playwright.sync_api import sync_playwright
from time import sleep
import pymongo
import os
import dotenv

url = 'https://www.youtube.com/@TomBilyeu/videos'
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
}
video_tresor = []

dotenv.load_dotenv()
database_url = os.getenv('MONGO_STRING')
print(database_url)

def run_spider():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.set_viewport_size({"width": 1280, "height": 1080})
        page.set_extra_http_headers(custom_headers)
        page.context.set_default_timeout(60000)
        page.goto(url)
        page.wait_for_load_state('networkidle')
        sleep(5)
        # cookies modal
        title = page.title()
        if title == "Before you continue to YouTube":
            button = page.locator('button[aria-label="Reject all"]').first
            button.click()
        # scrolling down
        page.wait_for_load_state('networkidle')
        page.focus("body")
        more_to_load = True
        while more_to_load:
            videos_before = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            page.keyboard.press('End')
            page.keyboard.press('End')
            page.keyboard.press('End')
            sleep(1.5)
            videos_after = page.locator('#content.style-scope.ytd-rich-item-renderer').count()
            print("videos before", videos_before)
            print("videos after", videos_after)
            if videos_before == videos_after:
                more_to_load = False
                print('we reached the end')
        # selecting all videos and their data
        videos = page.locator('#content.style-scope.ytd-rich-item-renderer')
        for idx in range(videos.count()):
            video = videos.nth(idx)
            # thumbnail
            thumbnail_image = video.locator('#thumbnail img').first
            thumbnail = thumbnail_image.get_attribute('src')
            print(f"thumbnail {thumbnail}")
            # title
            title = video.locator('h3 #video-title').text_content()
            print(f"title {title}")
            # views
            views = video.locator('#metadata #metadata-line span').first.text_content()
            views = views.replace('views', '').strip().lower()
            if 'k' in views:
                views = float(views.replace('k', '').strip()) * 1000
            elif 'm' in views:
                views = float(views.replace('m', '').strip()) * 1000000
            else:
                views = float(views.strip())
            print(f"views {views}")
            # upload date
            upload = video.locator('#metadata #metadata-line span').last.text_content()
            print(f"upload {upload}")
            # duration
            duration_selector = '#thumbnail #overlays ytd-thumbnail-overlay-time-status-renderer #time-status span#text'
            duration = video.locator(duration_selector).first.text_content()
            duration = duration.replace('\n', '').strip()
            print(f"duration {duration}")
            video_obj = {
                'thumbnail': thumbnail,
                'title': title,
                'views': views,
                'upload': upload,
                'duration': duration
            }
            print(video_obj)
            video_tresor.append(video_obj)
        sleep(10)
        print(video_tresor)
        browser.close()

def to_database():
    client = pymongo.MongoClient(database_url)
    print('Connected to database')
    try:
        db = client['database_name']  # replace with your database name
        collection = db['collection_name']  # replace with your collection name
        for i, video in enumerate(video_tresor):
            print(f'inserting item {i + 1} of {len(video_tresor)}')
            collection.insert_one(dict(video))
    finally:
        client.close()

run_spider()
to_database()
Now we can easily slurp up all the data we want and have it saved automatically to our database.
Woohoo!