Scraping Youtube Comments for NLP Analysis

Ashwin Singaram
5 min read · Jun 20, 2024


Thanks for dropping by; I hope this article helps you in some way or another!

Photo by Alexander Shatov on Unsplash

I wanted to try out a simple NLP project in which we scrape the comments from a YouTube video and do some analysis.

I was actually researching perfumes. Or rather, I got dragged into it when I was mindlessly scrolling Instagram and got hit by a perfume ad, thereby magically landing on perfume reviews, through which I found these perfume review videos by Demi Rawling… (hot af).

YouTube link — https://youtu.be/oJqc2tLMObg

It's a video on the Top 10 Tom Ford perfumes. Quite an old video though (4 years).

First, install ChromeDriver from right here if you don't already have it. (On Selenium 4.6 and newer, Selenium Manager can usually download a matching driver for you automatically.) I ran this in VS Code. The libraries needed are:

import sys
import time
import pandas as pd
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

then comes the scraping part…

# Initialize the WebDriver with options to improve performance
options = webdriver.ChromeOptions()
options.add_argument("--headless") # Run in headless mode
options.add_argument("--disable-gpu") # Disable GPU rendering
options.add_argument("--no-sandbox") # Bypass OS security model
options.add_argument("--disable-dev-shm-usage") # Overcome limited resource problems
options.add_argument("--start-maximized") # Maximize window

driver = webdriver.Chrome(options=options)

data = []
youtube_video_url = "https://youtu.be/oJqc2tLMObg"
wait = WebDriverWait(driver, 30)

# Open the YouTube video URL
driver.get(youtube_video_url)
print("Opened YouTube URL")

# Scroll down to load comments
for item in range(150):  # Define the number of scrolls here
    try:
        body = wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body")))
        body.send_keys(Keys.END)
        sys.stdout.write(f"\rScrolled {item + 1} times")
        sys.stdout.flush()
        time.sleep(1.5)  # Give the page time to load more comments
    except Exception as e:
        print(f"Exception during scrolling: {e}")
        break

# Extract comments
try:
    # Wait for comment text elements to be present, then grab them all
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#content #content-text")))
    comments = driver.find_elements(By.CSS_SELECTOR, "#content #content-text")
    print(f"\nFound {len(comments)} comment elements")

    seen = set()  # Track comment texts so exact duplicates are skipped
    user_id = 1   # Initialize unique user ID
    for comment in comments:
        text = comment.text
        if text in seen:
            continue
        seen.add(text)
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        data.append({"User ID": user_id, "Comment": text, "Timestamp": timestamp})
        user_id += 1

    print(f"Comments captured: {len(data)}")
except Exception as e:
    print(f"Exception during comment extraction: {e}")

driver.quit()

# Create DataFrame
df = pd.DataFrame(data, columns=["User ID", "Comment", "Timestamp"])

# Display the DataFrame
print(df)

# Save DataFrame to a CSV file (optional)
df.to_csv("youtube_comments.csv", index=False)

Here, I've used the number of scrolls to decide how many comments to scrape. You could also convert it to a maximum number of comments, but this works best for now.
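For instance, here's a minimal sketch of that alternative, reusing the driver setup and the #content #content-text selector from above: keep scrolling until a target comment count is reached, with a scroll cap so the loop always terminates.

MAX_COMMENTS = 500   # hypothetical target, adjust as needed
MAX_SCROLLS = 300    # safety cap so the loop always stops

for _ in range(MAX_SCROLLS):
    loaded = len(driver.find_elements(By.CSS_SELECTOR, "#content #content-text"))
    if loaded >= MAX_COMMENTS:
        break
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
    time.sleep(1.5)  # give the page time to load the next batch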

So there are a couple of assumptions and constraints here.

Assumptions and constraints: I tried scraping the username of each comment so that multiple comments by the same author could be combined, but that was taking the Chrome driver a lot of time. Assuming comments aren't mindlessly repeated over and over, I've simply treated each comment as unique, with a timestamp added at extraction time.
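If you do want the usernames, a sketch along these lines should work. It assumes each comment thread is a ytd-comment-thread-renderer element with an #author-text element next to the #content-text, which is how YouTube's markup has commonly looked, but these selectors may change over time:

# Iterate over whole comment threads so author and text stay paired
threads = driver.find_elements(By.CSS_SELECTOR, "ytd-comment-thread-renderer")
for thread in threads:
    author = thread.find_element(By.CSS_SELECTOR, "#author-text").text.strip()
    text = thread.find_element(By.CSS_SELECTOR, "#content-text").text
    data.append({"Author": author, "Comment": text})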

I ran the Chrome driver headless for faster response, but you could also try commenting that option out and running the code. If you find any other, faster way to scrape, please let me know in the comments. I had initially referenced the code by François St-Amant.

Link — https://towardsdatascience.com/how-to-scrape-youtube-comments-with-python-61ff197115d

His version collects just the comment text into a single-column DataFrame:

import pandas as pd
df = pd.DataFrame(data, columns=['comment'])
df.head()

So now we need the list of perfumes mentioned in the video as a second DataFrame. I pulled this list out manually (with some help from Perplexity.ai); there are ways to do this programmatically by parsing the video with Python, but let's save that for another day. You could also check out the YouTube video summarizer that my friend Priyanshu Shukla built for this: https://medium.com/@priyanshu-shkl7/implementing-generative-ai-into-your-apps-web-scraping-with-genai-f08711a404cb
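For concreteness, here's a hypothetical version of that second DataFrame, df2. The real list comes from the video; the names below are just a few well-known Tom Ford fragrances used for illustration:

import pandas as pd

# Hypothetical, manually compiled list; replace with the names from the video
perfume_names = [
    "Black Orchid",
    "Oud Wood",
    "Tobacco Vanille",
    "Beau de Jour",
]
df2 = pd.DataFrame(perfume_names, columns=["Perfume"])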

Now, moving to the NLP part of things. Keeping it simple, I'm trying to get all the mentions of the different Tom Ford fragrances within the comments: basically attributing each fragrance based on mentions and ordering them by most activity.

There are multiple ways to achieve this. I have used NLTK (Natural Language Toolkit), a powerful Python library for working with human language data. It's widely used for natural language processing (NLP) tasks such as keyword matching and sentiment analysis.

We could also use semantic string matching with any of the best available sentence transformers from Hugging Face, or use one of the available LLMs to do the same (though for production deployments on large datasets, sentence transformers would suit best).
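As a rough sketch of that alternative (this assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, neither of which is used in this article), you could embed comments and perfume names and treat a high cosine similarity as a mention:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

comments = ["I own Black Orchid and wear it daily", "Oud Wood is overrated"]
perfumes = ["Black Orchid", "Oud Wood", "Tobacco Vanille"]

# Encode both sides and compute pairwise cosine similarity
comment_emb = model.encode(comments, convert_to_tensor=True)
perfume_emb = model.encode(perfumes, convert_to_tensor=True)
scores = util.cos_sim(comment_emb, perfume_emb)  # shape: (n_comments, n_perfumes)

# Treat similarity above a threshold as a mention
for i, comment in enumerate(comments):
    for j, perfume in enumerate(perfumes):
        if scores[i][j] > 0.4:  # the threshold is a judgment call
            print(f"{perfume!r} matched in: {comment!r}")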

!pip install nltk
import nltk
nltk.download('vader_lexicon')
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

def calculate_sentiment_scores(comments, perfumes):
    sentiment_scores = {str(perfume): [] for perfume in perfumes}

    sid = SentimentIntensityAnalyzer()

    for _, row in comments.iterrows():
        comment = str(row['Comment'])
        user_id = row['User ID']
        sentiment_dict = sid.polarity_scores(comment)

        compound_score = sentiment_dict['compound']

        # Attribute the comment's sentiment to every perfume it mentions
        for perfume in perfumes:
            if str(perfume).lower() in comment.lower():
                sentiment_scores[str(perfume)].append((user_id, comment, compound_score))

    return sentiment_scores

# The comments DataFrame built by the scraper above (User ID, Comment, Timestamp)
df_comments = df

# Load the perfumes DataFrame
df_perfumes = df2

# Extract the perfume names from the first column of df_perfumes
perfumes = df_perfumes.iloc[:, 0].tolist()

# Calculate sentiment scores for each perfume in the comments
sentiment_scores = calculate_sentiment_scores(df_comments, perfumes)

# Create a new DataFrame with 'user_id', 'comment', 'perfume', and 'sentiment_score' columns
data = []
for perfume, user_comment_list in sentiment_scores.items():
    for user_id, comment, sentiment_score in user_comment_list:
        data.append([user_id, comment, perfume, sentiment_score])

df_result = pd.DataFrame(data, columns=['user_id', 'comment', 'perfume', 'sentiment_score'])

# Print the resulting DataFrame
print("User Comments with Sentiment Scores:")
df_result

As you can see above, a single comment sometimes mentions more than one perfume, so a score is calculated for each of them. That's why the “I own black orchid”… comment also shows Beau de Jour: it comes up twice, once for Beau de Jour and once for Black Orchid.
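From here, the ranking described earlier (attributing each fragrance by mentions and ordering by activity) is a short groupby on df_result; a minimal sketch:

# Aggregate mentions and average sentiment per perfume, most-mentioned first
summary = (
    df_result.groupby("perfume")
    .agg(mentions=("comment", "count"), avg_sentiment=("sentiment_score", "mean"))
    .sort_values("mentions", ascending=False)
)
print(summary)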

But there's more to this. Let's say we want to verify this sentiment analysis and keyword matching; for that, I've built a simple dashboard with Tableau that lets us quickly filter through the perfumes and their sentiment scores.

https://public.tableau.com/views/YoutubeSentimentAnalysis/YoutubeComments-NLPAnalysis?:language=en-GB&:sid=&:display_count=n&:origin=viz_share_link

Please feel free to share your thoughts or comments, and follow me! Thanks for reading!
