Predicting YouTube Dislikes using Machine Learning

Ramtin Ardeshirifar
Mar 9, 2022 · 7 min read
Image by Author, inspired by Christian Wiediger on Unsplash.

YouTube has more than one billion monthly users who together watch over a billion hours of video per day. Users can show how they feel about a video by pressing the like and dislike buttons, but as of December 13th, 2021, YouTube no longer displays public dislike counts. The move sparked controversy: some say that public dislike counts can negatively impact content creators, while others argue that viewers should be able to see them. In this article, I use CatBoost to train a model on the numerical features of each YouTube video (e.g., the number of views, comments, and likes) along with sentiment scores for the video descriptions and comments produced by the VADER sentiment analysis model.

Introduction

YouTube is a video-sharing and social media platform owned by Google. It was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim and is the second most visited website after Google. Prior to 2010, YouTube used a star-based rating system in which users rated videos from 1 to 5 (1 being bad, 5 being great). Over time, YouTube staff observed that 2-, 3-, and 4-star ratings were used far less often than 1 and 5, so it made sense to switch to the simple like-or-dislike system we are familiar with today.

YouTube provides a dislike button for users to express their feelings about a video, but since December 13th, 2021, dislike counts are no longer visible to the public. YouTube has stated that this move better protects creators from harassment and reduces dislike attacks. However, some users argue that the public should be able to see a video's dislike count before watching it rather than rely on the like count alone. YouTube hosts a huge number of videos, and some of them are low-quality or misleading and consequently receive more dislikes than others. Seeing how many dislikes a video has before watching it could give users insight into its quality.

For this article, I trained a model that predicts how many dislikes a video has. To analyze non-numerical features such as video titles, descriptions, and comments, I applied the VADER sentiment analysis model [1], which translates each text into a set of scores describing how strongly it conveys a positive, negative, or neutral opinion.

Related works

As YouTube has recently begun hiding dislikes, few studies have been conducted directly on this topic.

The report “How Useful Are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings” [2] studies the correlation between comment sentiment and comment rating, i.e., the likes or dislikes on the comment itself. The authors concluded that it is indeed possible to build a classifier that accurately predicts which comments are useful, and that comments containing discriminatory language tend not to be rated as useful.

The study “Predicting like-ratio on YouTube videos using sentiment analysis on comments” [3] by Hyberg and Isaacs investigates whether there is a correlation between the emotional sentiment of a YouTube video’s comments and its actual like-to-dislike ratio. The study frames this as a classification problem: video comments are divided into just two classes, and the ratio of positive to negative comments is compared with the like/dislike ratio of the videos.

In the Medium article “Predicting the Number of Dislikes on YouTube Videos,” Dmytro Nikolaiev (Dimid) predicted the number of dislikes using neural networks, including bidirectional LSTMs, to analyze the textual features of video descriptions and comments and merge them with numerical features (e.g., the number of views and likes). Although this article uses the same dataset, the methods are entirely different. The mean absolute error (MAE) reported on that article’s validation data was about 6,200.

A Google Chrome browser extension called “Return YouTube Dislike” combines dislike data archived before YouTube’s official dislike API shut down with extrapolated behavior of extension users. The extension does not use machine learning; it relies solely on the archived dislike data and on users who dislike a video while the extension is installed to keep counts updated for everyone else.

Materials and methods

Data collection

I used the “YouTube Dislike Dataset” from Kaggle, collected by Nikolaiev via the YouTube Data API. It contains trending YouTube videos from August 2020 to December 2021 for the United States, Canada, and Great Britain. The dataset’s columns include identifiers (video_id, channel_id), text fields (title, channel_title, tags, description, comments), and numeric fields (view_count, likes, dislikes, and comment_count), with dislikes serving as the prediction target.
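As a minimal sketch (the exact CSV file name is an assumption, so adjust it to the file shipped with the Kaggle dataset), the data can be loaded and inspected with pandas:

```python
import pandas as pd

# Hypothetical file name; use the CSV from the Kaggle dataset.
df = pd.read_csv("youtube_dislike_dataset.csv")

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column names and their data types
```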

Data preprocessing

The aim of this research is to predict YouTube dislikes. We began by removing columns that carry little usable information: ‘title’, ‘video_id’, ‘channel_title’, and ‘channel_id’. Then, using the Texthero library, we clean the ‘description’, ‘comments’, and ‘tags’ text columns. Texthero’s default clean pipeline fills null values, converts text to lower case, and removes digits, punctuation, diacritics, stopwords, and extra whitespace.
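A minimal sketch of this step, assuming the data frame df loaded above:

```python
import texthero as hero

# Drop identifier columns that carry little predictive signal.
df = df.drop(columns=["title", "video_id", "channel_title", "channel_id"])

# hero.clean applies Texthero's default pipeline: fill NaNs, lowercase,
# and remove digits, punctuation, diacritics, stopwords, and whitespace.
for col in ["description", "comments", "tags"]:
    df[col] = hero.clean(df[col])
```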

Sentiment analysis on texts

I used VADER Sentiment Analysis to analyze text sentiment. VADER, short for Valence Aware Dictionary and sEntiment Reasoner, is a lexicon- and rule-based sentiment analysis tool tuned to social media text. Using VADER, we analyze the comments, tags, and description of each video to determine their polarity. This gives us four values per text feature: positive, neutral, negative, and compound sentiment scores. The compound score is computed by summing the valence scores of the words in a text and normalizing the result to between -1 (most extreme negative) and +1 (most extreme positive); in general, the closer the compound score is to +1, the more positive the text. We will merge all 16 columns (the twelve sentiment scores plus the four numerical columns: views, likes, dislikes, and comment count) into one data frame for training our model.
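A minimal sketch of the scoring step, using the vaderSentiment package and the cleaned data frame df from above:

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
def sentiment_scores(text):
    return analyzer.polarity_scores(str(text))

# Expand each text column into four sentiment columns, e.g.
# description_pos, description_neu, description_neg, description_compound.
for col in ["description", "comments", "tags"]:
    scores = df[col].apply(sentiment_scores).apply(pd.Series)
    scores.columns = [f"{col}_{name}" for name in scores.columns]
    df = df.join(scores)
```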

Merging data

Using the sentiment analysis data frame we produced in the previous step along with selected features from the original dataset, we will create a single data frame.
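A minimal sketch, assuming the sentiment columns added above and the Kaggle dataset’s numerical column names:

```python
# Feature matrix: numerical columns plus the 12 sentiment scores.
numeric_features = ["view_count", "likes", "comment_count"]
sentiment_features = [
    f"{col}_{score}"
    for col in ["description", "comments", "tags"]
    for score in ["pos", "neu", "neg", "compound"]
]

X = df[numeric_features + sentiment_features]
y = df["dislikes"]  # prediction target
```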

Training the model

To create and train a machine learning model, we are going to use CatBoost, an open-source software library that provides a gradient boosting framework.
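A minimal sketch; the hyperparameters here are illustrative, not the exact configuration used in the notebook:

```python
from catboost import CatBoostRegressor

# Optimize mean absolute error directly, since MAE is our evaluation metric.
model = CatBoostRegressor(
    loss_function="MAE",
    iterations=1000,     # illustrative values, not tuned
    learning_rate=0.1,
    verbose=False,
)
model.fit(X, y)
```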

Results/Discussion

To estimate our model’s performance, we use K-fold cross-validation. Cross-validation trains and tests a model on different partitions of the data across iterations; it is most often used to estimate how well a predictive model will perform in practice.

Image by Gufosowa from Wikimedia Commons

During the procedure, a single parameter called K determines how many groups a given data sample will be divided into. Therefore, it is commonly referred to as K-fold cross-validation.

This method is popular because it is simple to understand and generally results in less biased estimation of model skill than other methods, such as a simple train/test split. In our work, we will use 5-fold cross-validation.
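A minimal sketch of the 5-fold evaluation, reusing the feature matrix X and target y from above:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_maes = []

for train_idx, val_idx in kf.split(X):
    model = CatBoostRegressor(loss_function="MAE", verbose=False)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[val_idx])
    fold_maes.append(mean_absolute_error(y.iloc[val_idx], preds))

print(f"Mean MAE across folds: {np.mean(fold_maes):,.0f}")
```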

With this approach, I was able to get an MAE of around 2,800. Considering that some videos have hundreds of thousands of dislikes, this is a pretty good result.

The relationship between MAE and video dislike count

The relationship between MAE and video view count

We can also plot the SHAP values of each feature for every sample to get an overview of which features are most important to the model. In the plot below, features are sorted by the sum of SHAP value magnitudes across all samples, and the SHAP values show the distribution of each feature’s impact on the model output. Colors represent feature values (red is high, blue is low). For example, a high comment_count (the number of comments on a video) increases the predicted number of dislikes.

The impact of each feature on the model output.
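A minimal sketch of producing this plot with the shap library and the fitted CatBoost model from above:

```python
import shap

# TreeExplainer supports CatBoost models; summary_plot draws the
# beeswarm of per-sample SHAP values, sorted by overall importance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```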

To get a cleaner plot, I removed outliers from the training dataset using Z-scores.
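A sketch of one common way to do this; the exact threshold used in the notebook is an assumption (|z| &gt; 3 is a conventional cutoff):

```python
import numpy as np
from scipy import stats

# Keep only rows where every numerical feature lies within three
# standard deviations of its column mean.
z = np.abs(stats.zscore(df[numeric_features]))
df_no_outliers = df[(z < 3).all(axis=1)]
```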

The notebook and all the code for this article are available on GitHub.

Update: a new notebook using a bag-of-words (BoW) model has been added to the GitHub repository. That approach achieved a lower MAE of around 2,700.
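As a rough illustration of the bag-of-words idea (the update notebook’s exact setup is not described here, so the vectorizer settings below are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Represent each cleaned description as counts of its 1,000 most
# frequent tokens; these sparse features could then be fed to the
# model alongside the numerical columns.
vectorizer = CountVectorizer(max_features=1000)
bow_features = vectorizer.fit_transform(df["description"])
```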

References

[1] C. Hutto and E. Gilbert, VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text (2014), Eighth International Conference on Weblogs and Social Media (ICWSM-14)

[2] S. Siersdorfer, S. Chelaru, W. Nejdl, and J. San Pedro, How Useful Are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings (2010), Proceedings of the 19th International Conference on World Wide Web (WWW ’10), Association for Computing Machinery, New York, NY, USA

[3] M. Hyberg and T. Isaacs, Predicting Like-Ratio on YouTube Videos Using Sentiment Analysis on Comments (2018), Digitala Vetenskapliga Arkivet
