Analysis & Prediction Of Dislikes On YouTube Data

Introduction:

Data Acquisition & Processing:

  1. Convert date columns to datetime format.
  2. Convert dict columns to JSON strings suitable for import into a PostgreSQL JSONB column.
  3. Calculate the initial ratios: view_like_ratio, view_dislike_ratio, and like_dislike_ratio.
  4. Drop rows with an INF or NaN like_dislike_ratio to reduce dataset size and avoid potential errors later on; we mainly care about videos that actually have dislikes.
  5. Rename and reorder columns to match the database table schema.
  6. Load the data into PostgreSQL in batches via pd.to_sql with multi mode enabled and a default chunksize of 10,000.
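The steps above can be sketched with pandas; the column names (`published_at`, `tags`) and the PostgreSQL connection string are assumptions standing in for the real schema:

```python
import json

import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1. Convert date columns to datetime (column name is an assumption).
    df["published_at"] = pd.to_datetime(df["published_at"])
    # 2. Serialize dict columns to JSON strings for a JSONB column.
    df["tags"] = df["tags"].apply(json.dumps)
    # 3. Compute the initial ratio features.
    df["view_like_ratio"] = df["view_count"] / df["like_count"]
    df["view_dislike_ratio"] = df["view_count"] / df["dislike_count"]
    df["like_dislike_ratio"] = df["like_count"] / df["dislike_count"]
    # 4. Drop INF/NaN like_dislike_ratio rows -- keep only videos
    #    that actually have dislikes.
    df = df[np.isfinite(df["like_dislike_ratio"])]
    # 5. Rename/reorder columns to match the table schema (names assumed).
    df = df.rename(columns={"published_at": "published_date"})
    # 6. Batch-load into PostgreSQL (connection string is a placeholder):
    # engine = sqlalchemy.create_engine("postgresql://user:pw@host/youtube")
    # df.to_sql("videos", engine, method="multi", chunksize=10_000)
    return df
```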
  • Exporting the top 500,000 most disliked rows based on dislike_like_ratio.
  • Exporting the top 500,000 most liked rows based on dislike_like_ratio.
  • Exporting a 1% random sample, resulting in over 800,000 rows.
  • Exporting a 10% random sample, used to test whether substantially more data improves predictive performance.
  • Exporting a 0.2% random sample, used for final testing.
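In pandas terms, these exports look roughly like the sketch below; treating "most liked" as the lowest dislike_like_ratio and the fixed random seed are assumptions:

```python
import pandas as pd


def make_exports(df: pd.DataFrame) -> dict:
    """Build the five export subsets described above."""
    return {
        # Highest dislike_like_ratio = most disliked relative to likes.
        "most_disliked": df.nlargest(500_000, "dislike_like_ratio"),
        # Lowest dislike_like_ratio = most liked (assumed interpretation).
        "most_liked": df.nsmallest(500_000, "dislike_like_ratio"),
        # Random samples; the seed is arbitrary, for reproducibility.
        "sample_1pct": df.sample(frac=0.01, random_state=42),
        "sample_10pct": df.sample(frac=0.10, random_state=42),
        "sample_0_2pct": df.sample(frac=0.002, random_state=42),
    }
```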

Methodology:

Downloading Comment Data

Feature Engineering

  • LD Score: Likes / (Likes + Dislikes)
  • LD Score OHE: the decimal LD Score converted to a categorical value: -1 (negative), 0 (neutral), or 1 (positive).
  • View_Like Ratio: view_count / like_count
  • View_Like Ratio Smoothed: if like_count is 0, we add 1 to both view_count and like_count to avoid division by zero.
  • View_Dislike Ratio: view_count / dislike_count
  • Dislike-Like Ratio: dislike_count / like_count (smoothed by 1 to avoid division by zero)
  • NoCommentsBinary: 0 if the video had comments, 1 if it had none when we attempted to pull the data.
  • VADER Sentiment Scores: (neg, neu, pos, compound)
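The engineered features above can be sketched as follows; the LD Score category cut points (0.4/0.6), the exact smoothing for the dislike-like ratio, and deriving NoCommentsBinary from `comment_count` are all assumptions:

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    likes, dislikes = df["like_count"], df["dislike_count"]
    views = df["view_count"]
    # LD Score: fraction of reactions that are likes.
    df["ld_score"] = likes / (likes + dislikes)
    # Categorical LD Score; the 0.4/0.6 cut points are assumptions.
    df["ld_score_ohe"] = np.select(
        [df["ld_score"] < 0.4, df["ld_score"] > 0.6], [-1, 1], default=0
    )
    df["view_like_ratio"] = views / likes
    # Smoothed variant: add 1 to both counts when like_count is 0.
    df["view_like_ratio_smoothed"] = np.where(
        likes == 0, (views + 1) / (likes + 1), views / likes
    )
    df["view_dislike_ratio"] = views / dislikes
    # Dislike-like ratio, smoothed by 1 to avoid division by zero.
    df["dislike_like_ratio"] = dislikes / (likes + 1)
    # 1 when the video had no comments at pull time (approximation).
    df["no_comments_binary"] = (df["comment_count"] == 0).astype(int)
    # VADER scores (neg/neu/pos/compound) would come from running
    # vaderSentiment's SentimentIntensityAnalyzer over the comment text.
    return df
```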

Feature Selection

Preparing Data for Machine Learning

Model Testing

  1. Dummy Classifier: used to establish a performance baseline. This model returned an accuracy of 57%, a weighted F1 score of 58%, and an MCC of 0.0008, indicating no correlation between predictions and labels.
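A baseline like this can be reproduced with scikit-learn's DummyClassifier; the synthetic data and the `most_frequent` strategy below are assumptions, so the scores will differ from the article's:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Synthetic stand-in data; the real pipeline uses the engineered features
# and the -1/0/1 LD Score categories as the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.choice([-1, 0, 1], size=1000, p=[0.2, 0.2, 0.6])

# Always predict the majority class -- any real model must beat this.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(f"accuracy    = {accuracy_score(y, pred):.2f}")
print(f"weighted F1 = {f1_score(y, pred, average='weighted'):.2f}")
print(f"MCC         = {matthews_corrcoef(y, pred):.4f}")
```

A constant predictor yields an MCC of 0, which is why an MCC near zero signals a model no better than chance.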

Results & Discussion

Web App

  • The server takes in the video and runs our YouTube API calls and web scraper to gather the data needed to fill the columns our model was trained on.
  • We run the retrieved data through a processing pipeline similar to the one used in training to generate a data frame suitable for model inference.
  • We load our trained Random Forest model (exported from our overall pipeline as a .joblib.pkl file) on the server.
  • The model generates a prediction from the provided data, which we then display to the user along with other relevant video information.
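A minimal sketch of the server-side inference step; the feature names, file path, and helper functions here are hypothetical stand-ins, not the actual app code:

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed column names standing in for the columns the model was trained on.
FEATURES = ["view_like_ratio", "view_dislike_ratio", "vader_compound"]


def load_model(path: str) -> RandomForestClassifier:
    # The trained Random Forest is exported as a .joblib.pkl file.
    return joblib.load(path)


def predict_for_video(model, scraped: dict) -> int:
    # Arrange the scraped/API fields into the column order the
    # model expects, then run a single-row inference.
    row = pd.DataFrame([scraped], columns=FEATURES)
    return int(model.predict(row)[0])
```

Loading the model once at server startup (rather than per request) keeps inference latency down to the cost of a single `predict` call.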
Save The Dislikes Web Application

Future Direction

Conclusion

Statement of Work

Connect With Us
