YouTube Views Prediction for Content Creators

Mayuresh Kadam · Published in Analytics Vidhya · 7 min read · May 24, 2020

A self-starter project that uses data science techniques to predict views for your YouTube videos

Being a content creator on YouTube is indeed challenging, and if you are a YouTuber who has just joined the platform (just like me), chances are you are getting far fewer views than you expected. This project should help fellow YouTubers who want to know how many views their video might get. Before we begin, it would be great if you could check out my channel and subscribe if you like the videos, as it helps with the algorithm.

To begin with, I was faced with the challenge of collecting the data. I could have collected it with a web scraper (if you want to know how to scrape, do check out my article here) or used the YouTube-8M dataset provided by Google. But to cultivate new skills, I created a synthetic dataset with approximately 10k records.

What is a Synthetic Dataset?

As the name suggests, a synthetic dataset is a repository of data that is generated programmatically, so it is not collected by any real-life survey or experiment. Its main purpose is to be flexible and rich enough to help an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms. Desired properties include the following (a short sketch after the list illustrates a few of them):

  • It can be numerical, binary, or categorical (ordinal or non-ordinal).
  • The number of features and the length of the dataset should be arbitrary.
  • It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base the data upon, i.e. the underlying random process can be precisely controlled and tuned.
  • If it is used for classification algorithms, the degree of class separation should be controllable to make the learning problem easy or hard.
  • Random noise can be injected in a controllable manner.
  • For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
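
This project uses Mimesis (more on that below), but scikit-learn's built-in generators are a quick way to illustrate the properties above. A minimal sketch, with illustrative parameter values of my own choosing:

```python
# Sketch: scikit-learn's synthetic-data generators let you tune feature count,
# label noise, and class separation directly. Parameter values are illustrative.
from sklearn.datasets import make_classification, make_regression

# Classification data with a controllable degree of class separation.
X_cls, y_cls = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    class_sep=0.8,      # smaller -> harder learning problem
    flip_y=0.05,        # label noise, injected in a controllable way
    random_state=42,
)

# Regression data with controllable Gaussian noise on the target.
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, noise=15.0, random_state=42,
)

print(X_cls.shape, X_reg.shape)
```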

There are a few libraries in Python for creating synthetic datasets, namely Faker and Mimesis. Both get the job done, but when it comes to speed, memory usage, and complete control over the data, Mimesis wins hands down. The dataset for this project was created using the Mimesis package; visit Faker and Mimesis to learn more.

Let's begin by importing all the necessary libraries and instantiating the various classes used to generate the synthetic data.
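
The original snippet is shown as an image in the post; here is a hedged sketch of what the imports and instantiation might look like. The specific Mimesis providers and the CSV file name are my assumptions, not the post's actual code:

```python
# Sketch of the setup step: import the libraries and instantiate Mimesis providers.
import numpy as np
import pandas as pd
from mimesis import Person, Internet, Text

person = Person()      # names, usernames, etc.
internet = Internet()  # URLs for profiles, trailers, thumbnails
text = Text()          # titles, descriptions

df = pd.read_csv("USvideos.csv")  # hypothetical file name for the base CSV
print(df.head())
```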

As you might have noticed, I've used a CSV file, so you might be wondering why I am creating synthetic data when I already have data in hand. To answer this, the core idea of the project is to give smaller YouTube creators an insight into views prediction. Moving on, I selected only the rows whose category name was “Music” or “Entertainment” and joined both to get a new dataframe. Along with that, another dataframe was created to store the videos of channels that had fewer than 8 videos (since I wanted to create a dataset suitable for small YouTubers).

The least interesting features of this data were the country and the location of the video, so I removed them altogether.

As you can see, the data for the category_id and category_name columns is replicated 10k times, since I didn't want any category other than “Music” and “Entertainment”.
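
A hedged pandas sketch of the filtering and replication steps described above. The column names (category_name, channel_title, country, location) are assumptions based on the description, and df is the dataframe loaded in the previous snippet:

```python
import pandas as pd

# Keep only the "Music" and "Entertainment" categories and combine them.
music = df[df["category_name"] == "Music"]
entertainment = df[df["category_name"] == "Entertainment"]
small_cats = pd.concat([music, entertainment], ignore_index=True)

# Keep only channels that have fewer than 8 videos (small-creator profile).
video_counts = small_cats["channel_title"].value_counts()
small_channels = video_counts[video_counts < 8].index
small_df = small_cats[small_cats["channel_title"].isin(small_channels)]

# Drop the least interesting features: country and location.
small_df = small_df.drop(columns=["country", "location"], errors="ignore")

# Replicate the category columns up to the target size of ~10k rows.
n_rows = 10_000
categories = small_df[["category_id", "category_name"]].sample(
    n=n_rows, replace=True, random_state=42
).reset_index(drop=True)
```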

Mimesis is put to work to create the “Profile_urls”, “Trailer_url”, and “Picture_Url” fields. Since the data had already been filtered down to channels with fewer than 8 videos, the existing values were not enough, so these fields had to be generated again. Another CSV file is used to pull the “tags” and “video_id” fields: they were missing from the earlier CSV, and the tags in particular needed to stay relevant to the video. When I tried generating synthetic data for that field, the accuracy suffered, so I used real values instead.
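
A hedged sketch of this step, continuing from the snippets above. The Mimesis home_page() provider is what I would reach for to generate URLs, and the second CSV's name is hypothetical; the original code is shown as an image in the post:

```python
import pandas as pd
from mimesis import Internet

internet = Internet()
n_rows = 10_000

# Generate synthetic URL fields (home_page() returns a random "https://..." address).
categories["Profile_urls"] = [internet.home_page() for _ in range(n_rows)]
categories["Trailer_url"] = [internet.home_page() for _ in range(n_rows)]
categories["Picture_Url"] = [internet.home_page() for _ in range(n_rows)]

# Real tags and video_id come from a second CSV (hypothetical file name).
extra = pd.read_csv("tags_and_ids.csv")
categories["tags"] = extra["tags"].values[:n_rows]
categories["video_id"] = extra["video_id"].values[:n_rows]
```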

The slice in the snippet limits the number of rows being accessed from the CSV.

To complete the synthetic dataset, a function was created to cover the remaining fields. Since a small YouTuber is unlikely to have more than 2k subscribers (followers), I set the threshold at 150. The rest of the logic is pretty basic.
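
A hedged sketch of what such a field-generating function might look like. One reading of the 150 threshold is as an upper bound on subscriber counts; all other field names and ranges are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_remaining_fields(n_rows: int) -> dict:
    """Sketch: generate the remaining numeric fields for a small-creator dataset."""
    return {
        # Small YouTubers: cap subscribers well below 2k, using 150 as the threshold.
        "subscribers": rng.integers(0, 150, size=n_rows),
        "likes": rng.integers(0, 500, size=n_rows),          # assumed range
        "dislikes": rng.integers(0, 100, size=n_rows),        # assumed range
        "comment_count": rng.integers(0, 200, size=n_rows),   # assumed range
        # The target column also needs values; range assumed purely for illustration.
        "views": rng.integers(0, 5000, size=n_rows),
    }

for col, values in generate_remaining_fields(10_000).items():
    categories[col] = values
```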

Now that the synthetic dataset is complete, let's create the train, validation, and test sets. The dataset was split 80%-10%-10%: the model is trained on 80% of the data, and the rest is used for validation and testing.
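
A hedged sketch of the 80/10/10 split, calling scikit-learn's train_test_split twice and assuming the target column is named "views":

```python
from sklearn.model_selection import train_test_split

# Assumed layout: every column except "views" is a feature, "views" is the target.
X = categories.drop(columns=["views"])
y = categories["views"]

# First carve out 20% for validation + test, then split that portion half and half.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 8000 / 1000 / 1000
```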

Data Visualization: Visualizing the tags

In order to visualize any data, you first have to clean it. Here the cleaning was done on all three sets (train, validation, test), which brings us to…
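
The post shows the result as an image; here is one hedged way to clean and visualize the tags, as a top-20 frequency chart. The pipe-separated tag format and the chart type are assumptions, not necessarily what the original notebook did:

```python
from collections import Counter
import matplotlib.pyplot as plt

def clean_tags(series):
    """Sketch: lowercase, strip quotes, and split pipe-separated tag strings."""
    tags = []
    for value in series.dropna():
        for tag in str(value).lower().replace('"', "").split("|"):
            tag = tag.strip()
            if tag and tag != "[none]":
                tags.append(tag)
    return tags

tag_counts = Counter(clean_tags(X_train["tags"]))
top = tag_counts.most_common(20)

plt.figure(figsize=(10, 5))
plt.barh([t for t, _ in top][::-1], [c for _, c in top][::-1])
plt.xlabel("frequency")
plt.title("Top 20 tags in the training set")
plt.tight_layout()
plt.show()
```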

Beautiful, isn't it?

Data Modelling: Combining a Random Forest Regressor with an MLP classifier and averaging

Text data requires special preparation before you can start using it for predictive modeling. The text must be parsed to split out words, a step called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction on your text data.

We are using CountVectorizer to create a bag-of-words model. CountVectorizer comes with its own options to automatically do preprocessing, tokenization, and stop-word removal. NumPy arrays are easy to work with, so the result was converted to an array. fit_transform does two things here: first it fits the model and learns the vocabulary, and second it transforms our training data into feature vectors.
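
A hedged sketch of this bag-of-words step on the tags text. Stop-word removal is enabled as described; the exact options and the max_features cap are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words on the tags; stop_words="english" handles stop-word removal.
vectorizer = CountVectorizer(stop_words="english", max_features=1000)

# fit_transform: learn the vocabulary and turn the training text into count vectors.
X_train_bow = vectorizer.fit_transform(X_train["tags"].astype(str)).toarray()

# transform only (no re-fitting) for validation and test, so the vocabulary matches.
X_val_bow = vectorizer.transform(X_val["tags"].astype(str)).toarray()
X_test_bow = vectorizer.transform(X_test["tags"].astype(str)).toarray()

print(X_train_bow.shape)
```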

I am using Random Forest and a Neural Net to predict the views.

A Random Forest is a flexible, easy-to-use machine learning algorithm that produces great results most of the time with minimal time spent on hyper-parameter tuning. It has gained popularity due to its simplicity and the fact that it can be used for both classification and regression tasks.

The Multi-layer Perceptron classifier, as the name suggests, is a neural-network-based model. An MLPClassifier relies on an underlying neural network to perform the task of classification. The model optimizes the log-loss function using stochastic gradient descent.
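
A hedged sketch of fitting the two models on the bag-of-words features. The post names MLPClassifier; since views are a continuous target, I use MLPRegressor here as the neural-net analogue (a deliberate swap for the sketch), and the hyper-parameters are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Random forest: a reasonable baseline with little tuning.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_bow, y_train)

# Neural net: the post uses MLPClassifier; MLPRegressor is its regression counterpart.
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
mlp.fit(X_train_bow, y_train)

rf_pred = rf.predict(X_val_bow)
mlp_pred = mlp.predict(X_val_bow)
```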

Here's the prediction output.

The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them. The squaring removes any negative signs and also gives more weight to larger differences. It's called the mean squared error because you're finding the average of a set of squared errors. The mean squared error we obtained is:
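
For reference, a minimal sketch of computing the metric on the validation predictions from the previous snippet, both by the formula and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE = average of the squared differences between actual and predicted views.
mse_manual = np.mean((y_val - rf_pred) ** 2)
mse_sklearn = mean_squared_error(y_val, rf_pred)

print(mse_manual, mse_sklearn)  # the two values agree
```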

That's too high!

We've also used another ensemble method called averaging. A model averaging ensemble combines the predictions from each model equally and often results in better performance on average than any single model. A weighted average ensemble is an approach that allows multiple models to contribute to a prediction in proportion to their trust or estimated performance.
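
A hedged sketch of simple and weighted averaging of the two models' predictions; the 0.6/0.4 weights are illustrative, not the post's actual values:

```python
from sklearn.metrics import mean_squared_error

# Simple average: both models contribute equally.
avg_pred = (rf_pred + mlp_pred) / 2.0

# Weighted average: weights reflect how much we trust each model (illustrative values).
w_rf, w_mlp = 0.6, 0.4
weighted_pred = w_rf * rf_pred + w_mlp * mlp_pred

print(mean_squared_error(y_val, avg_pred), mean_squared_error(y_val, weighted_pred))
```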

Final Step: Using Test data to evaluate the performance of the model

Yay! We got the predictions! Now that we have the data, the model, and the predictions, the next step is to deploy it on the web so it can be put to use by my fellow YouTubers. One way I've thought of is saving the model weights or using a pickle. What are your thoughts on this? Leave a comment down below. You can find my Jupyter Notebook for this example on my GitHub. Thanks for reading!
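
For that deployment idea, a hedged sketch of persisting and reloading the fitted models with joblib (pickle works the same way); the file names and the example tag string are hypothetical:

```python
import joblib

# Persist the fitted models and the vectorizer so a web app can reload them later.
joblib.dump(rf, "rf_views_model.joblib")
joblib.dump(mlp, "mlp_views_model.joblib")
joblib.dump(vectorizer, "tags_vectorizer.joblib")

# Later, inside the web app:
rf_loaded = joblib.load("rf_views_model.joblib")
vectorizer_loaded = joblib.load("tags_vectorizer.joblib")
new_features = vectorizer_loaded.transform(["my new video | vlog | travel"]).toarray()
print(rf_loaded.predict(new_features))
```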
