YouTube View Prediction with Machine Learning
Before you put in the effort to create a video, wouldn't you like an idea of how many views it is going to get?
This article will guide you through creating a model that predicts YouTube views for a video that does not yet exist on YouTube.
As we all know, using and watching YouTube videos is an important part of our everyday lives. Many people are trying to build their influence, income, and impact with YouTube and online video; in a nutshell, everyone is trying to be a YouTube influencer. It would be nice if a YouTube influencer could get an idea of the view count before making and finalizing the video. Here, we tried to create a model that can help influencers predict the number of views for their next video.
You can go through the steps below and come up with a model to predict the view count.
1. Collecting Data
In our model, data was collected from a Kaggle data set containing statistics on daily trending YouTube videos. This is the most relevant data set currently available for this task. If you want, you can create a data set from scratch by scraping data through the YouTube APIs.
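If you decide to build your own data set, the sketch below shows one minimal way to pull trending-video statistics through the YouTube Data API v3 videos endpoint using Python and the requests library. The API key placeholder and the selected fields are my own assumptions for illustration; they are not part of the Kaggle data set.

```python
import requests

API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"  # placeholder; create a key in the Google Cloud Console

def fetch_trending(region="US", max_results=50):
    """Fetch the currently most-popular videos for a region via the YouTube Data API v3."""
    url = "https://www.googleapis.com/youtube/v3/videos"
    params = {
        "part": "snippet,statistics",
        "chart": "mostPopular",
        "regionCode": region,
        "maxResults": max_results,
        "key": API_KEY,
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    # Keep only fields that roughly match the Kaggle columns
    return [
        {
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "category_id": item["snippet"]["categoryId"],
            "publish_time": item["snippet"]["publishedAt"],
            "views": int(item["statistics"].get("viewCount", 0)),
        }
        for item in response.json()["items"]
    ]

if __name__ == "__main__":
    print(fetch_trending()[:3])
```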
This data set includes several months of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, FR, RU, MX, KR, JP, and IN regions (USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India respectively) over the same time period. Each region's data is in a separate file.
To train our model we used the data related to the USA, which is in the USvideos.csv file. This data set contains a total of 16 columns, as shown below.
2. Feature Engineering and Data Preparation
2.1 Data Exploration
Feature Engineering plays an important role in data preparation, which is a key component of the AI workflow. There are some common feature engineering steps that can be done during data preparation, such as filling missing values, one-hot encoding, etc. Apart from that, there are many different ways to optimize data, such as removing unnecessary columns, normalization, column aggregation, row aggregation, and generating new features.
Our next step is to prepare the data in a way we can use to train a good model. First of all, we need to look at the current data and analyze it before going further. We chose the data from USvideos.csv. There are a total of 40,949 data rows in this CSV file. Next, we need to find out the distribution of features across these 40,949 rows.
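As a rough sketch of this first look at the data (assuming USvideos.csv from the Kaggle data set is in the working directory), pandas makes the basic checks easy:

```python
import pandas as pd

# Load the US trending-videos file from the Kaggle data set
df = pd.read_csv("USvideos.csv")

print(df.shape)              # roughly (40949, 16)
print(df.columns.tolist())   # the 16 column names
print(df.isnull().sum())     # missing values per column
print(df.describe())         # distribution summary of the numeric columns
```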
category_id - This is the first feature we are going to analyze. There are different categories on YouTube, and when uploading a video it is compulsory to select the category of your video. In this data set, category_id refers to the category, and the mapping to category names can be found in a separate JSON file. Since this is a mandatory requirement on YouTube, there are no missing values for this feature in the data set. From the graph below, you can get a rough idea of how category_id is distributed.
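Here is a small sketch of joining the category names onto the data, assuming the US_category_id.json file from the same Kaggle data set, which follows the YouTube API's category-list response format:

```python
import json
import pandas as pd

df = pd.read_csv("USvideos.csv")

# Map category_id -> category name using the JSON file shipped with the data set
with open("US_category_id.json") as f:
    categories = json.load(f)["items"]
id_to_name = {int(item["id"]): item["snippet"]["title"] for item in categories}

df["category_name"] = df["category_id"].map(id_to_name)
print(df["category_name"].value_counts())  # rough view of the category distribution
```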
publish_time - The publish time of the video, given in UTC. The publish times in this data set range from 23 Jul 2006 to 14 Jun 2018, while the mean value for this feature is 11 Feb 2018. From this, we can tell that a large amount of the data is from 2018. The graph below shows how the publish year is distributed in this data set.
You can see from the graph above that most of the data relates to 2016–2018. There are no missing values in the data set for this feature.
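A minimal sketch of how this publish-year distribution can be checked, assuming publish_time parses as a standard UTC timestamp:

```python
import pandas as pd

df = pd.read_csv("USvideos.csv")

# publish_time is a UTC timestamp in this data set
df["publish_time"] = pd.to_datetime(df["publish_time"], utc=True)

print(df["publish_time"].min(), df["publish_time"].max())      # earliest and latest publish dates
print(df["publish_time"].dt.year.value_counts().sort_index())  # number of rows per publish year
```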
views - This is the value we are going to predict. Analyzing the distribution of the prediction column is a must. The graph below gives a graphical idea of the view-count distribution of the videos in our data set.
According to the distribution, the view count is below 50 million for more than 40,000 of the records in our data set. When we analyze the view counts numerically, we see that the minimum view count is 549 and the maximum is 225 million, but the mean value is 2.6 million and the standard deviation is high, at 7.39 million.
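These summary numbers come straight from the target column; a quick sketch of how to reproduce them:

```python
import pandas as pd

df = pd.read_csv("USvideos.csv")

# Summary statistics for the prediction target
print(df["views"].describe())                  # count, mean, std, min, max
print(df["views"].quantile([0.5, 0.9, 0.99]))  # quantiles show how heavily skewed the distribution is
```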
2.2 Feature Selection
When it comes to Feature Engineering, selecting features and dropping unnecessary columns from the data set is a must. So as the first step of feature selection, we can drop the unnecessary columns in our data set, such as video_id, thumbnail_link, etc. In our case there are also some special columns we have to drop because they cannot be used for the prediction we are going to do. We are going to predict the view count for a video that does not yet exist, so we cannot use likes, dislikes, or comment count as features for our model; we have to drop them as well.
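A sketch of that column drop, using the column names mentioned explicitly above (the remaining drops hinted at by "etc." are left to the reader):

```python
import pandas as pd

df = pd.read_csv("USvideos.csv")

# Columns that carry no signal, plus columns that are unknowable before the video exists
drop_cols = ["video_id", "thumbnail_link", "likes", "dislikes", "comment_count"]
df = df.drop(columns=drop_cols)

print(df.columns.tolist())  # columns left for feature engineering
```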
The next valuable features in our data set are trending_date and publish_time. They are in two different time formats, so the first thing we have to do is convert them to the same format. Using these two directly as features will not give better performance; what we want is the difference between these two times, because that is the duration taken to reach the mentioned view count. We can generate many new features from these two columns; all of the generated features are shown in the diagram below.
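The full set of generated features is shown in the diagram; the sketch below only illustrates the idea with a few typical ones, and it assumes the trending_date column uses the data set's yy.dd.mm text format:

```python
import pandas as pd

df = pd.read_csv("USvideos.csv")

# Bring both timestamps into the same datetime representation
df["publish_time"] = pd.to_datetime(df["publish_time"], utc=True)
df["trending_date"] = pd.to_datetime(df["trending_date"], format="%y.%d.%m", utc=True)

# A few illustrative derived features; the diagram lists the full set
df["publish_hour"] = df["publish_time"].dt.hour
df["publish_weekday"] = df["publish_time"].dt.dayofweek
df["days_to_trend"] = (df["trending_date"] - df["publish_time"]).dt.days

print(df[["publish_hour", "publish_weekday", "days_to_trend"]].head())
```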
After doing all of the feature engineering above, the columns available in our data set are shown below. These are the columns we take into account when we create our training data set.
2.3 Data Selection
Since this data set consists of 40,949 rows and the data is skewed for some features, we have to select a subset of the data for training. Another special thing we did was to create two separate data sets: one for training with the description column and one without it. The data selection steps are listed below, along with the number of rows remaining after each one.
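The exact filtering steps and remaining row counts are the ones listed in this section; the snippet below is only a rough illustration of the with/without-description split, assuming pandas:

```python
import pandas as pd

df = pd.read_csv("USvideos.csv")

# Data set A: keep the description column, dropping rows where it is missing
with_description = df.dropna(subset=["description"])

# Data set B: drop the description column entirely
without_description = df.drop(columns=["description"])

print(len(with_description), len(without_description))
```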
3. Training
Now we have the two data sets we created, and the next step is to train on them and create a model with good performance. For the training experiments and model creation we used a tool called Navigator. Navigator is a lifesaver of a tool that automates the AI life cycle. I used it for feature engineering, training, model creation, and deployment. You can create your AI Service in Navigator by following the steps below.
This is the screen you will see after logging in to Navigator. From here, click Create an AI Service. A popup will ask for a name for your AI Service; give it any name you like. Then you will arrive at the screen below.
From here you can build your service in different ways, because many options are given and you have the flexibility to change them. This screen shows the options for importing data sets: you can select a data set from your S3 bucket or from your local machine. The S3 bucket appears as an option because I connected my AWS account to Navigator, which makes AWS features available here. This is the first place you get a chance to work with Simple Storage Service (S3).
The image above shows the sequence for importing a data set from your local machine. Since we are using a CSV data set, the What type of data set do you have? option can be kept at its default; you don't need to do anything there.
1. Select the option Local, since we are uploading our data set from the local machine. After selecting Local, another option will appear.
2. Specify the bucket you want to upload the data set to. Since we are working with AWS, Navigator imports our data set into an S3 bucket first and then uses it for training. You can choose any bucket you like from your AWS account.
3. After selecting the bucket, click Upload. The data will appear in the table after the upload finishes.
4. At the end, click Import Dataset. If it succeeds, it will look like the screen below.
Then we can move to our next step in Navigator, Feature Engineering. Navigator provides feature engineering and creates train and test data sets from our imported data set. It also performs the feature engineering needed to support Amazon SageMaker training.
5. Go to the Feature Engineering tab.
6. From the data set selection options for training the algorithm, select Use the imported dataset.
7. Then choose the prediction label from your data set columns. Every column that can serve as a prediction label is available in this drop-down, and you have to select the one you are going to predict. This matters because Amazon SageMaker expects the label column to be the first column in the training data set; Navigator does that feature engineering automatically, using our input as the prediction label (a small sketch of this reordering appears after this list).
8. Then specify the problem type. Since we are predicting numbers rather than doing classification, this is a regression problem, so choose Regression from the drop-down.
9. Then click Apply. After the apply process finishes, your screen will look like the one below.
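Navigator takes care of that label-first reordering automatically, but as a small illustration of what SageMaker's CSV convention expects (label column first, no header row), a manual pandas version might look like this; the file names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("train_features.csv")  # hypothetical prepared training file

label = "views"
# SageMaker's CSV convention: label column first, remaining feature columns after it, no header
ordered = df[[label] + [col for col in df.columns if col != label]]
ordered.to_csv("train_sagemaker.csv", index=False, header=False)
```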
From here you can get an idea of your data set, and it shows which feature engineering techniques were applied to it. After completing this section we can move to the next section, Train, where you can train using the prepared data set. In this section too, Navigator gives various options and you have the flexibility to change them. The screenshot below shows the available options and the defaults Navigator suggests for them.
You can run multiple trainings by changing those options and compare the performance of each model; all of these facilities are provided by Navigator. After changing the options, click Train. If you select Automatic from the How would you like to train? option, Navigator will automatically deploy the best-performing model. After training completes you will see the screen below.
Then you can move to the Monitor section in Navigator, which has a lot of services for monitoring your AI Service. There you can get the Endpoint URL and connect this AI Service to any of your applications.
4. Metrics
You can train as many times as you want with different options in Navigator and analyze the performance of each run. The best performance achieved for each of the two data sets is shown below.
According to those evaluations, the best performance is achieved on the data set in which the description is not included. The Random Forest regression algorithm is the best one, with hyperparameter tuning of num_trees: 10 and max_depth: 5.
RandomForestRegressor
A random forest is a meta-estimator (i.e. it combines the results of multiple predictions) which aggregates many decision trees. It operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees. It can handle thousands of input variables without variable deletion. The algorithm's performance depends on the number of decision trees and their depth, and we can do hyperparameter tuning by changing those two factors.
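Navigator handled the training for us, but for anyone who wants to reproduce the idea locally, here is a minimal scikit-learn sketch with equivalent hyperparameters (num_trees and max_depth map to n_estimators and max_depth). The prepared file name and feature columns are assumptions standing in for the engineered data set described earlier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical prepared data set containing the engineered features and the views label
df = pd.read_csv("prepared_without_description.csv")

X = df.drop(columns=["views"])
y = df["views"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# num_trees: 10 and max_depth: 5 from the evaluation above
model = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
```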
5. Integration
After creating our AI Service we need to connect it to an app. For that, we need a form that collects information about the video to be made. We can create a simple web form as shown below.
Then you have to call the Endpoint URL available in Navigator to get the prediction. The code related to the integration and the web form is available on GitHub. Notice that the form takes the publishing date and the prediction date as input; from those two we have to generate the same features we generated in the feature engineering section, so that feature engineering must be repeated here as well.
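The exact request format depends on how Navigator exposes the endpoint, so the sketch below is only illustrative: it repeats the date-based feature engineering on the user's input and posts the features to a placeholder endpoint URL. The payload shape, feature names, and URL are all assumptions.

```python
from datetime import datetime

import requests

# Placeholder; copy the real Endpoint URL from Navigator's Monitor section
ENDPOINT_URL = "https://example.com/your-navigator-endpoint"

def build_features(publish_date: str, prediction_date: str) -> dict:
    """Recreate an illustrative subset of the date-derived features used during training."""
    publish = datetime.fromisoformat(publish_date)
    predict = datetime.fromisoformat(prediction_date)
    return {
        "publish_hour": publish.hour,
        "publish_weekday": publish.weekday(),
        "days_to_trend": (predict - publish).days,
    }

def predict_views(publish_date: str, prediction_date: str, category_id: int):
    payload = {"category_id": category_id, **build_features(publish_date, prediction_date)}
    response = requests.post(ENDPOINT_URL, json=payload)  # payload shape is an assumption
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(predict_views("2018-06-01T10:00:00", "2018-06-08T10:00:00", category_id=24))
```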
The full code for prediction and feature engineering is available here.
6. Future Work
There are many things we can do to increase model performance; they are listed below.
- Since this data set does not contain channel-specific data (subscriber count, etc.), the model's performance is limited. We can add that data and train again.
- A video's view count always depends on the channel's previous videos, so we need to include the number of views received by previously uploaded videos.
- Since we are predicting future views, it would be nice to have a data set from the most recent period, such as 2019 and 2020.
- We can scrape more data directly using the YouTube API and integrate it to create a better data set than the existing one.
- The current data set contains details about trending videos only, so it is not sensitive when predicting small numbers of views. It would be better to scrape data on ordinary videos and include them as well.
Lastly, keep in mind that we used a data set of trending videos to create the model. Because of that, view-count prediction will only perform well for channels targeting trending. Normal channels and fresh channels will not get good performance here, but we can improve the model to that level.
All resources related to this project (data sets, feature engineering code, web UI code, integration) are available here.
References
- https://www.kaggle.com/datasnaek/youtube-new
- https://towardsdatascience.com/youtube-views-predictor-9ec573090acb
- https://developers.google.com/youtube/v3/docs
- https://navigator.pyxeda.ai
- https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f
- https://github.com/YDulanjani/YouTubeViewPredictor/