Detecting spam comments on YouTube using Machine Learning

Aug 30, 2020

Use of Bag of words technique & Random Forest algorithm to identify spam comments

Since you are on this page, I am assuming that you have completed a Machine Learning course and are now looking to put your skills into practice.

Well then, “YouTube spam comment detector” is a great way to start & get your hands dirty.

PRE-REQUISITES

> Familiarity with Python
> Working knowledge of Random Forest algorithm and Bag of words model will be a plus.
In any case, I will explain these terms as we move ahead.

THE DATA SET

The dataset is pretty straightforward. It contains about 2,000 comments from popular YouTube videos. Each row has a comment followed by a value of 1 or 0, for spam or not spam.

The dataset can be downloaded from this website

http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

While on the website, click on the Data Folder directory.
Scrolling down will give a brief description of the dataset.

Next, click on the 2nd directory, i.e., Youtube-Spam-Collection-v1, and extract all the files into a folder.

I recommend storing all the data files and the Python file (.py or .ipynb) in the same folder, so that it is easier to fetch the files while writing the code.

The coding part will be performed in Spyder, although you are free to choose an IDE of your choice as long as it supports Python.

Bag of words

The bag-of-words model does exactly what we want: it converts phrases or sentences into counts of the number of times each word appears. In computer science, a bag refers to a data structure that keeps track of objects like an array or list does, but the order does not matter and, if an object appears more than once, we just keep track of the count rather than repeating it.

Consider the following sentences and try to find what makes the first pair of phrases similar to the second pair:

As you can see, the first phrase from the diagram has a bag of words that contains words such as “channel”, with one occurrence, “plz”, with one occurrence, “subscribe”, with two occurrences, and so on. We then collect all these counts in a vector, with one vector per phrase, sentence, or document, depending on what you are working with. Again, the order in which the words originally appeared doesn’t matter.

Next, we make a larger vector with all the unique words across both phrases, which gives us a proper matrix representation. Each row represents a different phrase; notice the use of 0 to indicate that a phrase doesn’t have a word:

If you want a bag of words for lots of phrases or documents, you need to collect all the unique words that occur across all the examples and create a huge matrix, N x M, where N is the number of examples and M is the number of unique words.
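The N x M matrix described above can be sketched in a few lines of plain Python. The two phrases below are made up for illustration and are not rows from the dataset; the whitespace split is also a simplifying assumption.

```python
from collections import Counter

# Two made-up phrases, used only for illustration (not rows from the dataset).
phrases = [
    "plz subscribe to my channel subscribe",
    "nice video love this song",
]

# One bag (word -> count) per phrase; the word order is discarded.
bags = [Counter(phrase.lower().split()) for phrase in phrases]

# The vocabulary is every unique word across all phrases, in a fixed order.
vocabulary = sorted(set().union(*bags))

# N x M matrix: one row per phrase, one column per vocabulary word.
# A Counter returns 0 for a word that the phrase does not contain.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Here N is 2 (the phrases) and M is the size of the shared vocabulary; a word missing from a phrase simply shows up as 0 in that row.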

Additionally, there are some points we need to take care of before preparing a bag-of-words model:
* Lowercase every word
* Drop punctuation
* Drop very common words (stop words)
* Remove plurals (for example, bunnies => bunny)
* Perform lemmatization (for example, reader => read, reading => read)
* Keep only frequent words (for example, must appear in >10 examples)
* Record binary counts (1 = present, 0 = absent) rather than true counts

Furthermore, if we still want to play down very common words and highlight the rare ones, we need to record the relative importance of each word rather than its raw count. This is known as term frequency-inverse document frequency (TF-IDF), which weighs how often a word appears in a document against how common it is across all the documents.
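To make the idea concrete, here is a minimal sketch of the textbook form of TF-IDF, tf × log(N / df), on three made-up comments. Treat it as an illustration only: scikit-learn uses a smoothed variant of the IDF term and normalizes the rows by default, so its numbers will differ. The helper names `idf` and `tf_idf` are hypothetical.

```python
import math

# Three made-up comments, used only for illustration.
docs = [
    "check out my channel",
    "check out this video",
    "check my new song",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def idf(word):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    doc_freq = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_docs / doc_freq)


def tf_idf(word, tokens):
    """Raw count of the word in one document, scaled by how rare the word is overall."""
    return tokens.count(word) * idf(word)


for word in ("check", "my", "song"):
    print(word, round(tf_idf(word, tokenized[2]), 3))
```

A word such as "check", which appears in every comment, gets a weight of zero, while a word that appears in only one comment gets the highest weight.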

Random Forest Algorithm

Random forests are extensions of decision trees and are a kind of ensemble method.
Ensemble methods can achieve high accuracy by building several classifiers and running each one independently. Once every classifier has made its decision, you can take either the most common decision or the average decision. If we take the most common decision, it is called voting.
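The two ways of combining decisions can be sketched as follows. Both helper functions are hypothetical names written for this illustration, and the 0.5 threshold in the averaging variant is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Voting: return the label that most classifiers predicted."""
    return Counter(predictions).most_common(1)[0][0]


def average_vote(spam_probabilities, threshold=0.5):
    """Averaging: average each classifier's spam probability, then threshold it."""
    mean = sum(spam_probabilities) / len(spam_probabilities)
    return 1 if mean >= threshold else 0


# Three classifiers look at the same comment (1 = spam, 0 = not spam).
print(majority_vote([1, 0, 1]))
print(average_vote([0.9, 0.4, 0.6]))
```

With an odd number of classifiers and two classes, voting can never end in a tie, which is one reason odd ensemble sizes are convenient in small examples.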

Here’s a diagram depicting the ensemble method:

A random forest is a collection or ensemble of decision trees. Each tree is trained on a random subset of the attributes, as shown in the following diagram:

Photo by Abhishek Sharma from Analytics Vidhya

Consider using a random forest when there is a sufficient number of attributes to build trees from and accuracy is paramount. The trade-off is interpretability: a forest of many trees is harder to interpret than a single decision tree.

Preparing our code

Importing the dataset

First, we will import a single dataset. The full dataset is actually split into five different files, one per video. Our first set of comments comes from the PSY Gangnam Style video:

import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube01-Psy.csv', encoding='latin1')

Now, if you are using Spyder, you can check in the ‘Variable Explorer’ tab that a variable ‘df’ has been created and assigned the dataset, similar to the one in the diagram below:

Checking for null values

##checking for all the null values
df.isnull().sum()

Luckily, there are no missing values, so we can proceed ahead.

Look for the category distribution in categorical columns

Let’s look at the count of how many rows in the dataset are spam and how many are not spam

##category distribution
df['CLASS'].value_counts()

The result is 175 spam and 175 not-spam comments, which adds up to 350 rows overall in this file.

Bag of Words Technique

In scikit-learn, the bag of words technique is called ‘CountVectorizer’: it counts how many times each word appears and puts the counts into a vector. To create the vectors, we make a ‘CountVectorizer’ object and then perform the fit and transform in a single call:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
dv = vectorizer.fit_transform(df['CONTENT'])

Under the hood, this is performed in two steps. First comes the fit step, which discovers which words are present in the dataset; second is the transform step, which gives you the bag of words matrix for those phrases.

dv

Image by Author

There are 350 rows, which means we have 350 different comments, and 1,482 columns, one per unique word.

We can ask the vectorizer which words it found in the dataset:

vectorizer.get_feature_names_out()  # on scikit-learn older than 1.0, use get_feature_names()

The resulting list starts with numbers and ends with regular words.

Shuffling the dataset

The following command shuffles the dataset. Setting frac=1 samples 100% of the rows in random order:

dshuf = df.sample(frac = 1)

Splitting the Dataset

We will split the dataset into training and testing sets. Let’s use the first 300 rows for training and the remaining 50 for testing:

dtrain = dshuf[:300]
dtest = dshuf[300:]
dtrain_att = vectorizer.fit_transform(dtrain['CONTENT'])
dtest_att = vectorizer.transform(dtest['CONTENT'])
dtrain_label = dtrain['CLASS']
dtest_label = dtest['CLASS']

In the preceding code, ‘vectorizer.fit_transform(dtrain['CONTENT'])’ is an important step. It performs a fit and transform on the training set, which means it learns the words and also produces the matrix. For the testing set, however, we only call transform and do not fit again, since the test matrix must use the same vocabulary (the same columns) that was learned from the training data.
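A small sketch of what this means in practice, using made-up comments rather than the dataset: a word that only occurs in the test comment has no column, so it is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up comments, used only for illustration.
train_comments = ["subscribe to my channel", "great song"]
test_comments = ["subscribe to my awesome channel"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training comments only.
train_matrix = vectorizer.fit_transform(train_comments)

# transform reuses that vocabulary; "awesome" was never seen, so it has no column.
test_matrix = vectorizer.transform(test_comments)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices end up with the same number of columns, which is what lets a classifier trained on one be applied to the other.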

Building the Random Forest Classifier

We will now build the random forest classifier. We will train a forest of 80 different trees on the training set and then score its performance:

from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators = 80, random_state = 0)
RFC.fit(dtrain_att, dtrain_label)

Image by Author

RFC.score(dtrain_att, dtrain_label)

Image by Author

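The score is 97.33%, which is a really good score. Keep in mind that it was computed on the training data (dtrain_att), so it is an optimistic estimate; the held-out rows in dtest_att give a fairer picture.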
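The score above is measured on the same rows the forest was trained on. A minimal sketch of comparing it with a held-out score follows; the helper `holdout_accuracy` and the toy comments are assumptions made for illustration, not part of the original code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def holdout_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Return (training accuracy, held-out accuracy) for a bag-of-words forest."""
    vectorizer = CountVectorizer()
    train_att = vectorizer.fit_transform(train_texts)  # learn the vocabulary here
    test_att = vectorizer.transform(test_texts)        # reuse it here
    forest = RandomForestClassifier(n_estimators=80, random_state=0)
    forest.fit(train_att, train_labels)
    return forest.score(train_att, train_labels), forest.score(test_att, test_labels)


# Made-up comments (1 = spam, 0 = not spam), far too few for a meaningful result.
train_texts = ["subscribe to my channel", "check out my channel plz", "plz subscribe",
               "love this song", "great video", "this song is great"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["subscribe plz", "love this video"]
test_labels = [1, 0]

train_acc, test_acc = holdout_accuracy(train_texts, train_labels, test_texts, test_labels)
print(train_acc, test_acc)
```

With the variables from this article, the equivalent held-out check would be `RFC.score(dtest_att, dtest_label)`.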

Building a confusion matrix to check the number of correct responses:

from sklearn.metrics import confusion_matrix
y_pred = RFC.predict(dtrain_att)
confusion_matrix(dtrain_label, y_pred)  # true labels first, then predictions

As you can see, we have a total of 292 correct predictions out of the 300 training comments. That’s pretty good accuracy.

We need to be sure that the accuracy remains high on data the model has not seen; for that, we will perform a cross-validation with five different splits.

Cross-Validation

To perform a cross-validation, we use all the training data and let scikit-learn split it into five groups of 20% each. In each round, one group (20%) is the testing data and the remaining 80% is the training data:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(RFC, dtrain_att, dtrain_label, cv=5)
scores.mean()

Averaging the five scores that we just obtained gives an accuracy of 88.66%.
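To see what the five splits look like for 300 training rows, here is a small sketch. It assumes a balanced set of placeholder labels and uses StratifiedKFold, which is the splitter scikit-learn picks when `cv` is an integer and the estimator is a classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: 300 rows with balanced labels (an assumption for illustration).
labels = np.array([0, 1] * 150)
rows = np.zeros((len(labels), 1))

splitter = StratifiedKFold(n_splits=5)

# For each round, record how many rows are used for training and for testing.
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in splitter.split(rows, labels)
]
print(fold_sizes)
```

Each round trains on 80% of the rows and tests on the other 20%, and every row is used for testing exactly once.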

Loading all the datasets

Now, we will load all the datasets and combine them:

df = pd.concat([
 pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube01-Psy.csv', encoding='latin1'),
 pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube02-KatyPerry.csv', encoding='latin1'),
 pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube03-LMFAO.csv', encoding='latin1'),
 pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube04-Eminem.csv', encoding='latin1'),
 pd.read_csv('D:\\Machine Learning_Algoritms\\Youtube-Spam-Check\\Youtube05-Shakira.csv', encoding='latin1')])

The entire dataset has comments from five different videos, which together give us around 2,000 rows. Counting the classes, we find 1,005 spam comments and 951 not-spam comments, which is close enough to an even split:

df['CLASS'].value_counts()

Further, we will shuffle the entire dataset and separate the comments and the answers:

dshuf = df.sample(frac=1)
d_content = dshuf['CONTENT']
d_label = dshuf['CLASS']

We need to perform a couple of steps here: ‘CountVectorizer’ followed by the random forest. For this, we will use a feature in scikit-learn called a Pipeline. A Pipeline is really convenient: it brings together two or more steps so that all of them are treated as one. So, we will build a pipeline with the bag of words ‘CountVectorizer’ followed by the random forest classifier, and then display the steps it contains:

from sklearn.pipeline import Pipeline, make_pipeline
pl = Pipeline([
 ('bag of words: ', CountVectorizer()),
 ('Random Forest Classifier:', RandomForestClassifier())])
make_pipeline(CountVectorizer(), RandomForestClassifier())

Alternatively, we can let ‘make_pipeline’ name each step by itself: passing it a ‘CountVectorizer’ and a ‘RandomForestClassifier’ names the steps ‘countvectorizer’ and ‘randomforestclassifier’.
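A quick way to confirm the automatic names is to look at the `steps` attribute of the pipeline. The variable names `auto_named` and `step_names` are chosen just for this sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# make_pipeline derives each step name from the lowercased class name.
auto_named = make_pipeline(CountVectorizer(), RandomForestClassifier())

step_names = [name for name, _ in auto_named.steps]
print(step_names)

# The step name plus a double underscore addresses a parameter of that step.
print("randomforestclassifier__n_estimators" in auto_named.get_params())
```

These generated names are the prefixes used later when listing parameters for a parameter search.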

Once the pipeline is created, you can just call fit on it and it will do the rest: first it performs the fit and transform with the ‘CountVectorizer’, followed by a fit with the ‘RandomForestClassifier’. That’s the benefit of having a pipeline:

pl.fit(d_content[:1500],d_label[:1500])

Now you call score; the pipeline knows that scoring means running the comments through the bag of words ‘CountVectorizer’ and then predicting with the ‘RandomForestClassifier’:

pl.score(d_content[:1500],d_label[:1500])

Image by Author

This whole procedure will produce a score of about 98.3%. Note that these are the same 1,500 rows the pipeline was fitted on; the remaining rows, d_content[1500:], are the ones it has not seen. We can also predict a single example with the pipeline. For example, imagine we have a new comment after the model has been trained, and we want to know whether it is a genuine comment or spam:

pl.predict(["What a nice video"])

Image by Author

As you can see, it has detected correctly.

pl.predict(["Plz subscribe my channel"])

Image by Author

Next, we will run a cross-validation on the whole pipeline to estimate how accurate it is:

scores = cross_val_score(pl, d_content, d_label, cv=5)
scores.mean()

Image by Author

In this case, we find that the average accuracy was about 89.3%.
It’s pretty good. Now let’s add TF-IDF to our model to make it more precise.

If you are not familiar with TF-IDF, here is a brief description of it:

TF-IDF

The TfidfVectorizer is a scikit-learn class that tokenizes documents, learns the vocabulary and inverse document frequency weightings, and lets you encode new documents. Since our pipeline already contains a ‘CountVectorizer’, we will use the related TfidfTransformer, which applies the same weighting to an existing matrix of counts.
TF-IDF evaluates how relevant a word is to a document in a collection of documents, which is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
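The relationship between the two classes can be checked directly: with default settings, a ‘CountVectorizer’ followed by a TfidfTransformer produces the same matrix as a TfidfVectorizer on its own. The comments below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Made-up comments, used only for illustration.
comments = [
    "check out my channel",
    "check out this video",
    "love this song so much",
]

# Two steps: raw counts first, then TF-IDF weighting of those counts.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(comments)

# One step: TfidfVectorizer does both internally.
one_step = TfidfVectorizer().fit_transform(comments)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate, as this article does, makes it possible to switch the IDF weighting on and off without touching the counting step.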

Adding TF-IDF to our model:

from sklearn.feature_extraction.text import TfidfTransformer
pl_2 = make_pipeline(CountVectorizer(), TfidfTransformer(norm=None),
 RandomForestClassifier())

The TfidfTransformer is placed after ‘CountVectorizer’: once the counts have been produced, it converts them into TF-IDF scores. Now we will run another cross-validation check on this pipeline, using the same accuracy measure:

scores = cross_val_score(pl_2, d_content, d_label, cv=5)
scores.mean()

Image by Author

On adding TF-IDF, we receive an accuracy of 95.6% for our model.

There’s another feature of scikit-learn that tries out combinations of parameter values and finds out which combination is the best:

parameters = {
 'countvectorizer__max_features': (1000, 2000),
 'countvectorizer__ngram_range': ((1, 1), (1, 2)),
 'countvectorizer__stop_words': ('english', None),
 'tfidftransformer__use_idf': (True, False),
 'randomforestclassifier__n_estimators': (10, 30, 50)
 }

We make a small dictionary whose keys combine the name of the pipeline step and the name of the parameter, separated by a double underscore, and whose values are the options to try. For demonstration, we limit the vocabulary to a maximum of 1,000 or 2,000 words. With ‘ngram_range’, we use either single words only or single words plus pairs of words. With ‘stop_words’, we either use the English list of stop words, which means common words are removed, or None, which means they are kept. With ‘use_idf’, we state whether the IDF weighting is applied (True) or not (False). The random forest uses 10, 30, or 50 trees. Using this, we can perform a grid search, which runs through all of the combinations of parameters and finds out what the best combination is. So, we give the grid search our pipeline number 2 (pl_2), which has the TF-IDF step in it.
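The line that creates `grid_search` does not appear in the text, so the following is only a sketch of how it is typically built with scikit-learn's GridSearchCV. The `cv=5` setting, the `random_state`, the reduced grid, and the toy comments are all assumptions chosen so that the example runs quickly on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Same shape as pl_2 in this article; random_state is added for repeatability.
pl_2 = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(norm=None),
    RandomForestClassifier(random_state=0),
)

# A reduced grid so the sketch runs fast; the full `parameters`
# dictionary from the article would be passed in exactly the same way.
small_grid = {
    "tfidftransformer__use_idf": (True, False),
    "randomforestclassifier__n_estimators": (10, 30),
}

# Assumption: 5-fold cross-validation with the default accuracy scoring.
grid_search = GridSearchCV(pl_2, small_grid, cv=5)

# Toy stand-ins for d_content and d_label (1 = spam, 0 = not spam).
toy_content = [
    "subscribe to my channel plz", "check out my channel",
    "free gift cards click here", "plz subscribe and share",
    "visit my website for free music", "love this song",
    "great video", "this never gets old",
    "best music video ever", "the dance is so funny",
]
toy_label = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

grid_search.fit(toy_content, toy_label)
print(grid_search.best_params_)
```

With the real data, the call would presumably be `GridSearchCV(pl_2, parameters)` followed by the fit on `d_content` and `d_label`.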

To check the list of available parameters in grid_search:

grid_search.estimator.get_params()

We will use ‘fit’ to perform the search:

grid_search.fit(d_content, d_label)

Since there is a large number of words, it takes a little while, around 50 seconds (in my case), and ultimately finds the best parameters. We can get the best parameters out of the grid search and print them along with the best score:

print("Best accuracy:", grid_search.best_score_)
print("Best parameters: ")
best_params = grid_search.best_estimator_.get_params()
for name in sorted(best_params.keys()):
    print('{} : {}'.format(name, best_params[name]))

So, we got nearly 96% accuracy. The best combination used a maximum of 1,000 words, single words only, removed the stop words, had 30 trees in the random forest, and kept the IDF weighting in the TF-IDF computation. Here we’ve demonstrated not only bag of words, TF-IDF, and random forest, but also the pipeline feature and the parameter search feature known as grid search.

If you want to look at my other projects, here is the GitHub repository:
https://github.com/AkshayLaddha943

Written by Akshay Laddha