YouTube Comments Spam Classifier

MR. BUCK MAN
4 min read · May 11, 2019


Looking at the image (image source: YouTube), you might have easily guessed the topic: spam, or to be precise, spam in YouTube comments.

This article is about how we can use machine learning algorithms to classify YouTube comments as spam or ham (not spam).

Let’s take up a case study and dive deeper for a better understanding.

Note: all the code is in Python, and you can find the link to the GitHub repository at the end.

The dataset consists of comments made on top trending videos, stored in five CSV files. It was downloaded from the UCI Machine Learning Repository. The main objective is to classify each comment as spam or not spam. Let’s begin…

The very first step is to import the necessary Python modules.
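The imports for a walkthrough like this one might look as follows. This is only a sketch: the exact list is an assumption, inferred from the steps described in the rest of the article (pandas for the data, scikit-learn for splitting, text features, the model and the metrics).

```python
# Data handling
import pandas as pd

# Splitting, text features, model and metrics from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
```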

Now we read all the CSV files with the pandas read_csv function. Here I named the resulting dataframes A, B, C, D and E for ease.
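A minimal sketch of the reading step is shown here. To keep it runnable anywhere, it reads two small in-memory stand-ins instead of the real files; the sample rows are made up for illustration, and the column names follow the spelling used in this article (the capitalisation in the actual CSV files may differ).

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the five files, so the sketch runs anywhere.
# With the real data you would pass each file's path to pd.read_csv instead.
csv_a = io.StringIO(
    "comment_id,author,date,content,class\n"
    "c1,ann,2015-01-01,Love this song,0\n"
    "c2,bob,2015-01-02,Check out my channel,1\n"
)
csv_b = io.StringIO(
    "comment_id,author,date,content,class\n"
    "c3,cat,2015-02-01,Subscribe to me for free gifts,1\n"
)

# One dataframe per file, named as in the article
A = pd.read_csv(csv_a)
B = pd.read_csv(csv_b)

print(A.shape, B.shape)
```

The same call repeated for the remaining files would give C, D and E.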

Now we need to combine all the CSV files into a single dataframe. In Python we can do this with the pandas concat function.
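Combining the dataframes could look like the sketch below. Only two tiny made-up dataframes are used here; passing `ignore_index=True` is an assumption on my part, chosen so the combined dataframe gets one clean running index.

```python
import pandas as pd

# Made-up stand-ins for the per-file dataframes
A = pd.DataFrame({"content": ["Love this song", "Check out my channel"], "class": [0, 1]})
B = pd.DataFrame({"content": ["Subscribe to me for free gifts"], "class": [1]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each file's own index
df = pd.concat([A, B], ignore_index=True)

print(df)
```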

Now we shall take a peek at the data to get to know it.

Let us check the length of the combined dataset now. Checking it shows that there are 1956 records.
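Peeking at the data and checking its length can be sketched as below. The dataframe here is a five-row made-up example, so it reports 5 records rather than the 1956 of the real dataset.

```python
import pandas as pd

# Made-up stand-in for the combined dataframe
df = pd.DataFrame({
    "content": [
        "Love this song",
        "Check out my channel",
        "Subscribe to me for free gifts",
        "Best video ever",
        "Click this link to win money",
    ],
    "class": [0, 1, 1, 0, 1],
})

# First few rows, to see what a record looks like
print(df.head(3))

# How many comments fall into each class (1 = spam, 0 = ham)
print(df["class"].value_counts())

# Number of records
n_records = len(df)
print(n_records)
```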

Let’s get into the cleaning. We need to remove the unnecessary columns from the dataset, so we drop “comment_id”, “author” and “date”.
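Dropping the three columns might be written like this. The row values are invented for illustration; the column names are the ones given above.

```python
import pandas as pd

# Made-up stand-in with the same columns as described in the article
df = pd.DataFrame({
    "comment_id": ["c1", "c2"],
    "author": ["ann", "bob"],
    "date": ["2015-01-01", "2015-01-02"],
    "content": ["Love this song", "Check out my channel"],
    "class": [0, 1],
})

# Keep only the comment text and its label
df = df.drop(columns=["comment_id", "author", "date"])

print(list(df.columns))
```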

Next we look at the details of the dataset. We see that it now contains only the “content” and “class” columns.
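One way to inspect the cleaned dataframe is sketched here, again on made-up rows. Using `info()` and a missing-value count is my assumption about what "details" covers.

```python
import pandas as pd

# Made-up stand-in for the cleaned dataframe
df = pd.DataFrame({
    "content": ["Love this song", "Check out my channel", "Best video ever"],
    "class": [0, 1, 0],
})

# Column names, dtypes and non-null counts
df.info()

# Missing values per column
missing = df.isnull().sum()
print(missing)
```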

Now let’s assign the features and labels: the “content” column holds the features and the “class” column holds the labels.
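In code, the assignment is two lines. The names `X` and `y` are assumed here, following the usual scikit-learn convention, and the data is made up.

```python
import pandas as pd

# Made-up stand-in for the cleaned dataframe
df = pd.DataFrame({
    "content": ["Love this song", "Check out my channel", "Best video ever"],
    "class": [0, 1, 0],
})

X = df["content"]  # features: the raw comment text
y = df["class"]    # labels: 1 for spam, 0 for ham

print(X.shape, y.shape)
```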

After assigning them, it is time to split the data into training and test sets. We can do this with the train_test_split function from sklearn.
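A sketch of the split follows. The 80/20 ratio, the fixed `random_state` and the `stratify` argument are assumptions made for this example, since the article does not state which settings were used; the ten comments are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: five spam and five ham comments
df = pd.DataFrame({
    "content": [
        "Check out my channel", "Subscribe to me for free gifts",
        "Click this link to win money", "Visit my page for free views",
        "Earn cash fast on this site",
        "Love this song", "Best video ever", "This beat is amazing",
        "Still listening in 2015", "Her voice is so good",
    ],
    "class": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
X, y = df["content"], df["class"]

# Hold out 20% of the comments for testing (assumed ratio).
# stratify=y keeps the spam/ham proportion similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```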

After splitting the data into training and test sets, we need to use CountVectorizer in order to convert the collection of text into a matrix of token counts. This will help in the later computation. We fit and transform the X_train dataset.
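The vectorizing step can be sketched as below, on a few made-up comments. The vectorizer learns its vocabulary from the training text only; the test text is then transformed with that same vocabulary, so words it has never seen are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the training and test comments
X_train = ["Check out my channel", "Love this song", "Free gifts on my channel"]
X_test = ["Love my new channel"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training comments and count tokens in them
X_train_counts = vectorizer.fit_transform(X_train)

# Reuse the learned vocabulary for the test comments (no fitting here)
X_test_counts = vectorizer.transform(X_test)

print(X_train_counts.shape, X_test_counts.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of the resulting sparse matrix is one comment, and each column is one word of the vocabulary.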

Now comes the interesting part, where we weight the words by how frequently they occur in the dataset. To compute these word frequencies we use two methods, TF and TF-IDF.

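TF stands for Term Frequency: how often a word occurs in a comment.

TF-IDF stands for Term Frequency-Inverse Document Frequency: the term frequency, scaled down for words that occur in many comments.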
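To make the difference concrete, here is a small sketch using scikit-learn's TfidfTransformer on three made-up comments. Using this particular class is an assumption; note also that with `use_idf=False` it returns length-normalised counts rather than plain relative frequencies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# "check" appears in every comment, the other words in only one each
comments = ["check my channel", "check this song", "check out video"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(comments)

# TF only: token counts, normalised so each row has unit length
tf = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# TF-IDF: the same counts, reweighted so that words found in many
# comments count for less
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

check = vectorizer.vocabulary_["check"]
my = vectorizer.vocabulary_["my"]

print("TF:    ", tf[0, check], tf[0, my])
print("TF-IDF:", tfidf[0, check], tfidf[0, my])
```

In the first comment, "check" and "my" get the same TF weight, but under TF-IDF the common word "check" ends up with a lower weight than the rarer word "my".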

Now comes the modeling part. In this case we use a random forest for classification. After the model is trained, we calculate its accuracy and look at the results.
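Putting the pieces together, training and scoring a random forest might look like the sketch below. The comments are made up, and the hyperparameters (`n_estimators=100`, `random_state=42`), the split ratio and the choice to train on TF-IDF features are all assumptions, so the score it prints says nothing about the real dataset.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Made-up comments: 1 = spam, 0 = ham
spam = [
    "Check out my channel", "Subscribe to my channel for free gifts",
    "Click this link to win money", "Visit my page for free views",
    "Earn cash fast click here", "Free gift cards on my channel",
    "Please subscribe to my channel", "Win free money at this link",
]
ham = [
    "Love this song", "Best video ever", "This beat is amazing",
    "Still listening to this song", "Her voice is so good",
    "The dance in this video is great", "I love this song so much",
    "This song never gets old",
]
X = spam + ham
y = [1] * len(spam) + [0] * len(ham)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Text -> token counts -> TF-IDF weights, fitted on the training part only
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(vectorizer.fit_transform(X_train))
X_test_tfidf = tfidf.transform(vectorizer.transform(X_test))

# Train the random forest and predict labels for the held-out comments
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
```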

We solved the problem with an F1 score of 95%.
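For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it with scikit-learn on a small set of made-up labels and predictions; these are not the article's results.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions (1 = spam, 0 = ham)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# 3 spam comments caught, 1 ham comment wrongly flagged, 1 spam comment missed
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)

# F1 = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1)
```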

If you wish, you can try different algorithms to improve the model, or create a web app around it using Flask.

Thanks for reading. You can find the link to the GitHub repository here; it contains all the source files and the Python notebook.
