Building a spam messages detector using machine learning

Anchit Jain
Published in Data Science 101
4 min read · May 21, 2018

Over the past few weeks, I’ve been writing some of the most common machine learning algorithms from scratch here.

Now let’s apply them to solve some real-world problems.

Let’s start by creating a simple Spam Messages Detector using Logistic Regression. Logistic regression is a simple method for classifying data, be it hot or not hot, spam or not spam.

There are a few Python libraries, like scikit-learn, that simply ask you to feed in the input and then perform the entire classification on their own. But to get a firm understanding of what happens under the hood, in my last blog I tried to explain the working of logistic regression using the iris data set. So before you scroll down, please read how to implement logistic regression from scratch here.

Assuming that you are no longer a novice at logistic regression, we will begin with the data set. The data set we have used is the SMS Spam Collection Dataset from Kaggle. Our training data has two columns (type and text).

Before we start coding, let us first analyse the data set. The first column, type, labels each message as spam or ham (not spam). A look at the data shows that the text column holds strings, so we first need to vectorize each string into a numerical list, which gives a list of lists for the whole column. Basically, this gives every message a vector representation built from its words.

What the hell is vectorization?

Most machine learning algorithms (read: all), like logistic regression, understand only numbers. We need to figure out a way to convert a text sentence into numbers.

Vectorization is a simple process of converting a text sentence into an array of numbers.

To do this, we create a dictionary with every unique word from the text column as a key and an index, counting from zero, as its value. Then we iterate over each value in the text column (which is an array) and fetch the position (value) of every matched word (key) from the dictionary. For each message we create another array of zeros and update the positions of this array that correspond to the matched keys from our dictionary.

For example, say I have the string “Hello I am Anchit, and I love machine learning”. Its unique words make up our dictionary, and the vectorized form of an input such as “Anchit loves to travel” is an array with one position per dictionary word, in which only the positions of matched words (here “Anchit”) are updated.

The Python implementation is much easier.

Now we have the vectorized input in hand. Using this vector as input, we will work through the following steps to predict the output from our model.
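Here is a minimal sketch of the vectorization described above, not the code from my repository. The function names (`tokenize`, `build_vocabulary`, `vectorize`) are hypothetical, and two choices are assumptions: words are lowercased with punctuation stripped, and a matched position is set to 1 rather than counted.

```python
import re


def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts):
    # Map every unique word to an index, counting from zero in order of first appearance.
    vocabulary = {}
    for text in texts:
        for word in tokenize(text):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(text, vocabulary):
    # Start from an array of zeros and set a 1 at the index of every matched word.
    vector = [0] * len(vocabulary)
    for word in tokenize(text):
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = 1
    return vector


vocabulary = build_vocabulary(["Hello I am Anchit, and I love machine learning"])
vector = vectorize("Anchit loves to travel", vocabulary)
print(vocabulary)
print(vector)
```

Notice that “loves” does not match “love” here, so only the position for “anchit” is set. Words missing from the dictionary are simply ignored.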

1. Creating matrices :

This should be pretty routine by now. We concatenate an array of ones to our matrix X, so that one of the weights in theta can act as the intercept term.
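As a rough sketch of this step, assuming X is a NumPy array with one row per message: the helper name `add_intercept` is hypothetical, and placing the ones in the first column is my assumption for this example.

```python
import numpy as np


def add_intercept(X):
    # Build a column of ones with one entry per row of X.
    ones = np.ones((X.shape[0], 1))
    # Put the ones in front of the features (assumed position).
    return np.concatenate((ones, X), axis=1)


X = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
X_with_intercept = add_intercept(X)
print(X_with_intercept.shape)
```

After this step X has one extra column, so theta needs one extra weight as well.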

2. Defining Sigmoid and loss function :

Speaking theoretically, “sigmoid” is a function that takes any input and turns it into a probability: logistic regression uses it because it gives outputs between 0 and 1 for all values of X. We have another function, named the “loss” or “cost” function, that measures how far those probabilities are from the true labels. It depends on the parameters, or weights: we initialize them randomly, keep updating them in steps scaled by our learning rate, and use gradient descent to minimize the cost.
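The two functions can be sketched like this. I am assuming the usual binary cross-entropy (log loss) as the cost, and the `eps` clipping is an extra safeguard I added so that `log(0)` never occurs.

```python
import numpy as np


def sigmoid(z):
    # Squash any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def loss(h, y, eps=1e-12):
    # h: predicted probabilities, y: true labels (0 or 1).
    # Clip h away from exactly 0 and 1 so the logarithms stay finite (assumed safeguard).
    h = np.clip(h, eps, 1 - eps)
    # Mean binary cross-entropy over all samples.
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))


print(sigmoid(0.0))
print(loss(np.array([0.5, 0.5]), np.array([1.0, 0.0])))
```

A model that outputs 0.5 for everything has a loss of ln 2, roughly 0.693, so anything below that means the model has learned something.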

3. Gradient descent :

Now, to understand gradient descent, let us imagine the path of a river originating from the top of a mountain. The job of gradient descent is exactly what the river aims to achieve: to reach the bottom-most point of the mountain. As we know, there is a gravitational force on the earth, and hence the river will keep flowing downwards until it reaches the foothill.

In a similar way, we will evaluate our sigmoid function many times, probably thousands, and each time use its output in the loss function to update the weights, so that the loss gets smaller with every step.
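The loop can be sketched as follows. This is an illustration under assumptions, not the exact code from my repository: `fit` is a hypothetical name, the weights start at zero instead of random values so the run is repeatable, and the learning rate, iteration count, and tiny data set are made up.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(h, y, eps=1e-12):
    # Mean binary cross-entropy, with clipping to keep the logarithms finite.
    h = np.clip(h, eps, 1 - eps)
    return float(np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)))


def fit(X, y, learning_rate=0.1, iterations=1000):
    # Assumption: start from zeros rather than random weights.
    theta = np.zeros(X.shape[1])
    losses = []
    for _ in range(iterations):
        h = sigmoid(X @ theta.T)            # current predicted probabilities
        losses.append(loss(h, y))           # track the loss before each update
        gradient = X.T @ (h - y) / y.size   # gradient of the mean cross-entropy
        theta = theta - learning_rate * gradient  # one step downhill
    return theta, losses


# Made-up data: the first column is the intercept column of ones.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, losses = fit(X, y)
print(losses[0], losses[-1])
```

Printing the first and last loss is a quick way to check that the river is actually flowing downhill.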

4. Prediction :

This is probably the most crucial part, where we predict with our model. Here we need to understand our prediction: since we have a set of inputs (X) and thetas (self.theta), we just need to focus on `X @ theta.T`, which is a matrix multiplication of X and theta. It does not matter how many columns there are in X or theta; as long as theta and X have the same number of columns, the code will work. We pass the result through the sigmoid function and set a threshold variable at 0.5: values greater than 0.5 are classified as class 1 and values less than 0.5 as class 0.
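A small sketch of the prediction step, written as plain functions instead of methods on a class. The names `predict_probability` and `predict` and the weights are made up for illustration, and sending a value of exactly 0.5 to class 1 is my assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_probability(X, theta):
    # X has one row per message, theta has one weight per column of X.
    return sigmoid(X @ theta.T)


def predict(X, theta, threshold=0.5):
    # Class 1 at or above the threshold, class 0 below it (tie handling is assumed).
    return (predict_probability(X, theta) >= threshold).astype(int)


theta = np.array([-1.5, 1.0])             # made-up weights: intercept and one feature
X = np.array([[1.0, 0.0], [1.0, 3.0]])    # first column is the intercept column
labels = predict(X, theta)
print(labels)
```

The first row gets a score of -1.5 and lands in class 0, the second gets 1.5 and lands in class 1.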

5. Evaluation :

Last but not least, it is always important to know the accuracy of our model. We can calculate the accuracy percentage by passing the text and type columns to a function that predicts a new type (the predicted type) for each message and then computes the percentage of predictions that match the actual type.
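The percentage itself is a short calculation. In this sketch the function name `accuracy_percentage` is hypothetical and the labels are made up. It takes the actual and predicted types directly, so the prediction step happens before the call.

```python
def accuracy_percentage(actual, predicted):
    # Both lists must line up one to one.
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    # Count the positions where the predicted type matches the actual type.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


actual = ["ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "spam"]
score = accuracy_percentage(actual, predicted)
print(score)
```

Three of the four made-up predictions match, so the score here is 75.0.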

By the time you are reading this line, I hope the process of spam classification is clear to you. You can access the full code from here.

A few things to note about the limitations of my code:

  1. My code fails to understand the similarity of words like man and men, bed and beds, etc. This can be solved using stemming and lemmatization.
  2. The classification accuracy may be improved by using more sophisticated algorithms like decision trees or neural networks.
  3. Word embeddings such as Word2Vec may also improve accuracy.

Thank you for your patience……Claps (Echoing)

Anchit Jain
Data Science 101

Machine learning engineer. Loves to work on Deep learning based Image Recognition and NLP. Writing to share because I was inspired when others did.