Email Spam Classifier Using Naive Bayes

Shubham Kumar Raj
Published in Analytics Vidhya
8 min read, May 17, 2020

If you want to learn the basics of machine learning, I have written a post on the topic in very simple language, with real-world examples and easy explanations of the main terms and types of classification. After reading it, you should be able to explain the basics of machine learning to anyone.

Here is the link:-

https://medium.com/@143jshubham/machine-learning-and-its-impact-on-our-generation-4c0dbc201c1a

A Little Introduction to the Project:-

This information is gathered from different sources:-

Spam email has become a big problem on the internet. Spam wastes time, storage space, and communication bandwidth, and the problem has been growing for years. According to recent statistics, 40% of all emails are spam, which amounts to about 15.4 billion emails per day and costs internet users about $355 million per year. Knowledge engineering and machine learning are the two general approaches used in email filtering. In the knowledge engineering approach, a set of rules has to be specified according to which emails are categorized as spam or ham.

The machine learning approach is more efficient than the knowledge engineering approach because it does not require specifying any rules. Instead, it uses a set of training samples, which are pre-classified email messages. A specific algorithm then learns the classification rules from these messages. The machine learning approach has been widely studied, and many algorithms can be used for email filtering, including Naive Bayes, support vector machines, neural networks, k-nearest neighbours, rough sets, and artificial immune systems.

Why We Are Using Naive Bayes as the Algorithm for Filtering Email:-

Naive Bayes is built on Bayes' theorem, which deals with conditional (dependent) events: the probability of an event occurring in the future can be estimated from previous occurrences of the same event. This technique can be used to classify spam emails, and word probabilities play the main role. If some words occur often in spam but not in ham, then an incoming email containing them is probably spam. The Naive Bayes classifier has become a very popular method for email filtering. Every word has a certain probability of occurring in spam or ham email in the filter's database. If the combined word probabilities exceed a certain limit, the filter marks the email as spam; otherwise it marks it as ham. Here, only two categories are necessary: spam or ham.

Here are some calculations that help to explain how it works.

The statistic we are most interested in for a token T is its spamminess (spam rating), calculated as follows:-

S[T] = CSpam(T) / (CSpam(T) + CHam(T))

where CSpam(T) and CHam(T) are the number of spam and ham messages containing token T, respectively. To calculate the probability for a message M with tokens {T1, ..., TN}, one needs to combine the individual tokens' spamminess to evaluate the overall message spamminess. A simple way to classify is to calculate the product of the individual tokens' spamminess

S[M] = Π S[T]

and compare it with the product of the individual tokens' hamminess

H[M] = Π (1 - S[T])

The message is considered spam if the overall spamminess product S[M] is larger than the hamminess product H[M].
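
To make the calculation concrete, here is a minimal sketch in Python. It assumes the spam rating is the ratio S[T] = CSpam(T) / (CSpam(T) + CHam(T)). The function names, the example counts, and the choice of 0.5 for tokens never seen before are my own assumptions for illustration, not part of the project code.

```python
from math import prod

def token_spamminess(token, spam_counts, ham_counts):
    """S[T] = CSpam(T) / (CSpam(T) + CHam(T))."""
    c_spam = spam_counts.get(token, 0)
    c_ham = ham_counts.get(token, 0)
    if c_spam + c_ham == 0:
        return 0.5  # assumption: an unseen token gives no evidence either way
    return c_spam / (c_spam + c_ham)

def is_spam(tokens, spam_counts, ham_counts):
    """Compare S[M] = product of S[T] with H[M] = product of (1 - S[T])."""
    scores = [token_spamminess(t, spam_counts, ham_counts) for t in tokens]
    s_m = prod(scores)
    h_m = prod(1 - s for s in scores)
    return s_m > h_m

# Hypothetical counts: number of spam / ham messages containing each token.
spam_counts = {"free": 40, "offer": 30, "meeting": 2}
ham_counts = {"free": 10, "offer": 10, "meeting": 38}

print(is_spam(["free", "offer"], spam_counts, ham_counts))  # True
print(is_spam(["meeting"], spam_counts, ham_counts))        # False
```

For the first message, S[M] = 0.8 x 0.75 = 0.6 and H[M] = 0.2 x 0.25 = 0.05, so it is marked as spam.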

Supervised machine learning algorithms like this one work in two stages:-

  1. Training Stage.
  2. Testing Stage.

In the training stage, Naive Bayes creates a lookup table that stores all the probabilities the algorithm will need for predicting the result.

In the testing stage, suppose you give the algorithm a test point to predict. It fetches the relevant values from the lookup table and uses them to predict the result.
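
The two stages can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the functions train and predict, the toy messages, and the add-one (Laplace) smoothing are not taken from the project, and scikit-learn's implementation differs in its details.

```python
from collections import Counter, defaultdict
from math import log

def train(messages):
    """Training stage: build the lookup table of probabilities.

    messages is a list of (list_of_words, label) pairs.
    """
    class_counts = Counter(label for _, label in messages)
    word_counts = defaultdict(Counter)
    for words, label in messages:
        word_counts[label].update(words)
    vocab = {w for words, _ in messages for w in words}

    table = {"prior": {}, "likelihood": {}}
    for label, n in class_counts.items():
        table["prior"][label] = n / len(messages)
        total = sum(word_counts[label].values())
        # Add-one smoothing so that no word gets probability zero.
        table["likelihood"][label] = {
            w: (word_counts[label][w] + 1) / (total + len(vocab)) for w in vocab
        }
    return table

def predict(table, words):
    """Testing stage: look values up, add their logs, pick the best class."""
    scores = {}
    for label, prior in table["prior"].items():
        score = log(prior)
        for w in words:
            if w in table["likelihood"][label]:  # skip words never seen in training
                score += log(table["likelihood"][label][w])
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    (["win", "free", "prize"], "spam"),
    (["free", "offer", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]
table = train(training)
print(predict(table, ["free", "prize"]))        # spam
print(predict(table, ["meeting", "tomorrow"]))  # ham
```

Notice that predict never looks at the training messages again. Everything it needs is already stored in the table.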

Now Our Main Work on the Email Spam Classifier Starts:-

First, let me make clear that we have a folder named “e-mail” containing 5172 files. Each file is one email, and each email is marked as either spam or ham.

Our first target is to make a list of all the words used in those 5172 emails. This takes a few steps:

  1. Load the “e-mail” folder in Jupyter Notebook with the help of the os module. Each file in it is one email.
import os
folder='Desktop/e-mail/'
files=os.listdir(folder)
emails=[folder+file for file in files]
  2. Open each file with f=open(email).
    Given the path of a file, open() opens that file for reading.
  3. Read the file.
    f.read() reads the whole content of the email file and returns it as a string.
  4. Split the content on spaces (" ") and add the pieces to the list.
words=[]
for email in emails:
    # latin-1 is used so that unusual characters do not raise decoding errors
    f=open(email,encoding='latin-1')
    blob=f.read()
    f.close()
    words+=blob.split(" ")

At this point we have a list, words, containing every word used in the 5172 emails. But we do not yet know how many times each word occurs. To find out, we import Counter from collections:
from collections import Counter
Passing the word list to Counter builds a dictionary that maps each word to the number of times it occurs.
word_dict=Counter(words)

Now word_dict stores how many times each word occurs, but we do not use all of these words because doing so may reduce the accuracy of our algorithm. Instead we keep the 3000 most common words. You may pick another number, such as 2500, but here I take the top 3000. Counter has a method for finding the most common words:
word_dict=word_dict.most_common(3000)

word_dict is now a list of (word, count) pairs, in which the first element is the word and the second is the number of times it occurs.
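
Here is a tiny example of what Counter and most_common return. The sentence and the variable names word_counts and top_words are made up for illustration.

```python
from collections import Counter

words = "the offer is free the offer ends the day".split(" ")
word_counts = Counter(words)            # behaves like a dict: word -> count
top_words = word_counts.most_common(2)  # list of (word, count) pairs, not a dict

print(top_words)                        # [('the', 3), ('offer', 2)]
print([word for word, _ in top_words])  # ['the', 'offer']
```

Because most_common returns a list of pairs, later code has to unpack each pair to get at the word itself.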

Now comes the most important part of this email spam classifier:-

As we know, training requires the data in rows and columns, so we are going to build a table. Each row is one email, each column is one of the words in word_dict, and each value is an integer giving the number of times that word occurs in that email. The table therefore has shape (5172 x 3000): 5172 emails by the 3000 most common words stored in word_dict.

In the table we want to build, the rows are emails and the columns are the words in word_dict. In the first email (Email-1), for example, the word “The” occurs 2 times and “To” occurs 3 times. The last column is the result, which takes two values, 0 and 1, showing whether the given email is spam or not: 1 means spam and 0 means not spam.

Now it is time to build this table with code. Read it through and try to understand it:-

Let me explain how the code builds the table:-
First we create two empty lists, label and features. Then we take the emails one by one: we open each file with f=open(email,encoding='latin-1'), read it, split it on spaces (" "), and store the result in blob. Next we take each word from word_dict (the, to, for, and so on) and count its occurrences in blob, appending each count to a data list (data=[]). The inner for-loop runs 3000 times, so 3000 elements are appended to data. Finally, data is appended to the features list (features=[]), so features ends up holding 5172 data lists, one per email. At the end we set the label: if the word “spam” is in the email, 1 is appended to label (label=[]); if the word “ham” is in it, 0 is appended.
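
Here is a minimal sketch of the steps just described. Treat it as an assumption rather than the exact project code: build_features is a hypothetical helper name, the label is read from the file name (which I assume contains "spam" or "ham"), and the two throwaway files exist only for the demonstration.

```python
import os
import tempfile

def build_features(emails, word_dict):
    """Return (features, label) for a list of email file paths.

    word_dict is the list of (word, count) pairs from most_common().
    """
    features = []
    label = []
    for email in emails:
        with open(email, encoding="latin-1") as f:
            blob = f.read().split(" ")
        # One row per email: how often each common word occurs in it.
        data = [blob.count(word) for word, _ in word_dict]
        features.append(data)
        # Assumption: the file name says whether the email is spam or ham.
        label.append(1 if "spam" in os.path.basename(email) else 0)
    return features, label

# Tiny demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, text in [("0001.spam.txt", "free offer free"),
                       ("0002.ham.txt", "the meeting")]:
        path = os.path.join(folder, name)
        with open(path, "w", encoding="latin-1") as f:
            f.write(text)
        paths.append(path)
    features, label = build_features(paths, [("free", 2), ("the", 1)])

print(features)  # [[2, 0], [0, 1]]
print(label)     # [1, 0]
```

With the real data, emails would be the list of 5172 paths and word_dict the 3000 most common words, giving 5172 rows of 3000 counts each.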

At this point the features list has length 5172 and the label list has length 5172.

Both features and label are still plain lists. Training expects NumPy arrays, so we convert them.

import numpy as np
features=np.array(features)
label=np.array(label)
# features.shape must be (5172, 3000)
# label.shape must be (5172,)

Now we split the data into training and test sets with train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(features,label,test_size=0.2)

Here test_size=0.2 means that 80% of the data is given to the algorithm for training the model and the remaining 20% is used for testing.
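
One thing to be aware of is that train_test_split shuffles the data randomly, so every run gives a different split and a slightly different accuracy. The sketch below shows one way to make the split repeatable. The random_state and stratify arguments are my additions and are not used in the project code, and the small arrays are made-up stand-ins for features and label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: 10 "emails" with 4 word counts each, half spam and half ham.
features = np.arange(40).reshape(10, 4)
label = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    features,
    label,
    test_size=0.2,
    random_state=42,  # fixed seed: the same split on every run
    stratify=label,   # keep the spam/ham ratio the same in both parts
)

print(X_train.shape, X_test.shape)  # (8, 4) (2, 4)
```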

To train the model we use the Naive Bayes algorithm.

from sklearn.naive_bayes import MultinomialNB

We create an object of this class called clf.
clf=MultinomialNB()

Now we give the training data to the algorithm to train the model.

clf.fit(X_train,y_train)

Now your model is trained. You can check how well it works with the help of accuracy_score.

from sklearn.metrics import accuracy_score
y_pred=clf.predict(X_test)
accuracy_score(y_test,y_pred)

The accuracy score tells us the fraction of test emails that the model classifies correctly.

Now our model is ready to predict, so we take an input email and check whether the model predicts correctly.

We store an email in a variable new_email and split it into words. Then we count how many times each of the most common words stored in word_dict occurs in this input email. Next, we convert that list of counts into a NumPy array and reshape it to (1,3000). Finally, we predict the result with clf, which holds everything the model has learned about spam and ham. Here 1 represents spam and 0 represents ham.
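
The prediction step could look roughly like the sketch below. So that it runs on its own, it uses a toy word_dict with 4 words and a clf trained on 4 made-up rows. These stand-ins, the variable name sample, and the example email are assumptions for illustration. In the project, word_dict has 3000 words and clf is the model trained above.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the project's word_dict and trained clf (illustration only).
word_dict = [("free", 3), ("offer", 2), ("meeting", 2), ("tomorrow", 1)]
X_train = np.array([[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham
clf = MultinomialNB().fit(X_train, y_train)

new_email = "free offer just for you claim your free gift"
blob = new_email.split(" ")

# Count each common word in the new email, in the same order as in training.
sample = [blob.count(word) for word, _ in word_dict]
sample = np.array(sample).reshape(1, len(word_dict))  # (1, 3000) in the project

prediction = clf.predict(sample)[0]
print("spam" if prediction == 1 else "ham")  # spam
```

The important point is that the new email must be turned into counts over exactly the same words, in the same order, as the training data.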

If you want to see the clean, complete code written from scratch, you can visit my GitHub source code. Here is the link:-

If you want to turn this email spam classifier into a website that takes an email as input from the user and predicts whether it is spam or ham, you can visit my next blog post on Medium.
Here is the link:- Coming soon...

Conclusion:-

An email spam classifier is a great project for learning machine learning. Filtering spam email really helps, because spam has become a big problem on the internet, and machine learning is an effective way to filter it out. Many algorithms can be used for this filtering; we worked with Naive Bayes because it is simple and is known to perform well on word-count features like ours.
