Spam or Ham? Email Classifier Using Python (MultinomialNB vs. XGBoost Classifiers)

Raghav Palriwala
Published in Analytics Vidhya · 8 min read · Jul 11, 2020

Image via cattu (pixabay.com)

Hello there! Not long ago, I was sitting at my computer, awaiting an email from my vendor for a big purchase order. After getting restless by the end of the day, I called the guy and asked about the delay. He assured me that he had sent it that very morning, yet I could not locate it in my inbox. Perplexed, I started scraping through all of my folders, and to my amazement I found it resting in my Spam folder. I got curious and ended up learning how Google was classifying all of my emails automatically without letting me know. It is safe to assume this has happened to a lot of us out there. So I decided to build a spam-or-ham classifier for myself and see if I could get it to work. Continue reading if you want to learn how to make one for yourself!

Also, check out my other posts for more such applications of machine learning algorithms. Share your insights through the comments, and pass the article along to your friends to see what they think about it. You can also follow my articles to create such models and tweak them to your interests.

What are Spam Emails?

Spam email is unsolicited and unwanted junk email sent out in bulk to an indiscriminate recipient list. Typically, spam is sent for commercial purposes. It can be sent in massive volume by botnets, networks of infected computers. While some people view it as unethical, many businesses still use spam. The cost per email is incredibly low, and businesses can send out mass quantities consistently. Spam email can also be a malicious attempt to gain access to your computer.

About the Project

This is a project I am working on while learning concepts of data science and machine learning. The goal here is to identify whether an email is spam or ham. We will take a dataset of labeled email messages and apply classification techniques, then test the model for accuracy and performance on unseen email messages. Similar techniques can be applied to other NLP applications such as sentiment analysis.

Data

I am using the Spambase dataset from UCI's Machine Learning Repository, which can be downloaded from the repository's page.

The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55–57) measure the length of sequences of consecutive capital letters. Here are the definitions of the attributes:

  • 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string (see the short sketch after this list).
  • 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
  • 1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
  • 1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
  • 1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
  • 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
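
To make the word_freq_WORD definition above concrete, here is a minimal sketch of my own (not part of the dataset's pipeline) showing how such a percentage could be computed from raw text; I assume matching is case-insensitive:

import re

def word_freq(text, word):
    # A "word" is any run of alphanumeric characters bounded by
    # non-alphanumerics or end-of-string, per the Spambase definition.
    words = re.findall(r'[A-Za-z0-9]+', text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

print(word_freq('Win FREE money now! Free offer inside.', 'free'))  # ~28.57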

Model

We first fit a Multinomial Naive Bayes classifier and then an XGBoost model, looking for an improvement in results. In the end, the accuracy score and confusion matrix tell us how well each model works.

Multinomial Naive Bayes Classifier

In statistics, Naïve Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models. Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem.
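
As a toy illustration of that decision rule (my own sketch, separate from the pipeline below), the classifier combines a class prior with per-feature likelihoods under the independence assumption and predicts the class with the highest posterior:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: rows = emails, columns = counts of two words.
X_toy = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

clf = MultinomialNB().fit(X_toy, y_toy)
# Internally the model scores each class as
#   log P(class) + sum_j count_j * log P(word_j | class)
# and predicts the argmax.
print(clf.predict([[2, 0]]))        # -> [1]
print(clf.predict_proba([[2, 0]]))  # posterior probabilities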

XGBoost Regressor

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs in major distributed environments (Hadoop, SGE, MPI) and can scale to billions of examples.
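
To see the core idea of gradient boosting in isolation (again a toy sketch of mine, not the article's model), each new tree is fit to the residual errors of the ensemble built so far, and its prediction is added in with a small learning rate:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy).ravel()

pred = np.zeros_like(y_toy)  # start from a zero prediction
learning_rate = 0.1
for _ in range(100):
    residual = y_toy - pred                      # current errors
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    pred += learning_rate * tree.predict(X_toy)  # shrink and add

print('Training MSE:', np.mean((y_toy - pred) ** 2))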

Developing the Model

Step 1: Load the necessary packages and read the data. The data file does not come with column labels, so one might choose to add them for a better understanding of the data. To keep the article clean, I have not pasted the code that builds the full column list here; you can find it in my full code linked at the end of this article.
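
For illustration only, here is one way such a column list could be put together. The placeholder names below are purely hypothetical; the real 48 word names and 6 character names live in the dataset's 'spambase.names' file:

# Sketch only: generic placeholders stand in for the documented names
# (e.g. 'word_freq_make', 'char_freq_;').
word_cols = ['word_freq_%d' % i for i in range(48)]
char_cols = ['char_freq_%d' % i for i in range(6)]
run_cols = ['capital_run_length_average',
            'capital_run_length_longest',
            'capital_run_length_total']
cols = word_cols + char_cols + run_cols + ['classified']
print(len(cols))  # 58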

import pandas as pd
import numpy as np

# 'cols' is the list of 58 column names (57 features plus the
# 'classified' label); its full construction is in the code linked
# at the end of this article.
data = pd.read_csv('spambase.data', names=cols, header=None)
X = data.iloc[:, :-1]  # all 57 feature columns
y = data.classified    # spam (1) / ham (0) labels
print('Data Table \n')
display(X)  # display() renders nicely in Jupyter/IPython notebooks
print('\n\nTags Table')
display(y)

Output:

Now we know that we have 4,601 email samples. Also, notice that the dataset has already been converted from words to numbers, so we can begin building the ML model right away. You can read my article on creating a Fake News Detector, where I discuss the process of converting words to numbers in detail.

Step 2: Split the dataset into training and testing subsets.

from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, random_state=0)
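
A quick sanity check of my own confirms the split sizes and that the spam share is similar in both subsets (since the labels are 0/1, the mean is the spam ratio):

print(X_train.shape, X_test.shape)  # roughly 70/30 of the 4,601 rows
print('Spam ratio - train: %.3f, test: %.3f' % (y_train.mean(), y_test.mean()))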

Step 3: Build a MultinomialNB classifier model on the training subset and then test the model's effectiveness on the test set.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

mnb = MultinomialNB()
mnb.fit(X_train, y_train)

predicted = mnb.predict(X_test)

score = accuracy_score(y_test, predicted)
print('Accuracy Score: \n', (100*score))

Output:

Accuracy Score: 80.95582910934105

We have achieved a score of ~81%, which means about 19% of the emails in the test set were misclassified.

Since we are working to classify spam vs. non-spam emails, it is crucial for us to avoid false-positive classifications, i.e., classifying a non-spam (ham) email as spam. To investigate this, we will check the distribution of our classifications.

Step 4: Create a classification report and confusion matrix (I have used the seaborn library for a more illustrative view of the confusion matrix; you can use the default sklearn confusion_matrix() output as well) to assess how our model is performing:

import seaborn as sn
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import classification_report as cr

# Rows are the true classes (0 = ham, 1 = spam), columns the predictions.
cm1 = cm(y_test, predicted, labels=[0, 1])
df_cm = pd.DataFrame(cm1, index=range(2), columns=range(2))
sn.set(font_scale=1)
sn.heatmap(df_cm, annot=True, annot_kws={'size': 14}, fmt='d').set_title('Confusion Matrix')

print('\nClassification Report: \n', cr(y_test, predicted))

Output:

Result from MultinomialNB Classifier

Looks like about 10% of the classifications are false positives. That cannot be good. Could this be the reason the email from my vendor ended up in my Spam folder? Let me know in the comments section.
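
You can also read the false-positive share straight off the matrix (a small addition of mine); with labels=[0, 1], unpacking the cells in row-major order gives:

tn, fp, fn, tp = cm1.ravel()  # true ham, false spam, missed spam, true spam
print('Ham marked as spam: %d of %d test emails (%.1f%%)'
      % (fp, len(y_test), 100 * fp / len(y_test)))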

We have identified that such a high percentage of false positives can make people lose some important emails. Let us now work with a more sophisticated model: XGBoost, an ensemble of gradient-boosted decision trees.

Step 5: Create an XGBoost model with the training set, test it on the test set, and print out the classification report and confusion matrix.

Note: I use XGBoost's regressor interface (XGBRegressor) here, so the model outputs continuous scores that act like probabilities; we then have to convert each score into a binary classification ourselves. (XGBoost also ships a classifier interface; see the sketch below.)
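
If you prefer the classifier interface, a sketch of the equivalent route looks like this; XGBClassifier's predict_proba supplies the probabilities to threshold (the parameters mirror the regressor used below):

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=120, learning_rate=0.075)
clf.fit(X_train, y_train)
proba_spam = clf.predict_proba(X_test)[:, 1]  # P(spam) for each email
pred_labels = (proba_spam > 0.5).astype(int)  # binary classes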

from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=120, learning_rate=0.075)
xgb.fit(X_train, y_train)
# Optional: fit with early stopping instead
#xgb.fit(X_train, y_train, early_stopping_rounds=10, eval_set=[(X_test, y_test)], verbose=False)

predicted1 = xgb.predict(X_test)  # continuous scores, roughly in [0, 1]
score1 = accuracy_score(y_test, (predicted1 > 0.5))  # threshold at 0.5
print('Accuracy Score on XGBoost: \n', (100*score1))
cm2 = cm(y_test, predicted1 > 0.5, labels=[0, 1])
df_cm = pd.DataFrame(cm2, index=range(2), columns=range(2))
sn.set(font_scale=1)
sn.heatmap(df_cm, annot=True, annot_kws={'size': 14}, fmt='d').set_title('Confusion Matrix')
print('\nClassification Report: \n', cr(y_test, (predicted1 > 0.5)))

Output:

Accuracy Score on XGBoost: 
94.6415640839971
XGBoost Result with False Positive

Using the XGBoost regressor, we have reduced false-positive classifications to less than 3%. Moreover, we have reduced all misclassifications to less than 6% and increased accuracy to ~95%.

Despite the high accuracy, it might not be acceptable to have 3% of ham emails marked as spam. To address that, we'll now suppress false-positive outcomes more aggressively.

Step 6: We will repeat step 5 as-is, with only one small change. When converting the predicted scores into classes, we will mark an email as spam only when its probability is greater than 0.9.

score1 = accuracy_score(y_test, (predicted1 > 0.9))  # stricter 0.9 cutoff
print('Accuracy Score on XGBoost: \n', (100*score1))
cm2 = cm(y_test, predicted1 > 0.9, labels=[0, 1])
df_cm = pd.DataFrame(cm2, index=range(2), columns=range(2))
sn.set(font_scale=1)
sn.heatmap(df_cm, annot=True, annot_kws={'size': 14}, fmt='d').set_title('Confusion Matrix')
print('\nClassification Report: \n', cr(y_test, (predicted1 > 0.9)))

Output:

Accuracy Score on XGBoost: 
87.97972483707458
XGBoost Result with Minimal False Positive Allowed

By preventing the misclassification of ham as spam, we have given up some accuracy and come down to ~88%. I believe reading a few extra spam emails is better than missing out on some very important ones, like a purchase order!
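
If you want to choose the cutoff less arbitrarily, here is a small extension of mine that sweeps thresholds and shows false positives trading off against accuracy:

for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    pred_t = (predicted1 > t).astype(int)
    tn, fp, fn, tp = cm(y_test, pred_t, labels=[0, 1]).ravel()
    print('threshold %.1f: accuracy %.1f%%, false positives %d'
          % (t, 100 * accuracy_score(y_test, pred_t), fp))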

Result

We have successfully created and implemented a machine learning model using two different algorithms. We found that, in our case, XGBoost's gradient-boosted tree ensemble worked better than the Multinomial Naive Bayes algorithm.

We also saw how precision and accuracy can pull in opposite directions: raising the spam threshold cut false positives (higher precision) at the cost of overall accuracy.

Future Work

I intend to extend this project by adding a graphical user interface (GUI) where one can paste any piece of text and get its classification in the results. Write to me if you have tips on implementing this!

Reference

You can find my code here on GitHub.

If you liked my work, throw me some appreciation via sharing and following my stories. This will keep me motivated to share with you all as I keep learning newer things!

If you did not like my work, please share your thoughts and recommendations. They will help me improve and write better pieces for you next time!

Thank you.
