Machine Learning Techniques for Spam Detection in Email

Alina Tabish
12 min read · Aug 23, 2022


A Comparative Analysis


The increase in the number of unwanted emails, known as spam, has created a need for more reliable and effective anti-spam filters. Machine learning algorithms have recently been used successfully to detect and filter spam emails, and the field of natural language processing offers numerous algorithms for this kind of classification. Typically, spam emails contain a few recognisable terms that are fairly obvious indicators that the email is spam.

In this article, we will go through processing the data, exploring it, and applying algorithms to compare the effectiveness of several machine learning techniques: KNN, Random Forest, Naive Bayes, SVM, and Logistic Regression. A collection of roughly 5,000 emails, including both authentic and spam messages, was utilised in this investigation. These techniques are then thoroughly compared in terms of accuracy, precision, recall, and related measures.

In general, all email messages are labelled as either “Ham” or “Spam.” Ham messages are the intended, safe, acceptable messages in a mailbox, whereas Spam messages are junk: unwanted bulk or commercial messages. Categorising email into Ham and Spam helps separate the two and automate the deletion of spam. Typically, several variables or components contribute to the detection of spam emails. The figure below shows how typical spam detection works in general.

Ham/Spam Email Detection

Importing The Relevant Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from wordcloud import WordCloud
from os import walk
from string import punctuation
from random import shuffle
from collections import Counter
import multiprocessing
import email

import sklearn as sk

# Download the NLTK resources used in the preprocessing steps below.
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
%matplotlib inline

Dataset Selection

The Enron Spam dataset was used for this analysis. The dataset includes 3,675 spam and 1,437 non-spam (“ham”) e-mail messages (5,112 records in total). In the original distribution, however, each email sits in its own .txt file spread across numerous folders, which can make the data harder to work with, especially for novices. Since the dataset is such a valuable resource, let’s organise the data into a single CSV file.

Importing Raw Data

import os

# Walk the dataset directory tree and list its contents.
for (root, dirs, files) in os.walk('enron_dataset', topdown=True):
    print(root)
    print(dirs)
    print(files)
    print('--------------------------------')

Reading all the data from the Enron dataset into the variables allHamData and allSpamData.

pathwalk = walk(r"enron_dataset")

# Collect ham and spam messages into separate lists, based on the
# "ham"/"spam" markers in the file names.
allHamData, allSpamData = [], []
for root, dr, file in pathwalk:
    if 'ham' in str(file):
        for obj in file:
            with open(root + '/' + obj, encoding='latin1') as ip:
                allHamData.append(" ".join(ip.readlines()))

    elif 'spam' in str(file):
        for obj in file:
            with open(root + '/' + obj, encoding='latin1') as ip:
                allSpamData.append(" ".join(ip.readlines()))

Storing it in a dataframe

hamPlusSpamData = allHamData + allSpamData
labels = ["ham"] * len(allHamData) + ["spam"] * len(allSpamData)

raw_df = pd.DataFrame({"email": hamPlusSpamData,
                       "label": labels})

Now the unstructured data has been organised into a dataframe.

Enron Dataset

Checking and exploring the data for null/missing values

raw_df.info()

Since there are no null values, let’s move on to text processing of the data.

Text Processing

Building a machine learning model requires the preprocessing of data, and the quality of the preprocessing determines the model’s performance.

Preprocessing text is the initial stage in NLP’s model-building process.

Following are the various text preparation steps:

  • Removing Punctuations
  • Tokenization
  • Removing Stopwords
  • Stemming
  • Lemmatization

i. Removing Punctuations

The initial step was to remove any punctuation from the data, such as commas and full stops.

import string

# The dataframe built earlier is raw_df; we work on it as `data` from here on.
data = raw_df.copy()

# Function to remove punctuation characters from a piece of text.
def remove_punc(text):
    nonP_text = "".join([char for char in text if char not in string.punctuation])
    return nonP_text

data["body_text_clean"] = data["email"].apply(lambda x: remove_punc(x))
data.head()

ii. Tokenization

Tokenization is a common task in Natural Language Processing (NLP). It is the process of breaking down a text into smaller pieces known as tokens. In this context, tokens can be words, characters, or subwords, so tokenization may be divided into three types: word, character, and subword tokenization.

import re

# Function to apply tokenization: \W+ matches any run of non-word
# characters (anything other than letters, digits, and underscore),
# so splitting on it yields the words of the text.
def tokenize(text):
    tokens = re.split(r"\W+", text)
    return tokens

data["body_text_tokenized"] = data["body_text_clean"].apply(lambda x: tokenize(x))
data.head()

iii. Removing Stopwords

In NLP, stop words are words that carry little useful information. To remove stop words from strings, you have a plethora of options in Python: you can use one of the many natural language processing libraries available, such as NLTK, SpaCy, Gensim, or TextBlob, or you can develop a custom script if you require complete control over which stop words to delete. Let’s use NLTK to remove the stopwords.

import nltk

stopwords = nltk.corpus.stopwords.words("english")

# Function to remove all stopwords from a list of tokens.
def remove_stopwords(token):
    text = [word for word in token if word not in stopwords]
    return text

data["body_text_nonstop"] = data["body_text_tokenized"].apply(lambda x: remove_stopwords(x))
data.head()

iv. Stemming

Stemming is the process of reducing a word to its word stem by stripping suffixes and prefixes, bringing it closer to the root form of the word (the lemma).

# Stemming with NLTK's Porter stemmer.
ps = nltk.PorterStemmer()

def stemming(t_text):
    text = [ps.stem(word) for word in t_text]
    return text

data["body_text_stemmed"] = data["body_text_nonstop"].apply(lambda x: stemming(x))
data.head()

v. Lemmatization

The technique of grouping together the various inflected forms of a word so that they can be analysed as a single item is known as lemmatization.

# Lemmatization with NLTK's WordNet lemmatizer.
wn = nltk.WordNetLemmatizer()

def lemmatizer(t_text):
    text = [wn.lemmatize(word) for word in t_text]
    return text

data["body_text_lemmatized"] = data["body_text_stemmed"].apply(lambda x: lemmatizer(x))
data.head()

After all the pre-processing, the original text consisted of 7,175,231 words, whereas the cleaned text consisted of 6,837,440 words, so about 337,791 words were removed in total.

The data after pre-processing looked like this:

Pre-processed Data

Feature Selection

A total of 100 words are taken from a training batch of spam emails or texts to create feature vectors.

i. Spamicity

Pr(word|Spam) = the probability that a given word appears in spam messages.

Pr(word|Ham) = the probability that a given word appears in ham messages.

The spamicity of a given word is then defined as:

Spamicity(word) = Pr(word|Spam) / (Pr(word|Spam) + Pr(word|Ham))

ii. Word Selection

This entails ranking every word and picking the 100 top-ranked terms. A word’s rank depends on how far its spamicity is from the neutral value of 0.5, in either direction: proximity to 1 implies that the word is a spam indicator, whereas proximity to 0 indicates a ham indicator.

It has also been demonstrated that spamicity alone is insufficient to act as a signal of ham or spam. Words having lower Pr(word|Spam) and Pr(word|Ham) values, with spamicity values well below 0.5, do not fulfil the goal of being good indicators.

As a result, it has been suggested to look at the magnitude of the difference:

|Pr(word|Spam) - Pr(word|Ham)|

Word Selection process:

  • Words with |spamicity - 0.5| < 0.05 are filtered out.
  • A threshold (perhaps as low as 1%) is employed to exclude words that appear only rarely in ham and spam messages.
  • Calculate |Pr(word|Spam) - Pr(word|Ham)| for all the remaining words and select the top 100, as illustrated in the sketch after the figure below.

Loud Words in Spam Email
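The word-selection process above can be sketched in a few lines of Python. This is a minimal illustration, not the exact code behind the reported results; it assumes the data dataframe with the "body_text_lemmatized" token lists and "label" column built during preprocessing:

from collections import Counter

spam_tokens = data.loc[data["label"] == "spam", "body_text_lemmatized"]
ham_tokens = data.loc[data["label"] == "ham", "body_text_lemmatized"]

# Fraction of spam/ham messages in which each word appears at least once.
spam_docs, ham_docs = len(spam_tokens), len(ham_tokens)
spam_counts = Counter(w for toks in spam_tokens for w in set(toks))
ham_counts = Counter(w for toks in ham_tokens for w in set(toks))

def spamicity(word):
    p_spam = spam_counts[word] / spam_docs   # Pr(word|Spam)
    p_ham = ham_counts[word] / ham_docs      # Pr(word|Ham)
    return p_spam / (p_spam + p_ham) if (p_spam + p_ham) else 0.5

# Apply the three selection rules: drop near-neutral words, drop rare words,
# then rank by |Pr(word|Spam) - Pr(word|Ham)| and keep the top 100.
vocab = set(spam_counts) | set(ham_counts)
candidates = [
    w for w in vocab
    if abs(spamicity(w) - 0.5) >= 0.05                # not near-neutral
    and (spam_counts[w] / spam_docs >= 0.01           # appears in at least
         or ham_counts[w] / ham_docs >= 0.01)         # ~1% of spam or ham
]
top_100 = sorted(
    candidates,
    key=lambda w: abs(spam_counts[w] / spam_docs - ham_counts[w] / ham_docs),
    reverse=True,
)[:100]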

Algorithm Application

For classification and the development of systems for automatic spam detection, the following machine learning techniques can be used.
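Before fitting the individual classifiers, the cleaned text can be turned into numeric features and split into training and test sets. Below is a minimal shared setup sketch; TF-IDF features, an 80/20 split, and max_features=3000 are illustrative assumptions, not the exact experimental configuration. The per-model sketches that follow reuse the X_train, X_test, y_train, and y_test variables defined here.

# Shared setup for the classifier sketches below: bag-of-words style
# features over the cleaned email text, plus a held-out test set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Join the token lists back into strings for the vectorizer.
corpus = data["body_text_lemmatized"].apply(" ".join)

vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(corpus)
y = (data["label"] == "spam").astype(int)  # 1 = spam, 0 = ham

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)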

i. Naive Bayes

The Naive Bayes classifier is underpinned by Bayes’ theorem. It assumes that the predictors are independent, which means that knowing the value of one attribute does not influence the value of any other attribute.

The posterior probability P(c|x) is calculated from P(c), P(x), and P(x|c) using Bayes’ theorem:

P(c|x) = (P(x|c) × P(c)) / P(x)

This technique has the advantages of a quick training speed (the mean and variance of the training data are cheap to compute), a grounding in statistical modelling, and being straightforward to apply. The downside of this strategy is that it does not hold up well when the data is correlated, i.e. when the premise of data independence fails, and it is affected by zero probabilities, which arise when the product of individual word probabilities becomes 0 because a word never appears with a given class in the training data.

The Naive Bayes confusion matrix for the Enron dataset is provided as

Confusion Matrix for Naive Bayes
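As a rough sketch of how such a classifier can be fit (reusing the X_train/X_test split from the setup sketch above; alpha=1.0 is an assumed smoothing value, not necessarily the configuration behind the reported matrix):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Laplace smoothing (alpha=1.0) avoids the zero-probability problem for
# words that never occur with one class during training.
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
print(confusion_matrix(y_test, nb.predict(X_test)))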

ii. Support Vector Machine

The Support Vector Machine (SVM) is a well-known method for reliable classification in huge feature spaces. It is a statistical model that uses machine learning approaches to represent complicated interactions between variables.

The main principle underlying Support Vector Machine is to differentiate information from one class from that of another by utilising an ideal hyperplane that has the maximum distance or margin to the nearest training data points of any class since it has the finest generalisation capabilities. As illustrated in the figure, the idea of a hyperplane is employed to distinguish the two classes.

The discussed method has the following advantages: it is extremely effective in high-dimensional spaces, and it is memory-efficient, since its decision function uses only a subset of the training points. The downsides of this model are that it may not be efficient when the number of features exceeds the number of samples, and direct probability values are not available, so cross-validation is necessary to obtain them. The confusion matrix for SVM applied to the Enron dataset is given as

Confusion Matrix for SVM
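A minimal sketch of fitting a linear SVM on the features from the setup sketch above (LinearSVC and C=1.0 are illustrative assumptions; note that, as discussed, probability estimates are not directly available from this model):

from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# A linear kernel is the usual choice for high-dimensional, sparse text
# features; C controls the margin vs. misclassification trade-off.
svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
print(confusion_matrix(y_test, svm.predict(X_test)))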

iii. Random Forest

Random forests (RFs) are a well-known ensemble learning approach for classification and regression that may be used to solve data categorisation challenges.

Random forests have several advantages over other machine learning techniques, including lower classification error and greater F-scores. Furthermore, their performance is in general comparable to, or even better than, that of SVMs. They can handle missing values and imbalanced data sets, and they are computationally efficient. The confusion matrix for Random Forest applied to the Enron dataset is given as

Confusion Matrix for Random Forest
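A minimal Random Forest sketch on the same features (n_estimators=100 is an assumed, illustrative value):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# An ensemble of 100 trees; each tree is trained on a bootstrap sample and
# considers a random subset of features at each split, which reduces variance.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(confusion_matrix(y_test, rf.predict(X_test)))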

iv. K Nearest Neighbors

The k-nearest neighbour algorithm (KNN) is a method for classifying objects in n-dimensional pattern space based on the closest training samples. When presented with an unknown tuple, the classifier explores the pattern space for the k training tuples that are most similar to the unknown tuple. These k training tuples are the unknown tuple’s k nearest neighbours.

The KNN method has the benefit of producing highly accurate output for small datasets while taking into account all of the features in the dataset. Its downsides are that it compares every test instance against all of the training examples during classification, resulting in high time complexity in the testing phase and a correspondingly high computational cost, and that it requires a considerable amount of memory.

The confusion matrix for KNN applied on the Enron dataset is given as

Confusion Matrix for KNN
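A minimal KNN sketch on the same features (k=5 is an assumed value):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Prediction is deferred to query time: each test e-mail is compared against
# the stored training set, which is why the testing phase is the costly part.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(confusion_matrix(y_test, knn.predict(X_test)))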

v. Logistic Regression

Logistic Regression is a popular machine learning binary classification approach that produces discrete outputs.

Logistic regression calculates probabilities using its underlying logistic function to assess the connection between the dependent variable, such as the class label, and one or more independent variables, such as features. These probabilities are then transformed into binary values using the logistic function, also known as the sigmoid function, in order to produce a prediction. The sigmoid function is an S-shaped curve that takes any real-valued input t (t ∈ ℝ).

LR maps the input to a probability between 0 and 1, though never exactly at those boundaries. These values between 0 and 1 are then turned into either 0 or 1 using a threshold classifier. In our model, logistic regression tells us whether a message is spam or not: if the value is 1, the message is spam; if the value is 0, the message is ham.

The confusion matrix for Logistic Regression on the Enron dataset is as follows:

Confusion Matrix for Logistic Regression
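A minimal Logistic Regression sketch on the same features; predict_proba exposes the sigmoid output in (0, 1), and predict applies the default 0.5 threshold described above (max_iter=1000 is an assumed value to ensure convergence):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# predict_proba returns the sigmoid probabilities; predict thresholds them
# at 0.5 to yield a ham (0) / spam (1) decision.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(confusion_matrix(y_test, lr.predict(X_test)))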

Experiment and Results

When it comes to spam categorization, machine learning algorithms play a critical role. This study discusses five important machine learning models used in spam categorization. E-mail communications are made up of several elements, including a header, content, and so on.

The body is the major section of the e-mail message that determines the structure of the message in order to proceed with the preparation stages. Several Body elements are chosen to characterise or categorise the words as spam, hence designating the communication as spam. When considering techniques, they are chosen based on the features supplied or how the message should be classified.

To define an algorithm’s performance, we examine the parameters such as:

i. Accuracy

The Accuracy parameter indicates the percentage of correct predictions. It does not consider positives and negatives separately, so other performance measures are employed in addition to accuracy. The maximum accuracy is indicated by the value 1, and it is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Python was used to implement the ML methods outlined in the earlier sections. The accuracy score considers the e-mails that are correctly classified as a percentage of all e-mails, and defines how exactly each method performs. In this study, the best accuracy, 0.9987, was obtained with SVM, while the lowest, 0.93742, was obtained with Random Forest.

A bar graph of the accuracy vs classifier is given in the figure below.

Accuracy vs Classifier

ii. Precision

Precision is defined as the ratio of True Positives to all predicted Positives, i.e. the proportion of retrieved instances that are relevant. The precision is calculated as follows:

Precision = TP / (TP + FP)

Spam precision is the proportion of e-mails identified and labelled as spam that are actually spam. The highest spam precision, 1, was obtained with SVM, and the lowest, 0.92, with Random Forest.

iii. Recall

The recall is a measure of how well our model identifies True Positives (Vidhya, 2020a). The recall is computed as the ratio of Positive samples that were correctly categorised as Positive to the total number of Positive samples:

Recall = TP / (TP + FN)

The recall of the model assesses its ability to recognise Positive samples.

The recall is solely concerned with how the positive samples are categorised; it is independent of how the negative samples are categorised (unlike precision, for example). If the model labels every sample as Positive, the recall is 100%, even though all of the negative samples were incorrectly categorised as Positive. Spam recall refers to the spam e-mails that have been accurately identified and classed as spam out of all the spam e-mails analysed. In this study, all the algorithms achieved a spam recall of 1 except for Naive Bayes.

iv. F-Measure

The F-score, often known as the F1-score, measures a model’s accuracy on a dataset. It is used to evaluate binary classification techniques that classify examples as either ‘positive’ or ‘negative.’

The F-score is a method for combining the model’s precision and recall; it is defined as the harmonic mean of the two:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The spam f-score was 1 for SVM and the lowest spam f-score was 0.96 for Random Forest.
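All four measures can be computed with scikit-learn. A minimal sketch, reusing any fitted classifier from the sketches above (here the SVM) and the held-out test set:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = svm.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))    # (TP + TN) / all samples
print("precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("f1-score :", f1_score(y_test, y_pred))          # harmonic mean of P and R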

The following table summarises the performance metrics of each technique.

Comparison Table

Conclusion

Many existing email spam filtering algorithms are incapable of successfully handling some of the spam sent on a regular basis, because spammers continue to develop more advanced tactics to avoid detection by spam filters. With spammers’ constant adoption of new techniques, email spam filtering has emerged as a major research topic. In this article, we propose the Support Vector Machine algorithm for effective and efficient email spam filtering. The algorithm’s efficacy and efficiency were evaluated using accuracy, TPR, FPR, precision, and F-measure on the Enron spam dataset. We conclude that SVMs are a promising method that may be used either at the mail server or at the mail client to reduce the number of spam messages in email users’ inboxes.
