Email Spam Filtering using ML Classification Algorithms

Shantanu Gupta
7 min read · Jan 29, 2020


Spam Filtering with Machine Learning

Abstract

Spam Emails are a constant source of frustration to the average Internet user. The problem of filtering out spam emails is an important one for Email Service Providers to minimize this unsolicited form of communication which is often a cybersecurity threat as well. Here I study two popular classification algorithms of Machine Learning, Logistic Regression (LR) and Support Vector Machines (SVM) to help solve this problem and draw results about the suitability of the two techniques.

1 Introduction

Email communication has become crucial in today’s world. Email spam is unsolicited email sent in bulk. A common term for an email that is not spam is “Ham”, so every email is either Ham or Spam. According to the statistics provided on the website [1], the share of spam among all emails received in March 2019 was 56%, significantly lower than the 69% seen in 2012. This significant decrease has been driven by the rising interest in predictive analytics and machine learning techniques and by the growth in computation power needed to put the theory into practice.

1.1 Statistical classification

In statistics and machine learning, classification, an instance of supervised learning, is the problem of identifying which of a set of categories a given object belongs to, based on already-categorized training data available to the classifier. Binary classification is the type of classification where each observation falls into one of exactly two categories, which makes it relevant to the problem of spam email filtering. There is a wide range of classification algorithms, including Support Vector Machines, Logistic Regression, Naive Bayes, Linear Discriminant Analysis, Decision Trees, and the k-Nearest Neighbors algorithm.

1.1.1 Logistic Regression

Logistic Regression is a statistical model that uses a logistic function to model a dependent variable based on the values of independent variables. The independent variables are the inputs to the model and the dependent variable is the outcome based on those inputs. A logistic function is a sigmoid curve following the equation

f(x) = 1 / (1 + e^(−x))

where x is the linear prediction made by the algorithm.

Logistic Regression estimates the parameters of a logistic model. A binary logistic model has a dependent variable with two possible values; in this case, 0 for spam and 1 for ham. We are interested in estimating the probability that a given email is spam. Mathematically, we want to estimate the probability

P(y = 0 | X)

Our prediction function f(x) returns a probability between 0 and 1. Based on this score, we define a decision boundary using a threshold value. We assume that a given email is equally likely to be ham or spam. Thus if

f(x) ≥ 0.5

we classify the email as spam; otherwise as ham. We model this probability for the given data using the equation

P(y = 0 | X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + … + βnXn)))

where βi is the coefficient of feature Xi and β0 is a constant.

We then use a cost function that heavily penalizes confident but wrong predictions on the training set and assigns a low cost to confident, correct predictions. [2]
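As a rough sketch (not the article's own code), the logistic function and this cost function, the cross-entropy, can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) function: maps the linear prediction x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(y_true, y_prob):
    # Cost function: large cost for confident wrong predictions,
    # small cost for confident correct ones
    eps = 1e-15  # guard against log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(sigmoid(0.0))  # 0.5 -- exactly at the decision boundary
# Confident and correct (true label 1, predicted probability 0.99): low cost
print(cross_entropy(np.array([1.0]), np.array([0.99])))
# Confident but wrong (true label 1, predicted probability 0.01): high cost
print(cross_entropy(np.array([1.0]), np.array([0.01])))
```

The asymmetry between the last two calls is the point: the cost grows without bound as a confident prediction becomes more wrong.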

1.1.2 Support Vector Machine

The basic Support Vector Machine (SVM) is a non-probabilistic binary linear classification algorithm. The objective in SVM is to maximize the separation between the classes using boundaries formed by the positive and negative samples. The problem is to find the maximum-margin hyperplane

w · x − b = 0

where w is the vector normal to the hyperplane.

The concept has been adapted to accurately perform non-linear classification using kernels to map the inputs to a higher dimensional space. It works by constructing an equidistant maximum margin hyperplane separation between the positive and negative samples in a high-dimensional vector space. Often the data is not linearly separable, thus the need for a non-linear separator. Even in the case of non-linear separation, it is possible that the separation is not perfect, meaning there is a certain level of error that cannot be ignored. The optimization problem is to minimize the cost function of the error for meaningful classification. [3]

1.2 The Enron Data Set

The Enron dataset [4] is a dataset of real emails collected for research. It has been preprocessed to remove personally identifiable information and to address other integrity and confidentiality concerns. It remains one of the few datasets of real email communications available for research use. In the context of this problem, the dataset is randomly split into a training set for learning the model and a testing set for validating the predictions. Each message is stored in a separate text file.

2 Notation and Assumptions

The dataset used for the experiments has been preprocessed for research use, meaning the data is free of HTML markup and similar artifacts. The email data is available in plain text and is fully labeled for learning purposes.

3 Implementation

3.1 Feature Extraction

A subset of 350 ham emails and 350 spam emails has been used to learn the model, which has then been used to classify 260 random emails as ham or spam. Since the dataset in use is preprocessed, no further preprocessing has been applied. A feature matrix has been created from the words in the emails, and this matrix serves as the input to the training algorithm.
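The article does not show how the feature matrix is built; a minimal sketch, assuming a bag-of-words count representation via sklearn's CountVectorizer and two hypothetical toy emails in place of the Enron messages, might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy emails standing in for the Enron messages
emails = [
    "free prize claim your money now",            # spam-like
    "meeting agenda for the monday review attached",  # ham-like
]
labels = [0, 1]  # 0 = spam, 1 = ham, as in Section 1.1.1

vectorizer = CountVectorizer()
# Sparse sample-feature matrix: one row per email, one column per distinct word
X = vectorizer.fit_transform(emails)

print(X.shape)  # (number of emails, number of distinct words)
```

Each entry of X counts how often a word occurs in an email; this is the "feature vector matrix" that both classifiers consume.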

3.2 LR Algorithm

The open-source machine learning library sklearn has been used to implement the classification algorithm, with parameters following the description of the concept above. The input to the algorithm is a 2-D array of feature vectors for the training set along with the respective class labels, 0 or 1. The learned model is then used to classify the unlabeled testing set, with the results shown in the next section.
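A minimal sketch of this step, using hypothetical toy data in place of the 350 + 350 Enron emails, could be:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data; the article uses 350 ham + 350 spam emails
train_emails = [
    "win free money prize now",
    "claim your free prize today",
    "project status meeting tomorrow",
    "please review the attached report",
]
train_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Classify an unseen email with the learned model
X_test = vectorizer.transform(["free prize money"])
print(clf.predict(X_test))  # label 0 (spam) expected on this toy data
```

Note that the test email must be transformed with the same fitted vectorizer so its feature columns line up with those seen during training.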

3.3 SVM Algorithm

Again, the sklearn library has been used to implement the SVM algorithm. Here we have learned the model using two different kernels: the Gaussian (RBF) kernel and the polynomial kernel. The input to both models is the same as for the LR algorithm. An interesting thing to note is that the SVM optimization involves randomness, so slight variations in the results were observed across different runs.
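The two kernel variants can be sketched with sklearn's SVC, again on hypothetical toy data rather than the Enron set; only the kernel argument differs between the two models:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data in place of the Enron training set
train_emails = [
    "win free money prize now",
    "claim your free prize today",
    "project status meeting tomorrow",
    "please review the attached report",
]
train_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Two models differing only in the kernel used
svm_rbf = SVC(kernel="rbf")               # Gaussian (RBF) kernel
svm_poly = SVC(kernel="poly", degree=3)   # polynomial kernel

svm_rbf.fit(X_train, train_labels)
svm_poly.fit(X_train, train_labels)

X_test = vectorizer.transform(["free prize money"])
print(svm_rbf.predict(X_test), svm_poly.predict(X_test))
```

The degree parameter of the polynomial kernel (3 here) is a hypothetical choice; the article does not state which degree was used.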

4 Key Results

5 Conclusion

The results on the given dataset show that both LR and SVM are good classification techniques suited to the problem of email classification for spam filtering. Although SVM comes with a tradeoff of higher computation requirements and longer training time, its results are significantly better than those of Logistic Regression. The experimentation also shows that SVM with a polynomial kernel gives better results than the Gaussian kernel under the current parameters of the approach in use. This can be attributed to the use of the full feature set of the training data in the algorithms, whereas the Gaussian kernel appears to outperform other kernels on a reduced feature set of the training data. [5]

The experiments have been performed with a sample-feature matrix for the classification algorithms without any feature weighting or dimensionality reduction. Further experimentation with techniques like Term Frequency-Inverse Document Frequency (TF-IDF) would be a good way to explore the problem in further depth and at a larger scale.

References

[1] https://www.statista.com/statistics/420391/spam-email-traffic-share/

[2] Feroz, Mohammed & Mengel, Susan. (2015). Examination of data, rule generation and detection of phishing URLs using online logistic regression. Proceedings — 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. 241–250. 10.1109/BigData.2014.7004239.

[3] Metsis, Vangelis & Androutsopoulos, Ion & Paliouras, Georgios, (2006) Spam Filtering with Naive Bayes — Which Naive Bayes?, CEAS.

[4] Enron Email Dataset, prepared by the CALO Project, 2004.

[5] Blanzieri, E. & Bryl, A. (2008). A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review.
