Spam E-Mail Classification using Machine Learning

Hema Kalyan Murapaka
9 min read · Apr 16, 2023

E-Mail has become an essential mode of communication in today’s world, and spam emails are a significant problem for email users. Spam emails can be a nuisance, waste time, and even be a security risk, as they can contain phishing links or malware. Therefore, it is essential to classify emails into spam and non-spam categories accurately.

One of the most effective ways to classify spam emails is by using machine learning algorithms. These algorithms can analyze the text, structure, and metadata of emails to determine whether they are spam or not. This technique is known as spam email classification.

In this blog post, we will explore in detail how spam email classification can be done using machine learning. We will discuss the various techniques and algorithms used for spam email classification, the challenges involved, and potential solutions. We will also look at some real-world applications of spam email classification, such as email filtering and fraud detection. Here is the Project Code link: @Kalyan45-Github

So, if you are interested in learning more about spam email classification, then keep reading!

Let’s break the project into several phases:
1. Data Collection
2. Data Preprocessing
3. Data Visualization
4. Model Training
5. Model Testing
6. Model Evaluation
7. Model Deployment

First, we need a basic idea of the required Python libraries. Let’s start the project implementation by importing them.

# Importing Required Libraries

import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let’s discuss the above-mentioned libraries in detail.

warnings: This is a built-in Python library that provides a way to ignore warning messages generated by other libraries or modules. In this case, the code is set to ignore any warnings that are generated, which allows the code to run without interruption.

NumPy (Numerical Python): This is a popular Python library for numerical computing. It provides support for multidimensional arrays, mathematical functions, linear algebra, random number generation, and more. NumPy is often used in data analysis, scientific computing, and machine learning.

Pandas: This is another popular Python library for data manipulation and analysis. It provides a high-level interface for working with structured data, including support for reading and writing various file formats, data filtering and cleaning, and data visualization. Pandas is often used in data science and machine learning applications.

Matplotlib: This is a widely used Python library for creating static, animated, and interactive visualizations in Python. It provides support for creating various types of charts, graphs, and plots, including line plots, scatter plots, bar charts, histograms, and more. Matplotlib is often used in data analysis, scientific computing, and machine learning to visualize data and insights.

A few other Python libraries are required for this project; we will introduce them as we use them. Let’s move on to the next step.

Data Collection:

Data collection is the process of gathering and acquiring information, facts, and statistics from various sources for analysis, interpretation, and decision-making. It involves identifying the relevant data sources, extracting the necessary data, and organizing the data in a meaningful way. We can extract data from databases or servers. Here, we took the dataset from Kaggle. You can find the dataset in my above-mentioned GitHub profile. Let’s see how to import the dataset.

data = pd.read_csv(r"C:\Users\Desktop\Machine Learning\SPAM.csv")

Note the “r” prefix before the path string: without it, Python interprets the backslashes as escape sequences and raises “SyntaxError: (unicode error) ‘unicodeescape’ codec can’t decode bytes in position 2–3: truncated \UXXXXXXXX escape”.

In the line above, we use read_csv(), a function from the pandas library that reads a CSV file and returns a DataFrame object. A DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labelled axes (rows and columns). You can check the Pandas Documentation for further information.
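Once loaded, a quick sanity check confirms the structure. Here is a minimal sketch, assuming the Kaggle dataset used in this project with its two columns, ‘Category’ and ‘Message’, which we rely on below:

# Quick sanity checks on the loaded DataFrame
print(data.shape)    # (number of messages, number of columns)
print(data.head())   # first five rows: 'Category' and 'Message'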

Data Pre-Processing:

Data preprocessing is a crucial step in the data analysis process. It involves cleaning and transforming raw data to make it suitable for analysis. You can refer to my Life Cycle of Machine Learning Blog for more information about Data Pre-processing. In this project, we pre-process the data as follows:

(a). Checking for Null values:

data.isnull().sum()

isnull().sum() is a pandas idiom used to count the number of missing values (NaN) in each column of a DataFrame. It returns a Series whose index is the columns of the DataFrame and whose values are the count of NaNs in each column. In our case, there are no null values.
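Had the dataset contained missing values, a common remedy (purely hypothetical here, since our data is clean) would be to drop or fill them:

# Hypothetical clean-up if null values were present (ours has none)
data = data.dropna()                                # drop rows with missing values
# or: data['Message'] = data['Message'].fillna('')  # fill missing text with ''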

(b). Label Encoding:

data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1

These two lines of code are used to convert the values in the ‘Category’ column from categorical values (‘spam’ and ‘ham’) to numerical values (0 and 1) for easier analysis and modelling. After running these two lines of code, the ‘Category’ column will have numerical values of 0 and 1 instead of the original categorical values of ‘spam’ and ‘ham’.
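One caveat worth noting (not in the original code): after this replacement, the ‘Category’ column usually keeps pandas’ object dtype, which some scikit-learn metrics reject with a label-type error. Casting to integers avoids that:

# Cast the encoded labels from object dtype to plain integers
data['Category'] = data['Category'].astype(int)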

(c). Input and Target Value Assignment:

X = data['Message']
Y = data['Category']

In the code above, X stores the messages from the dataset and Y stores the corresponding labels (categories) for each message. X is a pandas Series containing the textual data, and Y is a pandas Series containing the label (spam or ham) for each message in X. These variables serve as input for training a machine learning model: X is the feature data and Y is the target variable.
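This is also a natural place for the data-visualization phase listed earlier, putting our matplotlib import to work. As a quick sketch, we can check the class balance, since spam corpora are usually dominated by ham:

# Plot the class distribution (after encoding: 0 = spam, 1 = ham)
counts = Y.value_counts()
print(counts)
counts.plot(kind='bar', title='Ham vs Spam message counts')
plt.xlabel('Category')
plt.ylabel('Number of messages')
plt.show()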

(d). Data Splitting:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

This code snippet uses the train_test_split function from scikit-learn to split the data into training and testing sets. X and Y are the input features and target variables respectively. test_size=0.2 specifies that 20% of the data should be used for testing, while 80% should be used for training. random_state=3 is an optional parameter that sets the random seed for reproducibility.

The function returns four variables:

  • X_train and Y_train are the training data.
  • X_test and Y_test are the testing data.

The training data is used to train the machine learning model, while the testing data is used to evaluate the performance of the model.
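Because the classes are imbalanced, one optional refinement (not used in the original call above) is to stratify the split so that both sets keep the same spam-to-ham ratio:

# Optional variant: preserve the class ratio in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3, stratify=Y)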

(e). Feature Engineering:

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = tf.fit_transform(X_train)
X_test_features = tf.transform(X_test)

Feature engineering is the process of transforming raw data into features that can improve the performance of machine learning models. Here, we use TF-IDF (Term Frequency–Inverse Document Frequency), a popular text feature extraction algorithm that assigns a weight to each term in a document based on how frequently it appears in that document and how unique it is across all documents in the corpus.

We can observe from the code that for X_train we used fit_transform(), whereas for X_test we used only transform(). When you call fit_transform() on the training data X_train, the feature extraction method learns the vocabulary from the text data and generates the corresponding document-term matrix, creating a mapping from the original text to the new feature space. When you then call transform() on the test data X_test, the method maps the test data into the same feature space learned from the training data. This ensures that the feature space is consistent across the training and test data.

Therefore, it is important to call fit_transform() on the training data and transform() on the test data using the same feature extraction method to ensure that the feature space is consistent.
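To see what the vectorizer actually learned, you can inspect the resulting matrix and vocabulary. A small sketch (get_feature_names_out() assumes scikit-learn 1.0 or newer; older versions use get_feature_names()):

# One row per message, one column per vocabulary term
print(X_train_features.shape)
# A few of the terms extracted from the training messages
print(tf.get_feature_names_out()[:10])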

Model Training & Testing:

Model training is the process of training a machine learning model using a set of input data, with the goal of producing a model that can make accurate predictions on new, unseen data. After training the model, we can use it to make predictions on the testing data. To do this, we call the predict() function of our trained model on the X_test_features data, which contains the preprocessed features of the testing messages. In this project, we are using the following Machine learning models:

1. Logistic Regression
2. Decision Trees
3. K Nearest Neighbors
4. Random Forest
5. Stacking Model

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC  # used as the final estimator in the stacking model

# Logistic Regression

lr = LogisticRegression()
lr.fit(X_train_features, Y_train)
lr_train = lr.predict(X_train_features)
lr_test = lr.predict(X_test_features)

# Decision Trees

dtrees = DecisionTreeClassifier()
dtrees.fit(X_train_features, Y_train)
dt_train = dtrees.predict(X_train_features)
dt_test = dtrees.predict(X_test_features)

# K Nearest Neighbors

knn = KNeighborsClassifier()
knn.fit(X_train_features, Y_train)
knn_train = knn.predict(X_train_features)
knn_test = knn.predict(X_test_features)

# Random Forest

rf = RandomForestClassifier()
rf.fit(X_train_features, Y_train)
rf_train = rf.predict(X_train_features)
rf_test = rf.predict(X_test_features)

# Stacking Model

estimators = [('lr', lr), ('dtree', dtrees), ('knn', knn), ('rf', rf)]
stack = StackingClassifier(estimators, final_estimator=SVC(kernel='linear'))
stack.fit(X_train_features, Y_train)
stack_train = stack.predict(X_train_features)
stack_test = stack.predict(X_test_features)

Model Evaluation:

Model evaluation is the process of assessing the performance of a machine learning model on a set of test data. The main purpose of model evaluation is to measure how well the model can generalize to new, unseen data. In this context, after training and testing our model, we can evaluate its performance by calculating different metrics such as accuracy, precision, recall, F1 score, and confusion matrix.

Accuracy: It measures the proportion of correct predictions among the total number of predictions made. It is defined as the number of correct predictions divided by the total number of predictions.

Precision: It measures the proportion of true positive predictions among the total number of positive predictions made. It is defined as the number of true positive predictions divided by the sum of true positive and false positive predictions.

Recall: It measures the proportion of true positive predictions among the total number of actual positive instances in the test data. It is defined as the number of true positive predictions divided by the sum of true positive and false negative predictions.

F1 score: It is the harmonic mean of precision and recall, and it provides a balanced measure between precision and recall.

Confusion matrix: It is a table that shows the number of true positive, false positive, true negative, and false negative predictions made by the model.

We can use these metrics to evaluate the performance of our spam classification model and make any necessary adjustments to improve its accuracy and efficiency.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Logistic Regression

lr_train_acc = accuracy_score(Y_train, lr_train)
lr_test_acc = accuracy_score(Y_test, lr_test)
lr_precision = precision_score(Y_test, lr_test)
lr_recall = recall_score(Y_test, lr_test)
lr_f1 = f1_score(Y_test, lr_test)

# Decision Trees

dt_train_acc = accuracy_score(Y_train, dt_train)
dt_test_acc = accuracy_score(Y_test, dt_test)
dt_precision = precision_score(Y_test, dt_test)
dt_recall = recall_score(Y_test, dt_test)
dt_f1 = f1_score(Y_test, dt_test)

# K Nearest Neighbors

knn_train_acc = accuracy_score(Y_train, knn_train)
knn_test_acc = accuracy_score(Y_test, knn_test)
knn_precision = precision_score(Y_test, knn_test)
knn_recall = recall_score(Y_test, knn_test)
knn_f1 = f1_score(Y_test, knn_test)

# Random Forest

rf_train_acc = accuracy_score(Y_train, rf_train)
rf_test_acc = accuracy_score(Y_test, rf_test)
rf_precision = precision_score(Y_test, rf_test)
rf_recall = recall_score(Y_test, rf_test)
rf_f1 = f1_score(Y_test, rf_test)

# Stacking Model

stack_train_acc = accuracy_score(Y_train, stack_train)
stack_test_acc = accuracy_score(Y_test, stack_test)
stack_precision = precision_score(Y_test, stack_test)
stack_recall = recall_score(Y_test, stack_test)
stack_f1 = f1_score(Y_test, stack_test)
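The snippet above computes accuracy, precision, recall, and F1, but not the confusion matrix mentioned earlier. As a minimal sketch (variable names follow the code above), here is how you might print it for the stacking model and gather every model’s test metrics into one comparison table:

from sklearn.metrics import confusion_matrix

# Confusion matrix for the stacking model:
# rows are actual classes (0 = spam, 1 = ham), columns are predictions
print(confusion_matrix(Y_test, stack_test))

# Collect the test metrics into a single table for side-by-side comparison
results = pd.DataFrame({
    'Model':     ['Logistic Regression', 'Decision Trees', 'KNN',
                  'Random Forest', 'Stacking'],
    'Accuracy':  [lr_test_acc, dt_test_acc, knn_test_acc, rf_test_acc, stack_test_acc],
    'Precision': [lr_precision, dt_precision, knn_precision, rf_precision, stack_precision],
    'Recall':    [lr_recall, dt_recall, knn_recall, rf_recall, stack_recall],
    'F1 Score':  [lr_f1, dt_f1, knn_f1, rf_f1, stack_f1]
})
print(results)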

These scores let us compare the models side by side and pick the best performer for deployment.

Model Deployment:

Model deployment is the process of making the trained model available for use in a production environment. This involves integrating the model with the application or system that will use it to make predictions on new data. For deployment, we first need to serialize the fitted vectorizer and the trained model as pickle files:

import pickle

# Serialize the fitted TF-IDF vectorizer and the trained model
# (file names match the ones loaded in the Streamlit app below;
# "stack" is used here as an example -- pickle whichever model performed best)
with open('feature.pkl', 'wb') as file:
    pickle.dump(tf, file)
with open('model.pkl', 'wb') as file:
    pickle.dump(stack, file)

With the pickle files saved, we can build a simple Streamlit app that loads them and classifies user input:

import streamlit as st
import pickle
from PIL import Image
st.set_page_config(page_title="Spam E-Mail Classification")

# Custom CSS: enlarge the text-input label

st.markdown("""
<style>

div[class*="stTextInput"] label p {
font-size: 26px;
}
</style>
""", unsafe_allow_html=True)

tfidf = pickle.load(open(r"C:\Users\Desktop\Projects\Spam E-Mail Classification\Pickle Files\feature.pkl", 'rb'))
model = pickle.load(open(r"C:\Users\Desktop\Projects\Spam E-Mail Classification\Pickle Files\model.pkl", 'rb'))

st.title("Spam E-Mail Classifier")

image = Image.open(r"C:\Users\Desktop\Projects\Spam E-Mail Classification\Data Source\images.jpg")
st.image(image, use_column_width=True)


input_mail = st.text_input("Enter the Message")

if st.button('Predict'):
    # Map the raw message into the TF-IDF feature space, then classify it
    vector_input = tfidf.transform([input_mail])
    result = model.predict(vector_input)
    st.success("This is a " + ('Spam Mail' if result[0] == 0 else 'Ham Mail'))
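To try the app locally, save this script (for example as app.py — the filename is our choice) and launch it from the terminal with “streamlit run app.py”. Streamlit will open the classifier in your browser, where you can paste a message and click Predict.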

In this blog post, we have explored the problem of spam email classification and demonstrated how it can be tackled using machine learning techniques. We discussed the importance of accurately identifying spam emails and avoiding false positives or negatives, and we explained how machine learning models can be trained on a dataset of labelled emails to automatically classify new emails as either spam or legitimate.

If you learned something new or enjoyed reading this article, please clap it up 👏 and share it so that others will see it. Feel free to leave a comment too.

Visit my Web Portfolio
