Spam Detection using Machine Learning Methods

Coursesteach
Apr 13, 2024


📚Introduction

Big tech companies work hard to catch annoying spam emails and messages before they reach you; keeping customers happy and spam-free is a top priority for them. Apple's iMessage and Google's Gmail are very good at catching spam, so you rarely have to deal with it. If you want to build a spam detection system yourself, this article is for you: I'll show you how to detect spam using Machine Learning and Python.

📚Sections

Motivation
Features
Import Libraries
Data Loading
Data Preprocessing
Data Splitting
Model Training and Testing
Hyperparameter Optimization using Grid Search CV
Model Evaluation
Results
References

😇 Motivation

Embarking on the journey of creating a spam detection system holds immense promise and significance in our modern digital landscape. In an era where big tech companies prioritize customer satisfaction and strive tirelessly to combat spam, there arises an opportunity for us to contribute meaningfully to this ongoing battle. With Apple’s iMessage and Google’s Gmail setting the benchmark for spam detection, our endeavor to develop a similar system using Machine Learning and Python is not just a technical pursuit but a quest to enhance the online experience for countless individuals worldwide. By delving into this project, we embrace the chance to empower users, alleviate digital nuisances, and foster a safer, more enjoyable online environment. Let us embark on this journey with passion, curiosity, and determination, knowing that our efforts have the potential to make a tangible difference in the lives of many.

⭐ Features

Here are some key features of the spam detection system project:

Data Collection and Preprocessing: Implement a robust data collection mechanism to gather a diverse dataset of both spam and legitimate messages. Preprocess the data to extract relevant features and prepare it for training.

Machine Learning Models: Utilize various machine learning algorithms such as Naive Bayes, Support Vector Machines (SVM), or Neural Networks to train models for spam detection. Experiment with different models to determine the most effective approach.

Initial Accuracy: Start from a baseline accuracy of approximately 90% with a simple off-the-shelf classifier.

Accuracy Improvement Goal: Push accuracy well beyond the baseline (the tuned models below reach 96-99%), a substantial enhancement over the initial performance level.

Feature Engineering: Explore different feature engineering techniques to enhance the performance of the models. This may include TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings, or other text representation methods.

Model Evaluation: Develop a comprehensive evaluation strategy to assess the performance of the trained models. Utilize metrics such as accuracy, precision, recall, and F1-score to measure the effectiveness of the spam detection system.

Real-time Detection: Design the system to perform real-time spam detection, allowing users to receive immediate protection against spam messages as they arrive.

📚Import Libraries


import pandas as pd
import numpy as np
import re
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import metrics

📚Data Loading

df = pd.read_csv("https://raw.githubusercontent.com/Sanjay-dev-ds/spam_ham_email_detector/master/spam.csv", encoding= 'latin-1')
df.head()

📚Data Preprocessing

Remove duplicate values

df = df.drop_duplicates(keep='first')

Split into independent and dependent variables

All we really need from this dataset to train the spam detection model are the label and the message text, so let's grab just those two columns ('Label' and 'EmailText').


x = df['EmailText'].values
y = df['Label'].values

Text Pre-Processing

Create a function that lowercases the text, removes special characters, collapses a few common filler words into a placeholder token, and replaces each word with its stem using the Porter stemmer algorithm:

porter_stemmer = PorterStemmer()

def preprocessor(text):
    text = text.lower()
    # replace every non-word character with a space
    text = re.sub(r"\W", " ", text)
    # collapse a few common filler words into a single placeholder token
    text = re.sub(r"\s+(in|the|all|for|and|on)\s+", " _connector_ ", text)
    words = re.split(r"\s+", text)
    stemmed_words = [porter_stemmer.stem(word=word) for word in words]
    return ' '.join(stemmed_words)
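
As a quick sanity check (my own example, not from the original post), you can run the preprocessor on a made-up message; the exact output depends on the stemmer, but it should come back lowercased, with punctuation replaced by spaces, filler words collapsed, and every word reduced to its stem:

sample_text = "Congratulations!! You have WON a FREE ticket, claim it in the next 2 hours"
print(preprocessor(sample_text))
# expect lowercase text, punctuation replaced by spaces, ' in ' collapsed to
# ' _connector_ ', and each word reduced to its Porter stem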

Create a tokenizer function that:

puts a space around special characters

splits the text on whitespace

def tokenizer(text):
    # pad every special character with spaces so it becomes its own token
    text = re.sub(r"(\W)", r" \1 ", text)
    return re.split(r"\s+", text)
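
Again, a small check of my own (not in the original post): the tokenizer pads special characters with spaces and then splits on whitespace.

print(tokenizer("free entry in 2 a wkly comp"))
# ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp']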

Feature extraction

To use text data for prediction, you first have to break it down into individual words and drop some of them; that step is called tokenization. Then those words have to be turned into numbers, either integers or floating-point values, so a machine learning algorithm can work with them. That whole process is known as feature extraction (or vectorization).

CountVectorizer from scikit-learn is a handy tool that turns a collection of text documents into a matrix of numbers by counting word occurrences. It can also clean up the text before counting, which makes it very convenient for working with text data.

CountVectorizer is used to transform the corpus of text into a matrix of term counts. The two non-default parameters used here are:

min_df=0.006 (a term must appear in at least 0.6% of the messages to be kept)

ngram_range=(1,2) (word-level unigrams and bigrams)


vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1,2), min_df=0.006, preprocessor=preprocessor)
x = vectorizer.fit_transform(x)
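
To get a feel for what the vectorizer produced (an inspection step I'm adding, not in the original code), look at the shape of the document-term matrix and a few of the n-grams that survived the min_df cut-off:

print(x.shape)                                  # (number of messages, number of kept n-grams)
print(vectorizer.get_feature_names_out()[:20])  # first few unigrams/bigrams
# on scikit-learn versions older than 1.0, use vectorizer.get_feature_names() instead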

Check whether the data is imbalanced

sns.countplot(x=df['Label'])
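
The plot makes the imbalance visible; to see the exact numbers (a small addition of mine), value_counts does the same job:

print(df['Label'].value_counts())
# ham     4516
# spam     653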

The target class has an uneven distribution of observations, so we use random oversampling to balance it.

Random oversampling: randomly duplicate examples from the minority class (spam) until the classes are balanced.

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

print('Original dataset shape', Counter(y))

# resample the feature matrix and the target together
x, y = ros.fit_resample(x, y)

print('Modified dataset shape', Counter(y))

# Original dataset shape Counter({'ham': 4516, 'spam': 653})
# Modified dataset shape Counter({'ham': 4516, 'spam': 4516})

📚Data Splitting

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

📚Model Training and Testing

NB Model

clf = MultinomialNB()
clf.fit(x_train, y_train)

Accuracy

y_pred_NB = clf.predict(x_test)
NB_Acc = clf.score(x_test, y_test)
print('Accuracy score = {:.4f}'.format(NB_Acc))
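
Accuracy alone can hide class-level mistakes, so it is also worth printing the confusion matrix and per-class precision/recall for this baseline (an extra check I'm adding, reusing the metrics imported above):

print(confusion_matrix(y_test, y_pred_NB))
print(classification_report(y_test, y_pred_NB))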

Now let’s test this model by taking a user input as a message to detect whether it is spam or not:

sample = input('Enter a message: ')
data = vectorizer.transform([sample]).toarray()
print(clf.predict(data))

SVM

model = SVC(C=1, kernel="linear")
model.fit(x_train, y_train)

Accuracy

accuracy = metrics.accuracy_score(y_test, model.predict(x_test))
accuracy_percentage = 100 * accuracy
accuracy_percentage

📚Hyperparameter Optimization using Grid Search CV

SVM

params = {"C": [0.2, 0.5], "kernel": ['linear', 'sigmoid']}
cval = KFold(n_splits=2)
model = SVC()
TunedModel = GridSearchCV(model, params, cv=cval)
TunedModel.fit(x_train, y_train)

accuracy = metrics.accuracy_score(y_test, TunedModel.predict(x_test))
accuracy_percentage = 100 * accuracy
accuracy_percentage
# 99.0038738240177
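
It is also useful to check which hyperparameter combination the grid search actually selected and its mean cross-validated score (not shown in the original output):

print(TunedModel.best_params_)   # e.g. {'C': 0.5, 'kernel': 'linear'}; depends on the data
print(TunedModel.best_score_)    # mean cross-validated accuracy of that combination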

NB

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
params = {
    'alpha': [0.1, 0.5, 1.0],    # different smoothing values to try
    'fit_prior': [True, False]   # whether to learn class prior probabilities
}

cval = KFold(n_splits=2)
model = MultinomialNB()  # using Multinomial Naive Bayes
TunedModel1 = GridSearchCV(model, params, cv=cval)
TunedModel1.fit(x_train, y_train)

accuracy = metrics.accuracy_score(y_test, TunedModel1.predict(x_test))
accuracy_percentage = 100 * accuracy
accuracy_percentage
# 96.40287769784173
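
The same inspection works for the tuned Naive Bayes model:

print(TunedModel1.best_params_)  # e.g. {'alpha': 0.5, 'fit_prior': True}; depends on the data
print(TunedModel1.best_score_)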

📚Model Evaluation

Confusion matrix (SVM)


sns.heatmap(confusion_matrix(y_test, TunedModel.predict(x_test)), annot=True, fmt="g")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

Classification report (SVM)

print(classification_report(y_test,TunedModel.predict(x_test)))

Using the trained model, predict whether the following five emails are spam or ham

mails = ["Hey, you have won a car !!!!. Conrgratzz",
         "Dear applicant, Your CV has been recieved. Best regards",
         "You have received $1000000 to your account",
         "Join with our whatsapp group",
         "Kindly check the previous email. Kind Regard"]

for mail in mails:
    is_spam = TunedModel.predict(vectorizer.transform([mail]).toarray())
    print(mail + " : " + str(is_spam))
Hey, you have won a car !!!!. Conrgratzz : ['spam']
Dear applicant, Your CV has been recieved. Best regards : ['spam']
You have received $1000000 to your account : ['spam']
Join with our whatsapp group : ['spam']
Kindly check the previous email. Kind Regard : ['ham']
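
If you want to reuse the trained model later without refitting (see reference 1 below), here is a minimal sketch of my own using joblib; it assumes you persist both the fitted vectorizer and the tuned SVM, and the file names are just placeholders:

import joblib

# save the fitted vectorizer and the tuned SVM (hypothetical file names)
joblib.dump(vectorizer, "spam_vectorizer.joblib")
joblib.dump(TunedModel, "spam_svm_model.joblib")

# later, in another script or session
vec = joblib.load("spam_vectorizer.joblib")
model = joblib.load("spam_svm_model.joblib")
print(model.predict(vec.transform(["You have won a free prize, click now"])))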

🔑 Results

The baseline NB model gave 95% accuracy and the SVM 98%.

After model tuning:

The NB model reached 96% accuracy and the SVM 99%.

GitHub

You can find the complete code for the project here:

Link to GitHub

Please follow and 👏 clap for Coursesteach to see the latest updates on this story.

🚀 Elevate Your Data Skills with Coursesteach! 🚀

Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!

🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️‍🗨️ Computer Vision, 🔬 Research — all in one place!

Don’t Miss Out on This Exclusive Opportunity to Enhance Your Skill Set! Enroll Today 🌟 at

Machine Learning projects course

🔍 Explore free top-university Computer Vision, NLP, Machine Learning, Deep Learning, Time Series, and Python projects, access insightful slides and source code, and tap into a wealth of free online resources and GitHub repositories related to Machine Learning projects. Connect with like-minded individuals on Reddit, Facebook, and beyond, and stay updated with our YouTube channel and GitHub repository. Don't wait: enroll now and unleash your Machine Learning potential!

Stay tuned for our upcoming articles, where we will explore specific topics related to Deep Learning in more detail!

Remember, learning is a continuous process. So keep learning, keep creating, and keep sharing with others! 💻✌️

📚GitHub Repository

Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!

Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to a course, or if you have suggestions for improving any Coursesteach content, feel free to reach out and follow.

Together, let’s make this the best AI learning Community! 🚀

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

📚References

  1. How to Save a Machine Learning Model?
  2. What’s the difference between fit and fit_transform in scikit-learn models?
  3. Multiclass Text Classification Notebook.ipynb
