Build a Spam Classifier using the Naive Bayes Algorithm

Sidharth Pandita
Published in hackerdawn
3 min read · May 29, 2021

Do you often receive emails saying that you have won $1 million or free mobile recharges for life? These emails are generally spam and are sent in bulk to users to trick them. In this story, we’ll build a classifier that will mark emails as spam or non-spam based on the text that they contain. We will use the Spam Mails Dataset from Kaggle to train the classifier.

Importing Libraries

Let’s first import the required libraries. If you don’t have a particular library installed, run the command ‘pip install <package_name>’ to install it.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Loading the Dataset

Let’s load the dataset we downloaded from Kaggle.

data = pd.read_csv('spam_ham_dataset.csv')
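
Before splitting, it can help to check how many emails fall into each class. The sketch below is only an illustration: it uses a tiny hand-made DataFrame as a stand-in for the Kaggle file (the rows are invented), and assumes nothing beyond the `text` and `label_num` columns used in this tutorial.

```python
import pandas as pd

# Hypothetical stand-in for spam_ham_dataset.csv: same column names, made-up rows.
sample = pd.DataFrame({
    "text": [
        "Subject: meeting moved to friday",
        "Subject: win a free prize now",
        "Subject: quarterly report attached",
        "Subject: claim your free coupons",
    ],
    "label_num": [0, 1, 0, 1],  # 0 = non-spam, 1 = spam
})

# Count how many emails fall into each class.
class_counts = sample["label_num"].value_counts().sort_index()
print(class_counts)

# Fraction of emails labelled as spam.
spam_fraction = sample["label_num"].mean()
print("Spam fraction:", spam_fraction)
```

Running the same two checks on `data` shows whether the classes are balanced, which matters when reading the accuracy figure later.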

Splitting the Dataset

We will split the dataset for training and testing purposes.

X_train, X_test, Y_train, Y_test = train_test_split(
    data['text'], data['label_num']
)
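
The call above relies on the defaults, so the split changes on every run. As a hedged variation (not what the tutorial itself does), the sketch below fixes the seed and stratifies by label on a small made-up DataFrame. The seed value 42 and the toy rows are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 8 emails, half of them spam.
toy = pd.DataFrame({
    "text": [f"email {i}" for i in range(8)],
    "label_num": [0, 1, 0, 1, 0, 1, 0, 1],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["label_num"],
    test_size=0.25,             # hold out 25%, the default when test_size is omitted
    random_state=42,            # arbitrary seed so the split is the same on every run
    stratify=toy["label_num"],  # keep the spam/non-spam ratio in both parts
)

print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```

With a fixed seed, the accuracy reported later becomes reproducible, at the cost of tying the result to one particular split.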

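Let’s transform the text data into vectors using CountVectorizer. The model cannot work with raw text, so each email is converted into a vector of word counts. The vectorizer is fitted on the training data only, and the same fitted vectorizer is reused later for the test data.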

vectorizer = CountVectorizer().fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
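
To see what these vectors look like, here is a minimal sketch on two made-up sentences. The names `docs` and `cv` are introduced only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize free", "meeting at noon"]
cv = CountVectorizer().fit(docs)

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically).
print(cv.vocabulary_)

# Each row is one document; each column holds the count of one token.
counts = cv.transform(docs).toarray()
print(counts)

# Words that were not seen during fit are ignored at transform time.
unseen = cv.transform(["free lunch"]).toarray()
print(unseen)
```

The last call shows why the vectorizer must be fitted before it is applied to new emails: only words learned from the training data get a column.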

Creating the Model

We will create our model using the Multinomial Naive Bayes algorithm. We’ll then fit the model using the training data.

model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, Y_train)
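
The `alpha` argument controls additive smoothing, which keeps a word that never appeared in one class from getting a probability of zero. The sketch below, on made-up counts, recomputes the smoothed word probabilities by hand and compares them with what scikit-learn stores in `feature_log_prob_`. The arrays `X` and `y` are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: rows are emails, columns are three vocabulary words.
X = np.array([
    [2, 0, 1],   # non-spam
    [1, 0, 0],   # non-spam
    [0, 3, 1],   # spam
])
y = np.array([0, 0, 1])
alpha = 0.1

clf = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate of P(word | class) with additive smoothing:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual = []
for label in (0, 1):
    word_counts = X[y == label].sum(axis=0)
    manual.append((word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

# scikit-learn stores the same quantities as logarithms.
print(np.allclose(np.log(manual), clf.feature_log_prob_))
```

Note that the second word never occurs in the non-spam rows, yet its smoothed probability stays above zero.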

Prediction

Let’s predict the testing data and see how our model performs. In the original run, the model reached an accuracy of 97.75%. Because train_test_split shuffles the data randomly, the exact figure will vary slightly from run to run.

predictions = model.predict(vectorizer.transform(X_test))
print("Accuracy:", 100 * sum(predictions == Y_test) / len(predictions), '%')
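
Accuracy alone does not show which kind of mistake the model makes. As a hedged addition, the sketch below computes the same accuracy with scikit-learn's `accuracy_score` and also prints a confusion matrix. The label arrays are made up, so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = non-spam, 1 = spam.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1])

# Same quantity as sum(predictions == Y_test) / len(predictions), as a fraction.
acc = accuracy_score(y_true, y_pred)
print("Accuracy:", 100 * acc, "%")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In the tutorial's setting, the same two calls would take `Y_test` and `predictions` as arguments. The top-right cell counts legitimate emails flagged as spam, which is usually the more costly error.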

Let’s feed a custom email to the model for prediction. The model predicts the label as 0 (non-spam), which is correct.

model.predict(vectorizer.transform(
    [
        "Hello Mike, I can came across your profile on Indeed, are you available for a short chat over the weekend.",
    ])
)

Let’s feed another email to the model for prediction. The model predicts the label as 1 (spam), which is again correct.

model.predict(vectorizer.transform(
    [
        "Congratulations, you have won the lucky draw. You are entitled to free recharge coupons for life.",
    ])
)
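
Each prediction above has to call `vectorizer.transform` before `model.predict`. One possible way to bundle the two steps is scikit-learn's `make_pipeline`. The sketch below is an alternative arrangement, not the tutorial's own code. It trains on four made-up emails, and the names `spam_pipeline`, `train_texts`, and `new_emails` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training emails for illustration only.
train_texts = [
    "win free coupons now",
    "free prize claim now",
    "meeting agenda for monday",
    "project report for review",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

# The pipeline vectorizes the text and then applies Naive Bayes in one object.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.1))
spam_pipeline.fit(train_texts, train_labels)

new_emails = ["claim your free coupons", "agenda for the project meeting"]
labels = spam_pipeline.predict(new_emails)
print(labels)

# predict_proba returns one row per email: [P(non-spam), P(spam)].
probs = spam_pipeline.predict_proba(new_emails)
print(probs.round(3))
```

Besides the label, `predict_proba` gives the model's estimated probability for each class, which is useful when you want to apply your own threshold for marking an email as spam.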

We have successfully built the email classifier using the Naive Bayes algorithm.
