Classify emails into ham and spam using a Naive Bayes classifier

What are we building?


We'll build a simple email classifier based on Bayes' theorem. The algorithm, implemented in PHP, can be found here: https://github.com/varunon9/naive-bayes-classifier

A brief introduction-

From Wikipedia:
P(A | B) = P(B | A) * P(A) / P(B), where A and B are events and P(B) != 0
P(A | B) is a conditional probability: the likelihood of event A occurring given that B is true.
P(B | A) is also a conditional probability: the likelihood of event B occurring given that A is true.
P(A) and P(B) are the probabilities of observing A and B independently of each other; these are known as the marginal probabilities.
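
As a quick worked example, with made-up numbers (not taken from any real data set): suppose 40% of all emails are spam, the word 'cashback' appears in 60% of spam emails, and it appears in 5% of ham emails. Then-

P('cashback') = 0.6 * 0.4 + 0.05 * 0.6 = 0.27
P(spam | 'cashback') = P('cashback' | spam) * P(spam) / P('cashback') = (0.6 * 0.4) / 0.27 ≈ 0.89

So an email containing 'cashback' would be classified as spam with high confidence.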

Now let's assume that we have a few documents which are already classified as spam or ham (the training set). The question "is this email ham or spam?" can then be restated as: what is the probability that the latest email is ham (or spam), given that it contains the following document? (Here, a document is the text of the email.) Mathematically, we have-

P(ham | bodyText) = probability that the email is ham given that it contains the document bodyText (let's say bodyText = content of the email)
P(spam | bodyText) = probability that the email is spam given that it contains the document bodyText
P(ham | bodyText) = (P(ham) * P(bodyText | ham)) / P(bodyText)
P(spam | bodyText) = (P(spam) * P(bodyText | spam)) / P(bodyText)

Preparing the training set-

We must have a training data set for our classifier to work. We'll use MySQL as the database to store our training set. Let's start with the database schema.

[Figure: training-set database schema]

We'll be creating two tables (a sketch of the corresponding SQL follows the list).

  1. trainingSet with columns document(text) and category(varchar). This table will hold all the emails with their category i.e. ham or spam.
  2. wordFrequency with columns word(varchar), count(int) and category(varchar). This table will hold all the words seen so far along with their count and category.
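
Here is a minimal sketch of how these two tables could be created with PDO. The table and column names match the schema above; the connection credentials and the composite primary key on wordFrequency are my assumptions, not taken from the repository.

<?php
// Sketch: create the two training tables described above.
// Connection credentials are placeholders -- adjust them for your setup.
$pdo = new PDO('mysql:host=localhost;dbname=classifier', 'user', 'password');

// Every training email along with its label (ham or spam).
$pdo->exec('CREATE TABLE IF NOT EXISTS trainingSet (
    document TEXT NOT NULL,
    category VARCHAR(10) NOT NULL
)');

// Every word seen so far, with its per-category count.
// The (word, category) primary key is assumed here so that training
// can upsert counts with ON DUPLICATE KEY UPDATE.
$pdo->exec('CREATE TABLE IF NOT EXISTS wordFrequency (
    word VARCHAR(255) NOT NULL,
    `count` INT NOT NULL DEFAULT 0,
    category VARCHAR(10) NOT NULL,
    PRIMARY KEY (word, category)
)');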

Let's train our classifier with the following data set (I'll be using the https://github.com/varunon9/naive-bayes-classifier application to train it).

  1. Have a pleasurable stay! Get up to 30% off + Flat 20% Cashback on Oyo Room bookings done via Paytm. (SPAM)
  2. Lets Talk Fashion! Get flat 40% Cashback on Backpacks, Watches, Perfumes, Sunglasses & more. (SPAM)
  3. Opportunity with Product firm for Fullstack | Backend | Frontend- Bangalore. (HAM)
  4. Javascript Developer, Full Stack Developer in Bangalore- Urgent Requirement. (HAM)
[Figure: database state after training with the first example]
[Figure: database state after training with the first and second examples]
[Figure: trainingSet table data after training with all 4 examples]
[Figure: wordFrequency table data after training with all 4 examples]
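
Training itself amounts to storing each email in trainingSet and bumping the per-category count of each of its words in wordFrequency. A minimal sketch, reusing the $pdo handle from the schema sketch above (the train() function is my own illustration, not the repository's exact code):

<?php
// Sketch: record one labelled email in the training tables.
function train(PDO $pdo, string $document, string $category): void {
    // Store the raw document with its label.
    $stmt = $pdo->prepare('INSERT INTO trainingSet (document, category) VALUES (?, ?)');
    $stmt->execute([$document, $category]);

    // Increment the per-category count of every word in the document.
    $words = preg_split('/\W+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);
    $upsert = $pdo->prepare(
        'INSERT INTO wordFrequency (word, `count`, category) VALUES (?, 1, ?)
         ON DUPLICATE KEY UPDATE `count` = `count` + 1'
    );
    foreach ($words as $word) {
        $upsert->execute([$word, $category]);
    }
}

// Example: train with the first message from the list above.
train($pdo, 'Have a pleasurable stay! Get up to 30% off + Flat 20% Cashback on Oyo Room bookings done via Paytm.', 'spam');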

Implementing the Algorithm-

Our classifier will implement the following pseudocode-

if (P(ham | bodyText) > P(spam | bodyText)) {
    return 'ham';
} else {
    return 'spam';
}

P(ham | bodyText) = (P(ham) * P(bodyText | ham)) / P(bodyText)
P(spam | bodyText) = (P(spam) * P(bodyText | spam)) / P(bodyText)
Since P(bodyText) is common to both expressions, we can drop it; our goal is not to calculate the actual probabilities but only to compare them.
P(ham) = number of documents belonging to category ham / total number of documents
P(spam) = number of documents belonging to category spam / total number of documents
To calculate these two probabilities (the priors), we'll use the trainingSet table.
P(bodyText | spam) = P(word1 | spam) * P(word2 | spam) * …
P(bodyText | ham) = P(word1 | ham) * P(word2 | ham) * …
To calculate these two probabilities, we'll use the wordFrequency table. Here word1, word2, … word-n are the words of bodyText; treating them as independent of each other is the "naive" assumption that gives the classifier its name.
P(word1 | spam) = count of word1 belonging to category spam / total count of words belonging to category spam
P(word1 | ham) = count of word1 belonging to category ham / total count of words belonging to category ham
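
In terms of our tables, the priors are simple counts over trainingSet. A sketch, again reusing the $pdo handle from above (and assuming at least one training document exists, so the division is safe):

<?php
// Sketch: prior probability of a category, e.g. prior($pdo, 'ham').
function prior(PDO $pdo, string $category): float {
    $total = (int) $pdo->query('SELECT COUNT(*) FROM trainingSet')->fetchColumn();
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM trainingSet WHERE category = ?');
    $stmt->execute([$category]);
    return ((int) $stmt->fetchColumn()) / $total;
}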

Things to note-

What would happen if our classifier encounters a new word that is not present in the training data set? In that case P(new-word | ham) or P(new-word | spam) will be 0, making the whole product 0. There is a second, practical problem as well: multiplying many probabilities, each smaller than 1, quickly underflows floating-point precision.

To deal with the underflow, we take logs and turn the products into sums. The new pseudocode will be-

if (log(P(ham | bodyText)) > log(P(spam | bodyText))) {
    return 'ham';
} else {
    return 'spam';
}

log(P(ham | bodyText)) = log(P(ham)) + log(P(bodyText | ham))
                       = log(P(ham)) + log(P(word1 | ham)) + log(P(word2 | ham)) + …

But wait, our problem is still not solved. If our classifier encounters a word that is not present in our training data set, then P(new-word | category) will be 0, and log(0) is undefined. To solve this problem, we'll use Laplace (add-one) smoothing. Now we'll have-

P(word1 | ham) = (count of word1 belonging to category ham + 1) / (total count of words belonging to ham + number of distinct words in the training data set, i.e. our database)

P(word1 | spam) = (count of word1 belonging to category spam + 1) / (total count of words belonging to spam + number of distinct words in the training data set, i.e. our database)
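
Putting the pieces together, the decision rule with logs and Laplace smoothing could look like the sketch below. It builds on the prior() and table sketches above and is my own illustration of the technique, not the repository's exact code.

<?php
// Sketch: log-probability score of a category for a given body text,
// using Laplace (add-one) smoothing so no word ever contributes log(0).
function score(PDO $pdo, string $bodyText, string $category): float {
    // log(P(category)) -- the prior, see the prior() sketch above.
    $logProb = log(prior($pdo, $category));

    // Total word count in this category and distinct-word count overall.
    $stmt = $pdo->prepare('SELECT COALESCE(SUM(`count`), 0) FROM wordFrequency WHERE category = ?');
    $stmt->execute([$category]);
    $categoryTotal = (int) $stmt->fetchColumn();
    $vocabulary = (int) $pdo->query('SELECT COUNT(DISTINCT word) FROM wordFrequency')->fetchColumn();

    // Add log(P(word | category)) for every word in the text.
    $wordStmt = $pdo->prepare('SELECT `count` FROM wordFrequency WHERE word = ? AND category = ?');
    $words = preg_split('/\W+/', strtolower($bodyText), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $wordStmt->execute([$word, $category]);
        $count = (int) $wordStmt->fetchColumn(); // 0 if the word was never seen
        $logProb += log(($count + 1) / ($categoryTotal + $vocabulary));
    }
    return $logProb;
}

function classify(PDO $pdo, string $bodyText): string {
    return score($pdo, $bodyText, 'ham') > score($pdo, $bodyText, 'spam') ? 'ham' : 'spam';
}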

To further improve the classifier, we can tokenize the bodyText (i.e. the content of the email) more carefully, for example by lowercasing words and filtering out tokens that carry little signal.
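
A simple tokenizer along those lines might look like this; the exact filtering rules (minimum token length, dropping pure numbers) are assumptions to tune against your own data:

<?php
// Sketch: a slightly smarter tokenizer for email body text.
function tokenize(string $bodyText): array {
    $words = preg_split('/\W+/', strtolower($bodyText), -1, PREG_SPLIT_NO_EMPTY);
    // Drop purely numeric tokens and very short words ('a', 'to', ...).
    return array_values(array_filter($words, function (string $w): bool {
        return strlen($w) > 2 && !ctype_digit($w);
    }));
}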

Testing our classifier-

$classifier->classify('Scan Paytm QR Code to Pay & Win 100% Cashback'); // spam

$classifier->classify('Re: Applying for Fullstack Developer'); // ham

Conclusion-

The naive Bayes classifier is easy to implement and gives very good results, provided that the training data set is good. It can also be used to classify mood (happy/sad/neutral) or emotions in tweets (positive/negative/neutral). In case you find any error or problem, please create a GitHub issue at https://github.com/varunon9/naive-bayes-classifier.
