Naive Bayes: Text Classifier for Spam Detection

Naveen Kumar K
Jan 9, 2019


Naive Bayes: Text Processing and Spam Detection in R and Python.

In this article we will try to get an intuitive understanding of the Naive Bayes classifier, one of the algorithms most commonly used for text classification or document classification involving high-dimensional data sets. It is widely used in text analytics applications such as spam detection and sentiment analysis.

The foundation of Naive Bayes is Bayes' Theorem, which works on a concept of “Deductive Reasoning”: knowing the conditions that are related to an event, we try to predict the occurrence of that event. (For example, knowing that keywords like “Credit Card” and “Subscribe” are mainly found in spam emails, we try to find the probability that an email is spam or not when these words occur in it.)
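As a rough illustration of that idea, here is a minimal Python sketch that applies Bayes' Theorem to a single keyword. The function name and all of the probabilities are made up for illustration; they are not taken from the data set used later in this article.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) for one keyword using Bayes' Theorem."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any email (spam or not).
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical numbers: "Credit Card" appears in 30% of spam emails,
# in 1% of normal emails, and 20% of all emails are spam.
p = posterior_spam(p_word_given_spam=0.30, p_word_given_ham=0.01, p_spam=0.20)
print(round(p, 3))
```

With these assumed numbers, an email containing the keyword is far more likely to be spam than the 20% prior alone would suggest.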

Packages used in R: text processing (tm, SnowballC), generating word clouds (wordcloud), generating cross-tables (gmodels), building the Naive Bayes model (e1071).

Libraries used in Python: text processing (NLTK: Natural Language Toolkit), building the Naive Bayes model (sklearn).

It is a simple, easy-to-implement model that can perform very well even when we have only a small amount of text data. However, we need to be careful that every combination of class (Spam, NotSpam) and attribute (Credit Card, Subscribe, Discount) appears in the training data. If even one of these combinations is missing, the frequency-based probability estimate for it will evaluate to zero. (We can use Laplace/Lidstone smoothing to overcome this.)
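The zero-frequency problem and its fix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `smoothed_prob` and the counts are hypothetical, and each keyword is treated as a two-valued attribute (present or absent).

```python
def smoothed_prob(count, class_total, n_values=2, alpha=1.0):
    """Lidstone estimate of P(attribute value | class).

    alpha=1 gives Laplace (add-one) smoothing; alpha=0 gives the raw frequency.
    n_values is the number of values the attribute can take (2 = present/absent).
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts: "discount" never appears in 50 NotSpam training emails.
unsmoothed = 0 / 50                # raw frequency estimate is exactly zero
smoothed = smoothed_prob(0, 50)    # add-one smoothing gives 1 / 52
print(unsmoothed, smoothed)
```

Because Naive Bayes multiplies the per-keyword probabilities together, a single zero would wipe out the whole product; the smoothed estimate stays small but non-zero.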

Intuition: Imagine you receive an email in your inbox and the algorithm wants to classify it as Spam or Not Spam. It looks for certain keywords in the email and classifies the email based on them. This is the basis of spam filter algorithms.

To begin, let us consider a single keyword, “Credit Card”.

When we build a model from the training data, the algorithm learns how many emails contain the keyword “Credit Card” and what the probability is of those emails being Spam or NotSpam.

It creates a frequency table that counts, for each class (Spam and NotSpam), how many emails do (Y) and do not (N) contain “Credit Card”. From these counts it finds the probability of an email being spam when the keyword “Credit Card” occurs.
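The frequency table for one keyword can be sketched as follows. The ten labelled emails are invented purely for illustration, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical training emails: (contains "Credit Card"?, label)
training = [
    (True, "Spam"), (True, "Spam"), (True, "Spam"), (False, "Spam"),
    (True, "NotSpam"), (False, "NotSpam"), (False, "NotSpam"),
    (False, "NotSpam"), (False, "NotSpam"), (False, "NotSpam"),
]

# Frequency table: one count per (keyword present, class) combination.
table = Counter(training)
for (present, label), count in sorted(table.items()):
    print(f"keyword={'Y' if present else 'N'}  class={label:8s} count={count}")

# P(Spam | keyword present) = spam emails with keyword / all emails with keyword
with_keyword = table[(True, "Spam")] + table[(True, "NotSpam")]
p_spam_given_keyword = table[(True, "Spam")] / with_keyword
print(p_spam_given_keyword)
```

In this toy data, three of the four emails that contain the keyword are spam, so the estimated probability is 3/4.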

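The same is repeated for all the keywords (“Credit Card”, “Subscribe”, etc.) and the respective probabilities are calculated. The model is called “naive” because it assumes that, given the class, each keyword occurs independently of the others, so the per-keyword probabilities can simply be multiplied together. Based on the conditional probability formula of Bayes' Theorem, a new email is then classified as Spam or Not Spam depending on which keywords occur in it.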
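Putting the pieces together, here is a small from-scratch sketch of a keyword-based Naive Bayes classifier in Python. It is not the code from the repositories linked in this article: the keyword list, the training messages, the simple substring matching, and the function names are all assumptions made for illustration, and add-one smoothing is used for the per-keyword probabilities.

```python
import math

KEYWORDS = ["credit card", "subscribe", "discount"]

# Hypothetical training set: (message text, label)
TRAIN = [
    ("apply for a credit card and get a discount", "Spam"),
    ("subscribe now for a discount", "Spam"),
    ("your credit card offer is waiting, subscribe today", "Spam"),
    ("are we still meeting for lunch", "NotSpam"),
    ("please send me the report", "NotSpam"),
    ("i lost my credit card, call me", "NotSpam"),
    ("see you at the gym tonight", "NotSpam"),
]


def features(text):
    """One True/False flag per keyword (plain substring match)."""
    text = text.lower()
    return [kw in text for kw in KEYWORDS]


def train(examples, alpha=1.0):
    """Estimate the class prior and smoothed P(keyword present | class)."""
    model = {}
    for label in sorted({lab for _, lab in examples}):
        rows = [features(text) for text, lab in examples if lab == label]
        prior = len(rows) / len(examples)
        likelihoods = [
            (sum(row[i] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(len(KEYWORDS))
        ]
        model[label] = (prior, likelihoods)
    return model


def classify(model, text):
    """Pick the class with the highest log posterior score."""
    present = features(text)
    scores = {}
    for label, (prior, likelihoods) in model.items():
        score = math.log(prior)
        for is_present, p in zip(present, likelihoods):
            # Naive assumption: keywords are independent given the class,
            # so their log probabilities are summed.
            score += math.log(p if is_present else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify(model, "exclusive credit card discount, subscribe now"))
print(classify(model, "lunch tomorrow?"))
```

Summing logarithms instead of multiplying raw probabilities gives the same ranking of classes while avoiding numerical underflow when there are many keywords.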

In the practical case below, we have an SMS text data set with around 5500 SMS messages, of which nearly 86% are normal messages and 14% are spam. Our goal is to build a classifier that separates the two.

Please note that the text mining process used to clean the data before modeling is not covered in this post; I will share it in a separate article. (It will include tokenization, stemming, lemmatization, creating a Document-Term Matrix, TF-IDF (Term Frequency and Inverse Document Frequency), etc.)

Please find the detailed code for the practical cases at the GitHub links below.

Detection and Classification of Spam SMS messages using R Programming

https://github.com/naveenkumark1/Machine_Learning_R/tree/master/Naive_Bayes_(Text_Classification_Spam_Detection)

Detection and Classification of Spam SMS messages using Python

https://github.com/naveenkumark1/Machine_Learning_Python/tree/master/Naive_Bayes_Text_Processing_SMS_Spam_Filtering

I hope the above code and data will be useful when you try to put what you have learned into practice.

About the Author:

Naveen Kumar K is a Data Scientist, Analytics Consultant and Learner. He likes to find solutions to problems involving data. He also teaches Statistics, Programming, Machine Learning, Visualisation and other Data Science subjects.

Stay in touch with me at: http://www.linkedin.com/in/naveenkreddy
