Mobile Phone Spam Filtering with the Naïve Bayes Algorithm — (Part 1)

Rizka Yolanda
Jan 9, 2019


“A classification is a definition comprising a system of definitions.” — Karl Wilhelm Friedrich Schlegel

As the worldwide use of mobile phones has grown, a new avenue for electronic junk mail has opened for disreputable marketers. These advertisers utilize Short Message Service (SMS) text messages to target potential consumers with unwanted advertising known as SMS spam. This type of spam is particularly troublesome because, unlike e-mail spam, many cellular phone users pay a fee per SMS received. Developing a classification algorithm that could filter SMS spam would provide a useful tool for cellular phone providers.

Data Collection 📝

To develop the classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.
This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham. Some examples of spam and ham are shown below.
Samples of ham SMS

  • If he started searching, he will get job in few days. He has great potential and talent.
  • I got another job! The one at the hospital, doing data analysis or something, starts on Monday! Not sure when my thesis will finish.
  • Better. Made up for Friday and stuffed myself like a pig yesterday. Now I feel bleh. But, at least, its not writhing pain kind of bleh.

Samples of spam SMS

  • Congratulations ur awarded 500 of CD vouchers or 125 gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066.
  • December only! Had your mobile 11mths+? You are entitled to update to the latest colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906.
  • Valentines Day Special! Win over £1000 in our quiz and take your partner on the trip of a lifetime! Send GO to 83600 now. 150 p/msg rcvd.

Variables and their data types 📝

The SMS data has two variables: type and text. The type variable indicates whether a message is ham or spam, and the text variable stores the full raw SMS text. We will use both variables to check whether messages are classified correctly. Both variables are currently stored as character vectors; since type is categorical, it is better represented as a factor.

SMS Messages Data
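
As a minimal sketch of this step in R (the file name sms_spam.csv is an assumption; adjust it to wherever you saved the data):

```r
# Minimal sketch: load the SMS data and fix the type variable.
# The file name "sms_spam.csv" is hypothetical; adjust to your copy.
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

str(sms_raw)   # both type and text are character vectors

# type is categorical, so represent it as a factor
sms_raw$type <- factor(sms_raw$type)
table(sms_raw$type)
```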

Population and Sample 📝

The population is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university. They were collected from volunteers who were made aware that their contributions would be made publicly available. The sample contains 747 spam messages and 4,827 ham messages, for a total of 5,574 messages.
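
Continuing the hypothetical sms_raw object from the previous snippet, the class balance can be verified directly:

```r
# Class proportions: 4,827 / 5,574 ham and 747 / 5,574 spam
round(prop.table(table(sms_raw$type)) * 100, 1)
#>  ham spam
#> 86.6 13.4
```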

Naïve Bayes Algorithm for Predictive Classification

To predict the class of each message, we will use the Naïve Bayes method. The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes’ theorem with strong assumptions regarding independence. Historically, this technique became popular with applications in e-mail filtering, spam detection, and document categorization. Although it is often outperformed by other techniques, and despite its naïve design and oversimplified assumptions, this classifier can perform well on many complex real-world problems. And since it is a resource-efficient algorithm that is fast and scales well, it is definitely a machine learning algorithm to have in your toolkit.

Why use Naïve Bayes? Since Naïve Bayes has been used successfully for e-mail spam filtering, it seems likely that it could also be applied to SMS spam. However, relative to e-mail spam, SMS spam poses additional challenges for automated filters. SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk. This limit, combined with small mobile phone keyboards, has led many users to adopt a form of SMS shorthand lingo, which further blurs the line between legitimate messages and spam. Let’s see how a simple Naïve Bayes classifier handles these challenges.

Naïve Bayes Assumption

The reason the algorithm is called naïve is that it assumes:
1.) Each feature is conditionally independent of the other features given the class, and
2.) Each feature in the dataset is as important as any other feature (often neither is the case).

In other words, with Naïve Bayes we assume that the predictor variables are conditionally independent of one another given the response value. This is an extremely strong assumption. Fortunately, Naïve Bayes often produces strong results even when this assumption is violated.
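
Stated formally, this is a standard formulation of the assumption (not specific to our data): given the class, the joint likelihood of the features factors into a product of individual likelihoods.

```latex
% Conditional independence of the features x_1, ..., x_p given class C_k:
P(x_1, \ldots, x_p \mid C_k) = \prod_{i=1}^{p} P(x_i \mid C_k)
```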

Main Characteristic of the Case for using Naive Bayes

Looking at the preceding sample messages, we can notice some distinguishing characteristics of spam. One notable characteristic is that two of the three spam messages use the word “free,” yet the word does not appear in any of the ham messages. On the other hand, two of the ham messages cite specific days of the week, compared to none of the spam messages.

The Naïve Bayes classifier takes advantage of such patterns in word frequency to determine whether an SMS message better fits the profile of spam or ham. While it is not inconceivable that the word “free” would appear outside of a spam SMS, a legitimate message is likely to provide additional words supplying context. For instance, a ham message might ask “are you free on Sunday?”, whereas a spam message might use the phrase “free ringtones.” The classifier computes the probability of spam and ham given the evidence provided by all of the words in the message.
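
To make this concrete, here is a toy calculation in R. The per-class rates for the word “free” are hypothetical, invented purely for illustration; only the priors come from the class counts given earlier.

```r
# Hypothetical likelihoods (illustration only, not estimated from the data):
# assume "free" appears in 20% of spam and 1% of ham messages.
p_free_given_spam <- 0.20
p_free_given_ham  <- 0.01

# Priors from the class counts reported above: 747 spam, 4,827 ham
p_spam <- 747 / 5574
p_ham  <- 4827 / 5574

# Bayes' rule: P(spam | "free")
p_spam_given_free <- (p_free_given_spam * p_spam) /
  (p_free_given_spam * p_spam + p_free_given_ham * p_ham)
p_spam_given_free  # roughly 0.76 under these made-up rates
```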

Mathematical Calculation of Naïve Bayes

Bayesian probability incorporates the concept of conditional probability, the probability of event A given that event B has occurred, denoted P(A|B). In the context of our SMS data, we are seeking the probability that a message belongs to class Ck (where Cspam = spam messages and Cham = ham messages) given its predictor values x1, x2, …, xp. This can be written as P(Ck|x1,…,xp).

The Bayesian formula for calculating this probability is

P(Ck|x1,…,xp) = [P(Ck) × P(x1,…,xp|Ck)] / P(x1,…,xp)   (1)

Where:

  • P(Ck) is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of a message being spam or ham? From the class counts above, the prior probability of spam is about 13% (747/5,574) and of ham about 87% (4,827/5,574).
  • P(X) is the probability of the predictor variables (same as P(x1,…,xp)). Essentially, based on the historical data, what is the probability of each observed combination of predictor values? When new data comes in, this becomes our evidence.
  • P(X|Ck) is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e., spam or ham), what is the probability of observing the predictor values?
  • P(Ck|X) is called our posterior probability. By combining our observed information, we update our a priori probabilities to compute the posterior probability that an observation has class Ck.

We can re-write Eq. (1) in plain English as:

posterior = (prior × likelihood) / evidence

Although Eq. (1) has a simplistic beauty on its surface, it becomes complex and intractable as the number of predictor variables grows. In fact, to compute the posterior probability for a response variable with m classes and a data set with p predictors, Eq. (1) would require m^p probabilities to be computed.
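
This is where the naïve conditional-independence assumption stated earlier pays off: the joint likelihood factors into per-feature terms, so the classifier only needs one-dimensional probabilities per feature and class. A standard statement of the resulting rule (not specific to this dataset) in LaTeX:

```latex
% The evidence P(x_1,...,x_p) is the same for every class, so it can be
% dropped when comparing classes; the posterior is proportional to:
P(C_k \mid x_1, \ldots, x_p) \;\propto\; P(C_k) \prod_{i=1}^{p} P(x_i \mid C_k)
```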

How do you run the Naïve Bayes method in R? Check it out in my next post, here! 😁

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

References

Almeida, Tiago A., and José María Gómez Hidalgo. 2011. “SMS Spam Collection.” Federal University of Sao Carlos. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.

Ingo Feinerer, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software. doi:10.18637/jss.v025.i05.

Lantz, Brett. 2015. Machine Learning with R. Birmingham, United Kingdom: Packt Publishing.
