Mobile Phone Spam Filtering with the Naïve Bayes Algorithm — (Part 2)

Rizka Yolanda
10 min read · Jan 9, 2019


“A classification is a definition comprising a system of definitions.” — Karl Wilhelm Friedrich Schlegel

Building on my last post, we are going to classify the SMS Spam Collection data as spam or ham using a Naïve Bayes classifier. You can find my previous post about this here.

Let’s run the R code for the Naïve Bayes classifier!

1. Data Exploration

The first step towards constructing our classifier involves processing the raw data for analysis. Text data are challenging to prepare, because it is necessary to transform the words and sentences into a form that a computer can understand. We will transform our data into a representation known as bag-of-words, which ignores word order and simply provides a variable indicating whether the word appears at all.
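
As a quick illustration of what this representation looks like (a made-up toy example, not taken from the SMS dataset), two short messages become rows of word indicators over a shared vocabulary:

# Toy bag-of-words example with made-up messages
messages <- c("free prize now", "call now")
vocab    <- c("call", "free", "now", "prize")
bag <- t(sapply(strsplit(messages, " "),
                function(words) as.integer(vocab %in% words)))
colnames(bag) <- vocab
bag
#      call free now prize
# [1,]    0    1   1     1
# [2,]    1    0   1     0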

We’ll begin by importing the CSV data and saving it in a data frame:

sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

Using the str() function, we see that the sms_raw data frame includes 5,574 total SMS messages with two features: type and text. The SMS type has been coded as either ham or spam. The text element stores the full raw SMS text.

str(sms_raw)

The type element is currently a character vector. Since this is a categorical variable, it would be better to convert it into a factor, as shown in the following code:

sms_raw$type <- factor(sms_raw$type)

Examining this with the str() and table() functions, we see that type has now been appropriately recoded as a factor. Additionally, we see that 747 (about 13 percent) of SMS messages in our data were labeled as spam, while the others were labeled as ham:

str(sms_raw$type)
table(sms_raw$type)

For now, we will leave the message text alone. As you will learn in the next section, processing the raw SMS messages will require the use of a new set of powerful tools designed specifically to process text data.

2. Data preparation — cleaning and standardizing text data

The first step in processing text data involves creating a corpus, which is a collection of text documents. To create a corpus, we’ll use the VCorpus() function from the tm package. We’ll also use the VectorSource() reader function to create a source object from the existing sms_raw$text vector, which can then be supplied to VCorpus() as follows:

library(tm)
sms_corpus <- VCorpus(VectorSource(sms_raw$text))

The resulting corpus object is saved with the name sms_corpus. By printing the corpus, we see that it contains a document for each of the 5,574 SMS messages in the data:

print(sms_corpus)

Because the tm corpus is essentially a complex list, we can use list operations to select documents in the corpus. To receive a summary of specific messages, we can use the inspect() function with list operators. For example, the following command will view a summary of the first and second SMS messages in the corpus:

inspect(sms_corpus[1:2])

To view the actual message text, the as.character() function must be applied to the desired messages. To view one message, use the as.character() function on a single list element, noting that the double-bracket notation is required:

as.character(sms_corpus[[1]])

To view multiple documents, we’ll need to use as.character() on several items in the sms_corpus object. To do so, we’ll use the lapply() function, which is a part of a family of R functions that applies a procedure to each element of an R data structure. The lapply() command to apply as.character() to a subset of corpus elements is as follows:

lapply(sms_corpus[1:2], as.character)

As noted earlier, the corpus contains the raw text of 5,574 text messages. In order to perform our analysis, we need to divide these messages into individual words. But first, we need to clean the text, in order to standardize the words, by removing punctuation and other characters that clutter the result. For example, we would like the strings Hello!, HELLO, and hello to be counted as instances of the same word.

The tm_map() function provides a method to apply a transformation (also known as mapping) to a tm corpus. We will use this function to clean up our corpus with a series of transformations and save the result in a new object called sms_corpus_clean.

library(magrittr)  # provides the %>% pipe used below

sms_corpus_clean <- sms_corpus %>%
  tm_map(content_transformer(tolower)) %>%   # lowercase all text
  tm_map(removeNumbers) %>%                  # remove digits
  tm_map(removeWords, stopwords()) %>%       # remove stop words such as "to", "and", "but"
  tm_map(removePunctuation) %>%              # remove punctuation
  tm_map(stemDocument) %>%                   # stem words (requires the SnowballC package)
  tm_map(stripWhitespace)                    # collapse extra whitespace
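
To sanity-check the cleaning (a quick check I’m adding here, not in the original post), compare the first message before and after the transformations:

as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])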

3. Data preparation — splitting text documents into words

Now that the data are processed to our liking, the final step is to split the messages into individual components through a process called tokenization. A token is a single element of a text string; in this case, the tokens are words.

As you might assume, the tm package provides functionality to tokenize the SMS message corpus. The DocumentTermMatrix() function will take a corpus and create a data structure called a Document Term Matrix (DTM) in which rows indicate documents (SMS messages) and columns indicate terms (words).

Creating a DTM sparse matrix, given a tm corpus, involves a single command:

sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

This will create an sms_dtm object that contains the tokenized corpus using the default settings, which apply minimal processing. The default settings are appropriate because we have already prepared the corpus manually.

On the other hand, if we hadn’t performed the preprocessing, we could do so here by providing a list of control parameter options to override the defaults. For example, to create a DTM directly from the raw, unprocessed SMS corpus, we can use the following command:

sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE))

This applies the same preprocessing steps to the SMS corpus in the same order as done earlier. However, comparing sms_dtm to sms_dtm2, we see a slight difference in the number of terms in the matrix. This is because DocumentTermMatrix() applies its cleanup functions only after the text has been split into words, so a few edge cases are handled slightly differently.
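
To see the difference for yourself (a quick check, not in the original post), print both objects; a DocumentTermMatrix reports its number of documents and terms when printed:

sms_dtm
sms_dtm2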

4. Data preparation — Creating training and test datasets

With our data prepared for analysis, we now need to split the data into training and test datasets, so that once our spam classifier is built, it can be evaluated on data it has not previously seen.

We’ll divide the data into two portions: 75 percent for training and 25 percent for testing. Since the SMS messages are stored in a random order, we can simply take the first 4,169 for training and leave the remaining 1,390 for testing (a sketch of the splitting code is shown below). To confirm that the subsets are representative of the complete set of SMS data, let’s compare the proportion of spam in the training and test data frames:

prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))
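
The code that creates sms_dtm_train, sms_dtm_test, and the label vectors is not shown in the original post; a minimal sketch, assuming the rows of sms_dtm keep the order of sms_raw, might look like this:

# Sketch only: split the DTM rows and the class labels in the 75/25 proportions described above
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:nrow(sms_dtm), ]

sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels  <- sms_raw[4170:nrow(sms_raw), ]$type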

Both the training data and test data contain about 13 percent spam. This suggests that the spam messages were divided evenly between the two datasets.

5. Creating a Word Cloud visualization

A word cloud lets us visually inspect the corpus and quickly see which words occur most often. Words that appear more frequently are shown in a larger font, and words that appear less frequently in a smaller one. This can be done with the wordcloud R package.

library(wordcloud)
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

We can also visualize word frequency separately for the raw spam and ham messages by applying the subset() function to the categorical type feature:

par(mfcol = c(1, 2))  # place the two clouds side by side

# Word cloud for the spam messages only
spam <- sms_raw %>%
  subset(type == "spam")
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))

# Word cloud for the ham messages only
ham <- sms_raw %>%
  subset(type == "ham")
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

6. Creating indicator features for frequent words

To complete our data preprocessing, we must reduce the number of features in our training and test document-term matrices (DTMs). To do this, we will use the findFreqTerms() function (again from the tm package) to keep only the terms that appear at least five times:

# Keep only the terms that appear at least five times in the training DTM
sms_dtm_freq_train <- sms_dtm_train %>%
  findFreqTerms(5) %>%
  sms_dtm_train[, .]

# Do the same for the test DTM
sms_dtm_freq_test <- sms_dtm_test %>%
  findFreqTerms(5) %>%
  sms_dtm_test[, .]

Now we shall write a function that converts our sparse document-term matrices from numeric counts to categorical “Yes”/“No” matrices that our algorithm can process.

convert_counts <- function(x) {
  # Recode numeric word counts as a categorical "Yes"/"No" indicator
  ifelse(x > 0, "Yes", "No")
}

Applying our convert_counts function:

# Apply convert_counts() to each column (MARGIN = 2) of the training and test matrices
sms_train <- sms_dtm_freq_train %>%
  apply(MARGIN = 2, convert_counts)
sms_test <- sms_dtm_freq_test %>%
  apply(MARGIN = 2, convert_counts)

7. Training a model on the data

Now comes the fairly straightforward step of training our model on the data and then using that classifier to make predictions on the test set. This requires the e1071 package, which provides the naiveBayes() function:

library(e1071)

# Train the classifier on the training data, then predict labels for the test set
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
sms_pred <- predict(sms_classifier, sms_test)

Evaluating Model Performance

Now we can use the CrossTable() function from the gmodels package to see how our predictions fared.

library(gmodels)

CrossTable(sms_pred, sms_test_labels, prop.chisq = FALSE, chisq = FALSE,
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1201 | 30 | 1231 |
## | 0.995 | 0.164 | |
## -------------|-----------|-----------|-----------|
## spam | 6 | 153 | 159 |
## | 0.005 | 0.836 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|

The overall accuracy of the model is 0.974, but 6 legitimate messages were incorrectly identified as spam and 30 spam messages went unrecognized. Legitimate messages labeled as spam (false positives) can cause serious problems, since people may miss important information. Let’s work on achieving better performance.
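
For reference (my own check, not part of the original write-up), the accuracy quoted above can be recovered directly from the cross-table:

# Correctly classified messages (ham kept as ham, spam caught as spam) over all test messages
(1201 + 153) / 1390   # ~0.974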

How to Improve Model Performance?

Let’s attempt to improve our model’s performance by using a different Laplace value for our classifier:

sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_pred2, sms_test_labels, prop.chisq = FALSE, chisq = FALSE,
           prop.t = FALSE,
           dnn = c("Predicted", "Actual"))
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1202 | 28 | 1230 |
## | 0.996 | 0.153 | |
## -------------|-----------|-----------|-----------|
## spam | 5 | 155 | 160 |
## | 0.004 | 0.847 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1207 | 183 | 1390 |
## | 0.868 | 0.132 | |
## -------------|-----------|-----------|-----------|

In the second, revised model, the accuracy improved to 97.6%, and the number of false positives dropped from 6 to 5 messages. We obtained an even better result! But what did adding Laplace smoothing do for us?

Laplace Smoothing to Improve Performance

What is smoothing? In a research paper from 1996, Chen and Goodman of Harvard University stated that “Smoothing is a technique essential in the construction of n-gram language models… [or] probability distributions over strings P(s) that attempt to reflect the frequency with which each string s occurs as a sentence in natural text” (Chen and Goodman 1996). In the Naïve Bayes setting, smoothing prevents words that never appear in a class from receiving a probability of zero, which would otherwise wipe out the probability of any message containing them.
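
As a toy illustration of why this matters (all numbers below are made up, not taken from this dataset): if a word never occurs in any spam training message, its unsmoothed conditional probability is zero, which zeroes out the probability of every message containing it. Laplace smoothing adds a small count so the estimate stays small but non-zero. A common form of the smoothed estimate for a two-level “Yes”/“No” feature is:

word_in_spam <- 0     # hypothetical: the word never appears in spam training messages
spam_msgs    <- 1000  # hypothetical number of spam training messages
laplace      <- 1     # the value passed to naiveBayes() above
n_levels     <- 2     # the indicator feature has two levels: "Yes" and "No"

word_in_spam / spam_msgs                                      # 0 without smoothing
(word_in_spam + laplace) / (spam_msgs + laplace * n_levels)   # ~0.001 with smoothing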

⬛ Conclusion

We showed the high accuracy that the Naïve Bayes algorithm can achieve when classifying text-based data (it tends to be less well suited to purely numerical features). Adding the Laplace estimator reduced the number of false positives (ham messages erroneously classified as spam). That is a meaningful improvement considering the model’s accuracy was already quite impressive.

⬛ Recommendation

We’d need to be careful before tweaking the model too much, in order to maintain the balance between being overly aggressive and overly passive when filtering spam. Users would rather have a small number of spam messages slip through the filter than see ham messages filtered too aggressively.

That’s all for my post about the Naïve Bayes classifier. Hope it’s useful for all of you guys! Leave your comment below and give me some claps 👏👏👏

References
Almeida, Tiago A., and José María Gómez Hidalgo. 2011. “SMS Spam Collection.” Federal University of Sao Carlos. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software. doi:10.18637/jss.v025.i05.

Lantz, Brett. 2015. Machine Learning with R. Birmingham, United Kingdom: Packt Publishing.
