Spam Filtering Using Bag-of-Words

Introduction

Aditi Mukerjee
The Startup
7 min read · Dec 31, 2020


In this post, we’re going to employ a simple natural language processing (NLP) algorithm known as bag-of-words to classify messages as ham or spam. Using bag-of-words and some NLP-related feature engineering, we’ll get hands-on experience with a dataset of SMS messages, classifying each one as spam or ham.

SPAM/HAM email (photo credits: https://www.lucypark.kr/courses/2015-dm/svm.html)

The Problem: Spam Messages

Spam emails or messages belong to the broad category of unsolicited messages received by a user. Spam takes up unwanted space and bandwidth, amplifies the threat of malware such as trojans, and in general exploits a user’s connections on social networks.

Various techniques are employed to filter out spam messages, usually centered on content-based filtering: specific keywords, links, or websites that are repeatedly sent in bulk to users are what characterize a message as spam.

Bag-of-Words Model

A bag-of-words model allows us to extract features from textual data. As we know, an algorithm doesn’t understand language. Thus, we need to use a numeric representation for the words in the corpus. This numeric representation can later be fed to any algorithm for further analysis.

The basic idea of bag-of-words (BoW) is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix: each document is a row, each word (token) is a column, and the corresponding (row, column) value is the frequency of occurrence of that word or token in that document.

Our objective here is to convert this set of texts to a frequency-distribution matrix. As shown below, the documents are numbered along the rows, each word is a column name, and the corresponding value is the frequency of that word in the document.

This is the frequency matrix created by bag-of-words (https://www.ronaldjamesgroup.com/blog/grab-your-wine-its-time-to-demystify-ml-and-nlp)
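For instance, counting raw word frequencies for two made-up documents gives:

Document          free   spin   now
"free spin now"     1      1     1
"spin now now"      0      1     2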

To reach the frequency-matrix stage, a few preprocessing steps need to be done first:

  • Converting all the words to lower case
  • Removing punctuation
  • Removing stop words. These often include prepositions, helping verbs, and articles (e.g. in, the, an, is). Since they add no value to our model, we need to remove them.

To handle this, we will be using scikit-learn’s CountVectorizer, which does the following (a short sketch follows the list):

  • It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
  • It counts the occurrences of each of those tokens.
  • CountVectorizer automatically converts all tokenized words to lower case. It does this using the lowercase parameter, which is set to True by default.
  • It also ignores all punctuation.
  • The third parameter to take note of is stop_words. By setting this parameter to ‘english’, CountVectorizer will automatically ignore all words (from our input text) that are found in scikit-learn’s built-in list of English stop words.
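Here is a minimal sketch of CountVectorizer on two made-up messages:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Win a FREE prize', 'Are you free']  # toy messages, not from the dataset

# Lowercasing and punctuation stripping are the defaults;
# stop_words='english' drops filler words such as 'a', 'are', 'you'
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # expected: ['free' 'prize' 'win']
print(X.toarray())                         # one row of counts per message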

I have an excellent example in my GitHub repository showing how an SMS message has been used to create a frequency matrix.

Dataset

What we have here in our dataset is a large collection of text data (5,572 rows). Most ML algorithms rely on numerical data as input, and email/SMS messages are usually text-heavy.

Each row contains a message, with the column next to it specifying whether the text is ham or spam. The dataset and the code used are both saved here.


Importing the dataset

To import the dataset into a Pandas DataFrame, we use the couple of lines below:
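A minimal sketch; the file name and the tab-separated, header-less format are assumptions based on how the SMS Spam Collection is commonly distributed, so adjust them to match your copy:

import pandas as pd

# Read the tab-separated file and name the two columns ourselves
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['label', 'sms_message'])
print(df.head())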

Here’s a glimpse of the dataset we are working on.

Data Pre-processing

Label conversion

Converting the labels to binary variables, 0 to represent ‘ham’ (i.e. not spam) and 1 to represent ‘spam’, for ease of computation.
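A one-line sketch, using the column names assumed in the import step above:

# Map the string labels to integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})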

Scikit-learn only deals with numerical values; hence, if we were to leave our label values as strings, scikit-learn would do the conversion internally (more specifically, the string labels would be cast to float values). To avoid unexpected ‘gotchas’ later, it is good practice to feed our categorical values into our model as integers.

Data Visualization

Here is a visual of the data:

Here is the list of features generated by CountVectorizer.

The matrix that represents the frequency of each of these features in our messages(dataset) is given below:

Frequency Matrix
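A sketch of how both the feature list and the frequency matrix above could be generated, assuming the df and column names from the import step:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(stop_words='english')

# Fit on all messages just to inspect the learned vocabulary
doc_matrix = count_vector.fit_transform(df['sms_message'])
print(count_vector.get_feature_names_out()[:20])  # first 20 features

# One row per message, one column per word, values are counts
frequency_matrix = pd.DataFrame(doc_matrix.toarray(),
                                columns=count_vector.get_feature_names_out())
print(frequency_matrix.head())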

Developing the Model

Now that our dataset is ready with its features, we can pass it through any algorithm of our choice. Here, after splitting the dataset into training and test sets (see the sketch below), I’ve used a simple Naive Bayes classifier and a Logistic Regression classifier for demonstration.
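A sketch of the split; the vectorizer is fitted on the training messages only, so the test set stays unseen (column names as assumed above):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df['sms_message'], df['label'], random_state=1)

count_vector = CountVectorizer(stop_words='english')
training_data = count_vector.fit_transform(X_train)  # learn vocabulary on train
testing_data = count_vector.transform(X_test)        # reuse the same vocabulary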

Results of Naive Bayes classifier

Let’s see how our simple model works on a test set using the Naive Bayes classifier.
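A minimal sketch using multinomial Naive Bayes, the variant suited to word-count features, with the split from above:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

predictions = naive_bayes.predict(testing_data)
print('Accuracy:', accuracy_score(y_test, predictions))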

It performs very well on test data with an accuracy of 98%.

Unlike many other classification algorithms, Naive Bayes is able to handle an extremely large number of features. It also performs well in the presence of irrelevant features and is relatively unaffected by them. Its other major advantage is its relative simplicity: Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary. It rarely overfits the data. Another important advantage is that its training and prediction times are very fast for the amount of data it can handle.

Here I have used a simple Logistic Regression classifier for demonstration.

Results of Logistic Regression classifier

Let’s see how our simple model works on a test set using the Logistic Regression classifier.
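A minimal sketch mirroring the Naive Bayes steps:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression(max_iter=1000)  # raised max_iter so the solver converges
log_reg.fit(training_data, y_train)

predictions = log_reg.predict(testing_data)
print('Accuracy:', accuracy_score(y_test, predictions))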

It performs very well on test data with an accuracy of 97.8%, slightly lower than Naive Bayes.

Next, I have used the Logistic Regression classifier with randomized search cross-validation (RandomizedSearchCV) for demonstration.
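A sketch of the tuning step; the parameter grid here is an assumption, not necessarily the exact one used in the original notebook:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Search over the regularization strength C (an assumed grid)
param_dist = {'C': np.logspace(-3, 3, 20)}

search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=10, cv=5, random_state=1)
search.fit(training_data, y_train)

print('Best parameters:', search.best_params_)
print('Test accuracy:', search.score(testing_data, y_test))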

The performance of Logistic Regression with optimized hyperparameters is as good as Naive Bayes.

Next, I used one of my own spam emails to see how well the method predicts spam or ham.

Here is my spam email.

“Hello aditi,Test your luck, Onezy gives you One Free Spin on the Bonus Wheel!Every spin will give you a great bonus!Spin the wheel!Do not miss this amazing opportunity,you only get One Shot to win a maximum of the $10 Welcome bonus and a 100% Deposit bonus. Terms and conditions apply.Play now!”

I applied the same pipeline to this message, and the model classified it as spam with high confidence.
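A sketch of scoring a new message, reusing the fitted vectorizer and Naive Bayes model from above (spam_email is assumed to hold the text quoted earlier):

# Transform with the SAME fitted vectorizer, then predict
new_data = count_vector.transform([spam_email])
print('spam' if naive_bayes.predict(new_data)[0] == 1 else 'ham')
print('P(spam):', naive_bayes.predict_proba(new_data)[0, 1])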

Conclusions

  • Naive Bayes performs well and gives an accuracy of over 98%.
  • The model also predicts, with high confidence, that one of my own emails is spam.

Drawbacks of the Bag-of-Words Model

The bag-of-words model assumes that the words are independent. Thus, it doesn’t take into account any relationship between words. Hence, the meaning of sentences is lost.

Also, the structure of the sentence has no importance in the eyes of our model. Two sentences like “These clams are good” and “Are these clams good?” mean the same to the bag-of-words model, even though one is a statement and one is a question. Additionally, for a large vocabulary, bag-of-words results in very high-dimensional vectors.

The dataset used in this model is available on my Github along with my code that is available for public use. If you have any questions or comments or need any further clarifications please don't hesitate to contact me at aditimukerjee33@gmail.com or reach me at 403–671–7296. If you are interested in collaborating on any project, feel free to reach out to me without any hesitation.

If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment below.
