A guide to spam classification using the NLTK library, stemming, and bag of words.

Aastha Mathur · Published in GoDataScience · Sep 13, 2020 · 6 min read

When I was 14, back in the age of Yahoo Mail, I received an email claiming that I had won a car and that all I had to do was send some money first to claim it.

The 14-year-old me started jumping with joy and shouting all around. This was an unusual event, and everyone in my family gathered around. Then my uncle burst my bubble of happiness in two minutes: he told me these were fake emails designed to grab money through illegal means.

Recently, I watched Jamtara on Netflix (a must-watch series, by the way!). The series intrigued me, and I decided to take a deep dive into combating spam emails. Since I am in the field of Data Science, I decided to create a guide on it.

Getting into the basics-

To begin with, let’s first understand NLP:

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

First things first: a spam classifier falls under the category of Supervised Learning, and it is a classification problem because the output is either spam or ham (not spam).

Coming back, there are many things to learn and work with:

1. NLTK library: It is a platform that helps us work with human language: writing programs over a corpus (paragraphs, sentences), categorizing text, analyzing linguistic structure, and more.

#To import nltk library 
import nltk
nltk.download('stopwords')

Stopwords are words that carry no real significance for the meaning of a sentence; they just help form it so that it reads naturally. To make data processing easier, we remove them.

For example: “Hurray!! You have won a gift hamper worth 5000.” As you can see, ‘you’, ‘have’, and ‘a’ are of no significance; they just add bulk to our data.
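
To make this concrete, here is a minimal sketch of removing stopwords from that sentence (it assumes the nltk.download('stopwords') call above has already been run):

from nltk.corpus import stopwords

sentence = "Hurray!! You have won a gift hamper worth 5000."
stop_words = set(stopwords.words('english'))

# Keep only the words that are not in NLTK's English stopword list
filtered = [word for word in sentence.split() if word.lower() not in stop_words]
print(filtered)  # common words such as 'You', 'have' and 'a' are dropped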

2. Stemming and Lemmatization: Coming to terms with the acronyms (slang) that we use these days is baffling enough for me as a person, let alone for a machine, so NLP came up with the concepts of stemming and lemmatization. Let’s dig into them a bit.

from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()  # also try with a lemmatizer

The same meaning can be written in different word forms, which makes it tough for a machine to understand. Stemming and lemmatization work along the same line: they reduce the concerned word to its stem (base form). For example:

Playing, plays, played → play

One thing you should take note of is the difference between the two:

  • Stemming is only concerned with giving you the stem of a word, irrespective of its meaning, whereas lemmatization gives you a word that makes sense. For example:

In stemming, ‘history’ and ‘historical’ have the stem word ‘histori’.

In lemmatization, the resulting word is ‘history’.

  • Another one: the processing time for lemmatization is naturally higher than for stemming, because it has to give you a word with a meaning (to find meaning in life we need time, guys!). A short sketch comparing the two follows this list.
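
Here is a minimal sketch comparing the two (the word list is my own example; the WordNet corpus download is needed by the lemmatizer):

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # required by the lemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming just chops off suffixes: 'playing', 'plays', 'played' all become 'play'
print([ps.stem(w) for w in ["playing", "plays", "played", "history", "historical"]])

# Lemmatization maps words to their dictionary form (here treating them as verbs)
print([lemmatizer.lemmatize(w, pos='v') for w in ["playing", "plays", "played"]])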

3. Swiftly moving from my lame comments to another concept, Vectorization: we first need to convert the paragraphs into sentences, the sentences into words, and the words into vectors so that the machine can understand them.

To sum up: Tokenization → Bag of words → Vectors
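
As a rough sketch of the tokenization and cleaning step, the loop below turns a couple of toy messages into cleaned, stemmed strings that can then be handed to a vectorizer (the toy messages and variable names are my own, just for illustration):

import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

messages = ["Hurray!! You have won a gift hamper worth 5000.",
            "Are we still meeting for lunch today?"]

corpus = []
for msg in messages:
    text = re.sub('[^a-zA-Z]', ' ', msg).lower()   # keep letters only, lowercase everything
    words = [ps.stem(w) for w in text.split() if w not in stop_words]
    corpus.append(' '.join(words))                 # cleaned string, ready to be vectorized

print(corpus)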

Now, there is more than one way to do the vectorization. Here one can go with BOW (Bag of Words) or TF-IDF (Term Frequency-Inverse Document Frequency) because our dataset is not large; go with Word2Vec or Sent2Vec when the dataset is large.

# Creating the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# (you can also try the TF-IDF model: TfidfVectorizer from the same module)

Not going into the mathematical interpretation, but summing up the idea of both:

BOW represents text in its simplest form. It builds a vocabulary of every word present across the sentences and gives each word equal weight (1 if the word is present in a sentence and 0 if not).

Eg: Consider these sentences-

  1. FRIENDS is an all-time fav TV series.
  2. This TV series is still not liked by many.
  3. FRIENDS is available on Netflix.

We are now building a vocabulary based on all the unique words from the above sentences.

The vocabulary consists of these 17 words: ‘FRIENDS’, ‘is’, ‘an’, ‘all’, ‘time’, ‘fav’, ‘TV’, ‘series’, ‘this’, ‘still’, ‘not’, ‘liked’, ‘by’, ‘many’, ‘available’, ‘on’, ‘Netflix’.

We can now take each of these words and mark its occurrence in the three sentences above with 1s and 0s. This gives us 3 vectors for the 3 sentences:

Vector of Sentence 1: [1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]

Vector of Sentence 2: [0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0]

Vector of Sentence 3: [1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]

So this is the core idea behind the bag of words. As you can see, the ordering of the words is lost, and every new word added to the vocabulary keeps increasing the vector size.

The importance of a word is not visible through this process.
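
Here is a small sketch of the same idea with scikit-learn's CountVectorizer (note that scikit-learn lowercases the tokens and orders the vocabulary alphabetically, so the columns will not match the hand-built order above, but the vectors carry the same information):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "FRIENDS is an all-time fav TV series.",
    "This TV series is still not liked by many.",
    "FRIENDS is available on Netflix.",
]

# binary=True marks presence/absence (1/0) instead of raw word counts
cv = CountVectorizer(binary=True)
X = cv.fit_transform(sentences)

print(cv.get_feature_names_out())  # the 17-word vocabulary
print(X.toarray())                 # one 0/1 vector per sentence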

TF-IDF (Term Frequency-Inverse Document Frequency), on the other hand, is a numerical statistic that reflects how important a word is in each sentence, unlike BOW.
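
If you want to experiment with it, a minimal sketch with scikit-learn's TfidfVectorizer (on the same three example sentences) looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "FRIENDS is an all-time fav TV series.",
    "This TV series is still not liked by many.",
    "FRIENDS is available on Netflix.",
]

# Each word is weighted by how often it appears in a sentence,
# discounted by how common it is across all the sentences.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(sentences)
print(X_tfidf.toarray().round(2))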

I am not going into its mathematical explanation here, because in my project I have used BOW, as order doesn’t have a great role to play here.

4. Data Visualization: Now, after all the data cleaning, we move on to data visualization.

Summary statistics are not the only way to understand data; we need to visualize it for a better understanding.

As you can see, I have plotted a graph of message-length class vs. frequency. It shows the frequencies of the different message-length classes (a small sketch of how to produce such a plot follows the observations below).

  • As we can interpret, the frequency of messages with length between 0 and 200 is the highest, and it keeps decreasing as the message length increases.
  • A shorter mail is more likely to receive attention and a response than a longer one. The people reading have so much to dig through that they are likely to discard a long message rather than read through the whole thing.
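
A minimal sketch of such a plot with pandas and matplotlib could look like the following (the file name 'spam.csv' and the 'message' column name are assumptions; adapt them to your dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with one text column named 'message'
df = pd.read_csv('spam.csv', encoding='latin-1')
df['length'] = df['message'].apply(len)

# Histogram of message lengths vs. frequency
df['length'].plot(kind='hist', bins=50)
plt.xlabel('Message length')
plt.ylabel('Frequency')
plt.show()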

5. The final step is prediction. Here I am using a Random Forest classifier. It is a good first step into machine-learning algorithms and an ensemble technique: it uses decision trees as base learners, each base learner is trained on a sample drawn with replacement (bootstrapping), and the final output is given by the majority vote of all the base learners.
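
A minimal, self-contained sketch of this step (the toy corpus and labels below are stand-ins for the real, preprocessed dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the cleaned corpus and its spam(1)/ham(0) labels
corpus = ["you have won a prize claim now", "meeting at noon tomorrow",
          "free entry win cash now", "see you at dinner tonight"]
labels = [1, 0, 1, 0]

cv = CountVectorizer()
X = cv.fit_transform(corpus)

# Each tree is trained on a bootstrap sample of the data;
# the forest's prediction is the majority vote of all the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Classify a new, unseen message (on the real dataset you would hold out a
# test set with train_test_split and measure the accuracy there)
print(clf.predict(cv.transform(["congratulations you won a free gift claim now"])))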


The utility of this project is that we can deploy it and check whether a received message or mail is spam or not.

I have the code for this, but I want you to try it yourself first, just like I did; my purpose in writing this article was to share my insights from this project. Trust me, there are amazing YouTube videos and articles out there on the same topic. Do explore them too!

Thank you for reading this.💫

LinkedIn · email: aasthamathur2510@gmail.com


Aastha Mathur, GoDataScience
Masters in Statistics, Delhi University. An avid learner and Data Science enthusiast. Python programming.