Spam Classification using NLP

Sameer Kumar · Published in Analytics Vidhya · 6 min read · Sep 17, 2021

Introduction

Natural Language Processing (NLP) is a subset of Artificial Intelligence that helps computers understand, interpret and make use of human languages. It gives computers the ability to read text data, speech data etc. and interpret them.

NLP has various applications in the real world, some of which are:

  1. Sentiment Analysis
  2. Q/A Applications
  3. Text Summarization
  4. Machine Translation
  5. Spam classifiers
  6. Named entity recognition

These are just a few of the applications built on the foundations of NLP. In this article, we will discuss a mini project on a spam classifier, following the basic walkthrough used in most NLP projects.

Following are the steps taken to solve this mini spam classifier project:

  1. Tokenization.
  2. Removing stop words, punctuation marks and cases (text data cleaning).
  3. Stemming/Lemmatization.
  4. Converting words to vectors with Bag of Words (BOW).
  5. Separating the independent and dependent variables in the dataset.
  6. Creating and training the model.

Let us now understand each of the above steps in detail:

A] Tokenization

The first step in a Data Science project is to obtain the right dataset. In the case of a spam classifier dataset, the majority of the data, including the input and output, is in textual format. Our first job is to convert those large volumes of text into smaller pieces called tokens (the smallest units of language).

Raw Dataset Image

Tokenization is the process of breaking large chunks of text into smaller pieces of text (tokens). For a paragraph, a sentence is a token; for a sentence, a word is a token.

What is the importance of tokenization?

The majority of text pre-processing techniques like Bag of Words and TF-IDF work at the token level: each word is converted to a vector before being fed to the model. That is why we convert large texts into tokens.
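Here is a minimal sketch of what tokenization can look like with NLTK (the sample sentence is made up purely for illustration, and the Punkt tokenizer data is assumed to be downloaded):

    import nltk
    nltk.download('punkt')  # one-time download of the tokenizer data

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "Congratulations! You have won a free ticket. Reply to claim."
    sentences = sent_tokenize(text)  # paragraph -> sentence tokens
    words = word_tokenize(text)      # sentence  -> word tokens

    print(sentences)
    print(words)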

B] Removing Stop Words, Punctuation, Cases etc. (Data Cleaning of Text)

When we look at the huge amount of text in our dataset, we have to clean it before converting it to vectors. Words like is, on, if, as, the, again, why etc. appear quite often in textual data. Our job is to remove these stop words, along with the punctuation marks, as they do not add much value during pre-processing.

Before doing this, we first remove the unwanted columns and rename the required columns.

Removing columns

Now we have a dataset with two columns: ‘message’, which contains the incoming text, and ‘label’, which is either spam or ham.
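As a rough sketch, the column clean-up could look like this in Pandas; the file name and the original column names (v1, v2) are assumptions based on the common SMS spam dataset, so adjust them to your own file:

    import pandas as pd

    df = pd.read_csv('spam.csv', encoding='latin-1')

    # keep only the label and the text, dropping the unwanted columns
    df = df[['v1', 'v2']]

    # rename the remaining columns to 'label' and 'message'
    df = df.rename(columns={'v1': 'label', 'v2': 'message'})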

The next step is to remove stop words and punctuation marks and convert the text to lowercase; otherwise, the same words with different cases will be treated as different words.

Code for Text Data cleaning

We will use the NLTK library and import all the required classes for these steps.

The for loop runs over each row in the ‘message’ column. It first replaces every character that is not a-z or A-Z with a space, then lowercases all words using the lower() function. Next, the split() function is applied to each row to convert it into a list of words, so that we can work on every word.

The last step is to apply lemmatization to the words that are not present in the list of stop words and then join them back with spaces.
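A minimal sketch of the cleaning loop described above could look like this; it assumes the dataframe df with a ‘message’ column from the earlier step and a one-time download of the NLTK stopwords and WordNet data:

    import re
    import nltk
    nltk.download('stopwords')
    nltk.download('wordnet')

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    corpus = []
    for message in df['message']:
        review = re.sub('[^a-zA-Z]', ' ', message)   # keep only letters
        review = review.lower()                      # lowercase everything
        review = review.split()                      # row -> list of words
        review = [lemmatizer.lemmatize(word) for word in review
                  if word not in stop_words]         # drop stop words, lemmatize the rest
        corpus.append(' '.join(review))              # join back with spaces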

C] What is Lemmatization?

Lemmatization is the process in which we reduce similar words to their root form, as it appears in the dictionary, by removing their suffixes. We do this to achieve uniformity, as many words in our data share the same root.

Words like studies and studying get converted to study.

Stemming is similar to lemmatization, but stemming at times does not produce meaningful words. Therefore, lemmatization is generally preferred in cases where the meaning of words is crucial, like chatbots and Q/A applications, whereas stemming is widely used in sentiment analysis.
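As a small illustration, here is how a stemmer and a lemmatizer treat a few sample words (the words are chosen only for demonstration, and the WordNet data downloaded in the previous step is assumed to be available):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ['studies', 'studying', 'caring']:
        # e.g. 'studies' stems to 'studi' but lemmatizes to 'study'
        print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word, pos='v'))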

D] Bag of Words(BOW)

Now that we have cleaned the text data, it is time to convert it into numbers/vectors, since the model does not understand human language. We convert words into vectors with techniques such as Bag of Words, TF-IDF and word embeddings.

Let’s talk about Bag of Words here.

What is BOW?

Bag of Words (BOW) is a technique where we first create a vocabulary of the unique words present in the texts and then represent each document by the number of times each word appears in it.

Let’s understand this with an example. Let’s assume that one column contains reviews of a movie.

Review 1: The movie is funny.

Review 2: The movie is long.

Review 3: The movie is good.

First we create a vocabulary of the unique words, and then we create a vector for each review containing the counts of those words.
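To illustrate, the vocabulary and the vectors for the three reviews above would look roughly like this (stop words are kept for the sake of the toy example):

Vocabulary: [the, movie, is, funny, long, good]

  Review 1 → [1, 1, 1, 1, 0, 0]
  Review 2 → [1, 1, 1, 0, 1, 0]
  Review 3 → [1, 1, 1, 0, 0, 1]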

This was just a short example to explain the working of BOW.

Disadvantages of BOW

  1. If new sentences contain new words, the vocabulary size increases and, with it, the length of the vectors.
  2. The vectors contain mostly 0s, resulting in a sparse matrix, and sparse matrices are computationally inefficient.
  3. We retain no information about the grammar or ordering of the text.
  4. More important words are not given any extra weight over common ones.

CountVectorizer is the class used for Bag of Words. While creating vectors, the vocabulary, and hence the vector size, can grow very large. To avoid that problem, we use a parameter called max_features to cap the vocabulary size.
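A minimal sketch of this step with scikit-learn could look as follows; the max_features value of 2500 is just an illustrative choice, and corpus is the list of cleaned messages from the earlier step:

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(max_features=2500)   # cap the vocabulary size
    X = cv.fit_transform(corpus).toarray()    # one row of word counts per message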

Now we have finally converted the words into vectors. We can feed these vectors to the model for training.

E] Dummy variable trap

Label is the target variable in our dataset and contains two classes: spam and ham. The model does not understand the meaning of spam and ham, so we use the get_dummies function of Pandas to convert spam and ham into 1 and 0.

Dummy Variable trap
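A small sketch of this encoding, assuming the ‘label’ column from the earlier clean-up; drop_first=True keeps a single 0/1 column and avoids the dummy variable trap:

    import pandas as pd

    # keeps only the 'spam' column (1 = spam, 0 = ham)
    y = pd.get_dummies(df['label'], drop_first=True)
    y = y.values.ravel().astype(int)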

Now we have both the independent variable X (the messages) and the dependent variable y (the label).

We now split the data into training and testing sets and feed the training data to the model. We will use Multinomial Naive Bayes, and I will talk about that part in my next article.
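As a quick preview of that part, a minimal sketch of the split and the model could look like this (the test size and random state are illustrative choices):

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = MultinomialNB()
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))   # accuracy on the held-out test set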

So this was all about the feature engineering part of this mini project, in which we dealt with stop words, punctuation marks, cases, stemming, lemmatization, Bag of Words etc.

I hope you all liked this article!

I will come up with the last phase of this project very soon.

You can also connect with me on LinkedIn to read more posts and articles related to Data Science. The link will be provided below.

Happy Learning!!
