SMS Spam Classifier (Natural Language Processing)

Pulkit Khandelwal
Published in Analytics Vidhya
8 min read · Feb 17, 2021

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language.

Today, I will explain how to build an SMS spam classifier using machine learning in Python.

We will use the following algorithms:

  • Multinomial Naive Bayes Classifier
  • Support Vector Machine

For this project, we will use Python libraries such as:

  • NLTK (Natural Language Toolkit)
  • Scikit-Learn
  • Pandas

**The Jupyter Notebook can be found on this GitHub account.**

Approach

1. Reading the Data

To read the data, import the Pandas library and read the file using pd.read_csv().

Here the data in the file is tab ('\t') separated, so we must provide the "sep" (separator) parameter. Also, the file does not contain any column names, so we should supply them using the "names" parameter.

The data is stored in a DataFrame named "messages".

To view the first 5 rows of data, use messages.head().

We can see that the first column contains the labels (dependent variable), i.e., whether a message is spam or ham, and the second column contains the actual messages (independent variable).
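The reading step above can be sketched as follows. Since the actual data file isn't included here, a tiny in-memory tab-separated sample stands in for it; the column names "Label" and "Message" match the article.

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for the real SMS data file
raw = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp!\n"

# sep='\t' because fields are tab-separated; names= supplies the missing header
messages = pd.read_csv(io.StringIO(raw), sep="\t", names=["Label", "Message"])

# First rows: Label in one column, Message in the other
print(messages.head())
```

With the real file, `io.StringIO(raw)` would simply be replaced by the file path.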

2. Exploratory Data Analysis

We can see that there are 5572 rows and 2 columns, which means there are 5572 messages, with the two columns named "Label" and "Message".

There are no missing values in the data.

.value_counts() returns the total count for each category, i.e., "ham" and "spam". We can see that there are more ham messages than spam messages.

We see that 4825 out of 5572 messages, or 86.6%, are ham.
This means that any machine learning model we create has to perform better than 86.6% accuracy to beat the majority-class baseline (i.e., always predicting "ham").
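The class-count and baseline reasoning above can be sketched like this, using a toy frame that mirrors the article's ham-heavy imbalance:

```python
import pandas as pd

# Toy labels mirroring the article's class imbalance (ham >> spam)
messages = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "Message": ["..."] * 10,
})

# Counts per category
counts = messages["Label"].value_counts()
print(counts)

# Majority-class baseline: always predicting "ham" achieves this accuracy
baseline = counts["ham"] / len(messages)
print(f"Baseline accuracy: {baseline:.1%}")
```

On the real data the same computation gives 4825 / 5572 ≈ 86.6%.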

3. Data Preprocessing

Here, we calculate the length and punctuation count of each message for further analysis and add them to the "messages" DataFrame as new columns.
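A minimal sketch of this feature step, using Python's `string.punctuation` as the reference set of punctuation characters (the exact counting method in the original notebook isn't shown here):

```python
import string
import pandas as pd

messages = pd.DataFrame({
    "Label": ["ham", "spam"],
    "Message": ["Ok lar... see u soon", "WINNER!! Claim your prize now!!!"],
})

# Character length of each message
messages["Length"] = messages["Message"].apply(len)

# Count of punctuation characters in each message
messages["Punctuation"] = messages["Message"].apply(
    lambda text: sum(ch in string.punctuation for ch in text)
)

print(messages[["Label", "Length", "Punctuation"]])
```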

3.1 Text Cleaning

Messages contain text, but with a lot of punctuation, stop-words (words in a language that do not add much meaning to a sentence and can safely be ignored without sacrificing its meaning), special characters, and many verb forms.

Now we will clean the messages by removing these unnecessary elements.

We will import re (the regex library) and, from the nltk library, import stopwords and WordNetLemmatizer (one method of lemmatization) and create its object.

Now we iterate through every message and, using regex (the substitute method), replace everything except lowercase letters (a-z) and uppercase letters (A-Z) with a blank space. Next, we lowercase the message (since "abc" is not the same as "ABC", this makes learning easier for the machine) and split it into words. The split words are passed through a list comprehension that checks each word against nltk's stopwords collection; if a word is not a stop-word, it is lemmatized using the WordNetLemmatizer object. After each word is lemmatized, we join the words back into a sentence and append it to the corpus list of sentences.

We then replace the original messages with the cleaned messages stored in the corpus.

3.2 Analyzing the difference between Spam and Ham messages

We split the data into spam messages and ham messages for further analysis by comparing the Label column with "spam" or "ham".

Now spam_messages contains only the spam-labeled rows and ham_messages only the ham-labeled rows. This is done just to get better insights into the data.

We can see that spam messages are longer on average than ham messages. The same holds for punctuation: spam messages contain more punctuation on average than ham messages.
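The split-and-compare step can be sketched like this, on a toy frame where the Length and Punctuation columns are assumed to have been computed already:

```python
import pandas as pd

# Toy frame with Length/Punctuation columns already added (as in the article)
messages = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "spam"],
    "Length": [20, 30, 120, 140],
    "Punctuation": [1, 2, 6, 8],
})

# Filter rows by label
spam_messages = messages[messages["Label"] == "spam"]
ham_messages = messages[messages["Label"] == "ham"]

# Compare averages between the two groups
print("Average length -> spam:", spam_messages["Length"].mean(),
      "| ham:", ham_messages["Length"].mean())
print("Average punct  -> spam:", spam_messages["Punctuation"].mean(),
      "| ham:", ham_messages["Punctuation"].mean())
```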

4. Model Building

We split the data into "X", which contains the independent variable, i.e., the messages, and "y", which contains the dependent (target) variable, i.e., the labels (spam or ham).

4.1 Train Test Split

Using the scikit-learn library, we can split the data into train and test sets. Here I have split the data into 67% training data and 33% testing data.
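A minimal sketch of the split, on stand-in data (the `random_state` value is an assumption for reproducibility; the original notebook's seed isn't shown here):

```python
from sklearn.model_selection import train_test_split

# Stand-in corpus; X would be the cleaned messages, y the labels
X = [f"message {i}" for i in range(100)]
y = ["ham"] * 90 + ["spam"] * 10

# test_size=0.33 reserves 33% for testing, leaving 67% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print(len(X_train), len(X_test))
```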

4.2 Dealing with Text (Natural Language) Data

Now it's time to talk about how to deal with text data. We can't pass raw text directly to a machine learning model, as the model only understands numerical data.

To solve this problem, we will use the concept of the TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency). It is a standard technique for transforming text into a meaningful numerical representation, which is then used to fit a machine learning algorithm for prediction.

We could also use CountVectorizer (bag of words), but unlike the TF-IDF vectorizer, CountVectorizer does not weight words by how informative they are.

We can use TfidfVectorizer from the scikit-learn library. Next, create a TfidfVectorizer object and call fit_transform on the data, which converts it into a matrix of sentences and words.

Here, 3733 is the number of sentences in X_train, and 5772 is the total number of unique words obtained from those sentences.
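The vectorization step can be sketched on a tiny corpus; the resulting matrix has one row per sentence and one column per unique word in the learned vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in for the cleaned X_train sentences
X_train = [
    "free prize waiting claim now",
    "see you at lunch",
    "free tickets win cash",
]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
tfidf_matrix = vectorizer.fit_transform(X_train)

# rows = sentences, columns = unique words in the vocabulary
print(tfidf_matrix.shape)
```

On the real training data the same call yields the (3733, 5772) shape mentioned above.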

4.3 Pipelining

We use pipelining because we would otherwise need to repeat the same preprocessing steps on the test data to get predictions, which can be tiresome.

What is convenient about the pipeline object is that it performs all these steps for you: you can provide the data directly, and it will be vectorized and run through the classifier in a single step.

Note: when we predict on custom text later, we can pass the custom text directly to the pipeline, and it will predict the label.

If you don't know about Pipeline, it takes a list of tuples, where each tuple contains a name set by you and the transformer or estimator you want to apply.

from sklearn.pipeline import Pipeline

Multinomial Naive Bayes Classifier

We will import the MultinomialNB model from the scikit-learn library. Next, we create a model named "text_mnb" using Pipeline, where we first provide a TfidfVectorizer() object and then a MultinomialNB() object. They must be provided in this sequence, since we want TfidfVectorizer to run first and its output to be fed to the model. Finally, we fit the model with X_train and y_train.
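The pipeline construction above can be sketched as follows, on toy training data (the step names "tfidf" and "classifier" are illustrative choices, not the original notebook's):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the cleaned training messages and labels
X_train = ["free prize claim now", "see you at lunch",
           "win cash now", "meeting at five"]
y_train = ["spam", "ham", "spam", "ham"]

# Steps run in order: vectorize first, then feed the matrix to the classifier
text_mnb = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
text_mnb.fit(X_train, y_train)

# Raw text goes straight in; the pipeline vectorizes it internally
print(text_mnb.predict(["free cash prize now"]))
```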

Now the Pipeline handles all the internal functionality and performs the steps in order.

To make a prediction, we need to pass the X_test data, and the Pipeline object will handle it, i.e., automatically vectorize it and make predictions for us.

"y_preds_mnb" contains the predictions our model made on X_test, reaching an accuracy of approximately 97%, which is considerably better than the 86.6% baseline.

We must keep in mind that accuracy alone cannot justify that the model is working well. We will use the scikit-learn library to get the confusion_matrix and classification_report.
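A small sketch of why these metrics matter, using made-up labels and predictions rather than the article's actual model output: accuracy looks high, yet the confusion matrix exposes a missed spam message.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Toy true labels and predictions illustrating why accuracy alone misleads
y_test = ["ham"] * 8 + ["spam"] * 2
y_preds = ["ham"] * 8 + ["ham", "spam"]  # one spam wrongly predicted as ham

print(accuracy_score(y_test, y_preds))      # high overall accuracy...
print(confusion_matrix(y_test, y_preds))    # ...but one spam was missed
print(classification_report(y_test, y_preds))
```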

Here we can see that the "ham" label is predicted well, but the "spam" label predictions are not as good, so we can't say the model is excellent. The model is lacking in predicting spam accurately.

Let’s try out the same problem with SVM (Support Vector Machine)

Linear SVC (Support Vector Classifier)

The same steps will be performed as above; the only difference is that we need to import LinearSVC from the scikit-learn library.

We created a model named "text_svm" from a pipeline object that applies TfidfVectorizer and then the LinearSVC model, and fit it with X_train and y_train. "y_preds_svm" contains the predictions on X_test, with an accuracy of approximately 98.69%, which is better than the MultinomialNB model.
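The SVM variant differs from the earlier pipeline only in the final step; a sketch on the same toy data (step names are again illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned training messages and labels
X_train = ["free prize claim now", "see you at lunch",
           "win cash now", "meeting at five"]
y_train = ["spam", "ham", "spam", "ham"]

# Identical structure to text_mnb, with LinearSVC swapped in as the classifier
text_svm = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])
text_svm.fit(X_train, y_train)

print(text_svm.predict(["see you at the meeting"]))
```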

Now let’s also see the evaluation metrics to get a better insight into how the model is performing.

We can see that "ham" is predicted very well, and the "spam" label predictions also improved compared to the MultinomialNB model.

Now that the model has been created, let's try it out on a custom text and see what the model predicts!

We provided the custom text and refined it (removed stop-words and punctuation, and performed lemmatization). We did this above; the only difference is that here a function is defined that takes the custom text and performs the cleanup.

Then we used the SVM model, i.e., "text_svm", whose pipeline contains the TfidfVectorizer and the LinearSVC model. We passed the cleaned text directly to the model, and it predicted the message as "Spam".
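The end-to-end custom-text prediction can be sketched like this. The `clean` helper and the tiny inline stop-word set are illustrative stand-ins (the article uses NLTK's full stop-word list and a lemmatizer), and the training data is again a toy substitute:

```python
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative stop-word set; the article uses NLTK's full list
STOPWORDS = {"a", "the", "you", "your", "to", "now", "at", "u"}

def clean(text):
    # Keep letters only, lowercase, then drop stop-words
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

# Toy stand-ins for the cleaned training messages and labels
X_train = ["free prize claim cash", "see lunch tomorrow",
           "win cash prize", "meeting five"]
y_train = ["spam", "ham", "spam", "ham"]

text_svm = Pipeline([("tfidf", TfidfVectorizer()),
                     ("classifier", LinearSVC())])
text_svm.fit(X_train, y_train)

# Clean the custom text, then let the pipeline vectorize and classify it
custom = "Congrats!! You won a FREE cash prize, claim now!!!"
print(text_svm.predict([clean(custom)]))
```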

Here we come to the end of the post. I hope you understood how to deal with natural language (text) data and how to use machine learning to solve a real-world problem (Gmail uses ML to classify emails as spam or ham).

Happy learning!
