SMS Spam Part I | NLP Series 2
We all know that the internet and social media have become the quickest and most straightforward ways to get information, and text messages are now a significant channel in their own right. Short Message Service (SMS) remains one of the most widely used means of communication. As dependence on mobile devices has grown, so has the volume of spam and scam messages delivered over SMS. Thanks to advances in artificial intelligence, we can now extract meaningful information from such data.
The main aim of this article is to understand how to build an SMS spam detection model.
Implementation
Now let’s implement SMS spam classification using a dataset provided by the UCI Machine Learning repository. The SMS Spam Collection is a set of 5,574 SMS messages in English, collected for SMS spam research and tagged as either ham (legitimate) or spam. You can read more about the dataset here: https://www.kaggle.com/uciml/sms-spam-collection-dataset.
First, we’ll load the required libraries:
- NumPy
- pandas
- Regular expressions (re)
- Natural Language Toolkit (nltk)
- Stopwords from nltk
- PorterStemmer from nltk, for stemming
- Bag-of-Words (CountVectorizer)
- train_test_split
- Naive Bayes
- Confusion matrix
- Accuracy score
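The list above could translate into an import block like this (a sketch; the nltk stopwords corpus needs a one-time download before use):

```python
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer  # Bag-of-Words
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# One-time download of the stopword list:
# nltk.download('stopwords')
```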
We’ll load our dataset now and have a look at it. There are two columns: “label” and the message text on which the processing will happen. We will build a binary classification model to detect whether a text message is spam or not.
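Loading might look like this. The column names in the commented line (`v1`, `v2`) are an assumption about the Kaggle CSV; the small inline DataFrame is a stand-in sample so the snippet runs without the download:

```python
import pandas as pd

# With the downloaded Kaggle file (column names assumed):
# df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
# df.columns = ['label', 'message']

# Tiny stand-in sample mimicking the two-column structure:
df = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'message': ['Are we still meeting today?',
                'WINNER!! Claim your free prize now, call 09061701461!',
                'Ok, see you at 5.'],
})
print(df.head())
```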
In this blog, I’ll only cover Stemming; in the second part, we’ll look at Lemmatization and compare our results. Now we will apply regular expressions to clean our data. After cleaning, our data will look something like this.
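A minimal cleaning-and-stemming loop could look like the sketch below. For illustration it uses a tiny hardcoded stopword set and two sample messages; in practice you would use `nltk.corpus.stopwords.words('english')` and loop over the dataset’s message column:

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Tiny illustrative stopword set; the real run would use nltk's list
# (which requires nltk.download('stopwords')).
stop_words = {'a', 'the', 'is', 'are', 'we', 'your', 'at', 'now'}

messages = ['WINNER!! Claim your free prize now, call 09061701461!',
            'Are we still meeting today?']

corpus = []
for msg in messages:
    text = re.sub('[^a-zA-Z]', ' ', msg)   # keep letters only
    words = text.lower().split()           # lowercase and tokenize
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    corpus.append(' '.join(words))

print(corpus)
```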
After Stemming, we’ll convert our data into vectors using the Bag-of-Words model. Finally, we have to convert our categorical variable (“label”) into numerical values to feed it to our model.
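These two conversions can be sketched as follows; the cleaned corpus and the `max_features` cap are illustrative stand-ins:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['winner claim free prize call', 'still meet today', 'free prize call']
labels = pd.Series(['spam', 'ham', 'spam'])

# Bag-of-Words: each message becomes a vector of word counts.
cv = CountVectorizer(max_features=2500)  # vocabulary cap (illustrative choice)
X = cv.fit_transform(corpus).toarray()

# Encode the categorical label: spam -> 1, ham -> 0.
y = pd.get_dummies(labels, drop_first=True).values.ravel().astype(int)
print(X.shape, y)
```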
Now our dataset is ready to feed to the model. First, we’ll split it using the train_test_split function. In this blog, we’ll fit our data using the Naive Bayes algorithm only.
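The split-and-fit step could look like this sketch. Random integer counts stand in for the real `X` and `y` from the Bag-of-Words step so the snippet runs end-to-end; the 80/20 split ratio is an assumed choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stand-in count features and binary labels (replace with the real X, y).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 20))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Multinomial Naive Bayes suits non-negative count features like Bag-of-Words.
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred[:5])
```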
After fitting our data, let’s check the accuracy our model is providing.
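Evaluation boils down to comparing predictions against the held-out labels. Toy vectors stand in for `y_test` and `model.predict(X_test)` here:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(cm)
print(acc)  # 5 of 6 correct
```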
Our model gives 98% accuracy, which is outstanding for such a simple pipeline.
In the next blog, we’ll try Lemmatization and compare accuracy across different classification algorithms. So stay tuned for Part II of this blog.