Spam Detection à la Blade Runner

It’s 2019. Detective Rick Deckard of the San Francisco Police Department specializes in identifying, pursuing, and eliminating replicants. Replicants are androids indistinguishable from humans in all ways except one: they lack empathy, at levels detectable by a specialized measurement framework called the “Voight-Kampff Empathy Test”. As Deckard sits across from Rachael, he begins administering the test, which measures physiological responses to a battery of behavioral questions such as:

“You’ve got a little boy. He shows you his butterfly collection plus the killing jar. What do you do?”

At the end, the test classifies subjects as either Human or Replicant.

This scene reminds me of what’s happening in every medium of human-to-human communication. Wherever there’s an innovation (the newspaper, the telephone, Facebook, Tinder), you’ll always find one thing: replicants… whoops, I mean spam.

Do we have a Voight-Kampff test today? Well, yes. It’s called Natural Language Processing!

It works slightly differently from the Voight-Kampff test, though. I’ll walk through it here with a simple data set of 5,574 messages (source). The goal is to train our model, which we’ll call “Deckard” for fun, to identify spam (our replicant-generated messages). The dataset has 754 cases of spam and 4,820 cases of human-generated messages.
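Loading a dataset like this can be sketched as follows. The tab-separated label/text format is an assumption (it matches the common UCI SMS Spam Collection layout), and a tiny in-memory sample stands in for the real file:

```python
import io
import pandas as pd

# Assumed format: one message per line, tab-separated label ("ham"/"spam")
# followed by the raw message text. A small in-memory sample for illustration.
raw = (
    "ham\tOk lar... Joking wif u oni\n"
    "spam\tWINNER!! Claim your prize now\n"
    "ham\tI'll call you later\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "text"])

counts = df["label"].value_counts()
print(len(df))                   # total messages
print(counts["spam"] / len(df))  # fraction of spam
```

On the real file you’d point `read_csv` at the path instead of the `StringIO` buffer, and the spam fraction comes out to roughly 13.5%.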

We’ll approach the problem this way:

Split the dataset randomly into training, validation, and testing records. We’ll use the training records to teach Deckard who’s spam and who’s not. Then we’ll let Deckard loose on the validation data; here we’ll identify areas of improvement to fine-tune our model. Finally, when we feel that Deckard’s ready for prime time, we’ll let him loose on the test data, which we’ll hold out as unseen data to truly test his replicant-identification skills.
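A three-way split can be done with two calls to scikit-learn’s `train_test_split`. The 60/20/20 ratio below is an assumption (the post doesn’t state its exact proportions), and stratifying keeps the spam fraction similar in each piece, which matters for an imbalanced set like this one:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data: 100 messages, roughly 1 in 7 "spam".
messages = ["msg %d" % i for i in range(100)]
labels = [i % 7 == 0 for i in range(100)]

# First carve off the held-out test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```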

So, you’re probably asking — how are we going to train Deckard?!

We’ll use a method called “latent semantic indexing” (LSI) to detect “topics” in the text messages, then “linear discriminant analysis” (LDA) to learn how those topics relate to class (spam or not spam).

LSI is a great approach because it tackles the issues of polysemy (different concepts can be expressed by a single word) and synonymy (different words can express a single concept). It does this by looking at how different words are used in similar contexts across documents, and reducing these contexts into what are called “topics”.

Overall the approach plays out this way:

  1. Choose the features we’ll focus on in the training dataset (what words should we include? What should we exclude? Should we include symbols?)
  2. Take each document and break it into individual words. Create a frequency table of every unique word that exists across the different messages. For example, you can imagine the word “the” showing up in many messages, perhaps multiple times each.
  3. Create weights for each word based on importance, giving rare words more weight. This shows each word’s importance as it relates to the documents.
  4. Group these words into K distinct “topics”. Each topic represents a group of words and documents which are distinct from the others.
  5. Build a predictive model that relates the K topics to the outcomes (spam or not spam) using LDA.
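The steps above map onto a short scikit-learn pipeline: TF-IDF covers steps 1–3 (tokenizing, counting, and weighting rare words more heavily), truncated SVD on the TF-IDF matrix is the standard way to implement LSI for step 4, and linear discriminant analysis handles step 5. This is a minimal sketch, not the author’s exact code; the toy corpus is invented, and K is set to 2 here only because six documents can’t support the real model’s 300 topics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

docs = [
    "win a free prize now", "free prize claim now win",
    "see you at lunch", "lunch later see you then",
    "call me tonight", "are you free tonight",
]
labels = [1, 1, 0, 0, 0, 0]  # 1 = spam

# Steps 1-3: choose features, count words, weight rare words more (TF-IDF).
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Step 4: reduce the word space to K latent "topics" (LSI via truncated SVD).
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(X)

# Step 5: model how topic scores relate to the spam/ham outcome.
lda = LinearDiscriminantAnalysis()
lda.fit(topics, labels)
print(lda.predict(topics))
```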

In the first model, we remove common words (“a”, “for”, “the”, …) and remove symbols, then run the model with 300 topics. Not bad, Deckard!

We’re getting 97% accuracy right off the bat. But looking at the other measures, things can improve, especially in the ability to capture all the spam that truly exists, also called “recall”. This means that Deckard is really good at classifying humans and spam overall (97%), but if you were given 100 spam messages, Deckard would only detect 83% of them correctly. This likely has to do with how imbalanced the dataset is (only 13.5% of it is spam!).
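The gap between accuracy and recall is easy to see with a small worked example. The confusion counts below are illustrative (chosen to mirror the 13.5% spam rate and the reported ~97%/83% figures), not Deckard’s actual predictions:

```python
from sklearn.metrics import accuracy_score, recall_score

# 1,000 messages with the dataset's 13.5% spam rate: 135 spam, 865 ham.
y_true = [1] * 135 + [0] * 865
# Suppose the model catches 112 of the 135 spam and raises no false alarms.
y_pred = [1] * 112 + [0] * 23 + [0] * 865

print(round(accuracy_score(y_true, y_pred), 3))  # 0.977 -- looks great
print(round(recall_score(y_true, y_pred), 2))    # 0.83  -- 23 spam slip through
```

Because ham dominates the dataset, a model can miss a sixth of the spam and still post a near-perfect accuracy number.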

Let’s try improving the model further by running different scenarios on the validation data.

We’ll run all the different combinations of models based on the following list of feature selections:

  • Custom stop words list, common English stop words list, or no stop words list
  • Remove all symbols, or include all symbols
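Crossing those two choices gives six scenarios to score on the validation data. A sketch of the enumeration (the option names are placeholders, not the author’s exact labels):

```python
from itertools import product

# Hypothetical labels for the two feature-selection axes.
stop_word_options = ["custom", "english", "none"]
symbol_options = ["remove", "keep"]

scenarios = list(product(stop_word_options, symbol_options))
for sw, sym in scenarios:
    print(f"stop words: {sw:8s} symbols: {sym}")
print(len(scenarios))  # 6 model runs to compare
```

In practice each scenario would feed a different preprocessing configuration into the same LSI + LDA pipeline, and the validation metrics decide the winner.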

Results:

It seems that the second option is the best: using our custom stop words list and removing symbols.

Let’s let Deckard loose and see how he performs on the unseen test data.

We’re able to squeeze a little bit more performance out of our model.

This is a decent first-line defense against spam and it’s important to continue to improve this model as well as use additional strategies. Both replicants and spammers are continually improving their methods and even the best “Voight Kampff Empathy Test” may not be able to detect the sneakiest replicants!

Check out the code I use to build this, located on my github!

https://github.com/chrisgian/Capstone1-Spam-Detection-NLP
