Analytics Vidhya
Published in

Analytics Vidhya

Spam Mail Filtering with DeepLearning4J

Document classification is one of the common use cases in the domain of Natural Language Processing (NLP) and well applied in many applications. This example demonstrate document classification with the use case of spam mail filtering. The results shows that by using Deep Learning, we can strategically filter out most of the spam mails based on the context.


Fig 1. Workflow of Spam Filtering Implementation

The workflow of spam filtering is shown in Fig 1. The workflow started with data cleansing and restructuring to prepare data into a ready format for training.

The original dataset can be retrieved from here. In the uncompressed folder, you can see that the whole dataset is in SMSSpamCollection.txt. In the file, there are labels with the word “ham” and “non-spam”, which correspond to spam and non-spam data.

Figure 2. Distribution of Dataset

The distribution of the dataset is displayed in the chart above. This dataset is unbalanced with more non-spam data points (4827 samples) compared to spam data points (747 samples). In this example, the unbalanced data is modelled with classification model. There is a caveat where the network might adapt to patterns of non-spam mail better due to the volume of the data. The performance may be better improved with a balanced dataset.

The data processing step starts by reading in the original text file — SMSSpamCollection.txt. The text file contains multiple lines of text where each string of text constituted to a single mail content. Each of these are retrieved and saved into an independent text file separately. Figure 3 provides a visualisation of the end results of separation. The dataset is then separated into training and testing dataset (into separate subfolders).

Figure 3. Illustration of a text of an email in a separate txt file.

The program of this example is stored in the Github repository below.

Refer to the directory SpamMailFiltering for this use case example. The codebase will be further explained below for better understanding.

The program is based on open source Java based deep learning framework — DeepLearning4J (DL4J). If you are new to DL4J, you can refer to another article of mine here for an introduction and installation of it.

Before running the main program by executing the file, you will need to download the pretrained embeddings for text. To train neural networks with input data in the forms of texts, the texts has to be converted into embeddings. In this example, pretrained model to convert text to embedding is used. If you want to understand more about Word2Vec, here’s a good link for it.

The example started with loading of pretrained Word2Vec model with Google news corpus. This pretrained model is trained with 3 billion running words, outputting 300-dimension English word vectors. Download it from here and change the WORD_VECTORS_PATH in to point to the path saving the file.

Figure 4. Assign WORD_VECTORS_PATH to local directory path of GoogleNews-vectors-negative300.bin.gz

After setting WORD_VECTORS_PATH, run the code by executing While the neural network is training, open up http://localhost:9000 to virtualise the raining progress.

Figure 5. Visualization of model training.

The program might takes quite some time to run on CPU backend. Loading the large Google news corpus is time consuming. Alternatively, switching to CUDA backend would takes shorter time of execution. Doing it is simple as changing a line in pom.xml. Meanwhile, you can let the program runs, take a break and go grab a cup of coffee ☕.

The description below provides a more detailed walkthrough of the process.

The training and testing data is stored in directories with structure as illustrated below. There are train and test folders, with spam and non-spam sub-directory folders in each.

Figure 6. Data Directory Structure

These data is vectorized through customized SpamMailDataSetIterator as illustrated in Figure 7. This process include reading in each text files, perform tokenization on string of texts with a fixed truncated length, and separating data samples into batches with preferred batch size.

Figure 7. SpamMailDataSetIterator Initialization

Next, Long Short Term Memory (LSTM) model is configured to model the data. LSTM is commonly used to model sequential data due to it’s ability to capture long term dependencies. Check out this link to learn more about LSTM.

Fig 8. LSTM Neural Network Architecture

As shown in Figure 8, the network started with a LSTM layer of 300 units which is the dimension of pretrained word embeddings. Each mail texts will be truncated to the prefixed length if the original length is longer than that. The network continues with an output layer of 2 classes as spam and non-spam labels.

Evaluation on the testing dataset after 1 epoch shows a promising result as illustrated in Figure 6. Among the 150 samples of spam mails, 103 mails are identified correctly as spam while 47 mails are wrongly labeled as false negative.

Fig 9. Evaluation Results.

I also tested the model on a sample spam mail to evaluate the output. With the text of “Congratulations! Call FREEFONE 08006344447 to claim your guaranteed £2000 CASH or £5000 gift. Redeem it now!", which is notably a prize scam, the model resulted in probability of spam on 98%. This shows a confidence that the model able to identify spam and non-spam mail distinctively.

Fig 10. Probabilities Output for A Sample Spam Mail.

I’ll post more articles in the category of Natural Language Processing with source code provided for a practical walkthrough. Stay tuned!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store