Spam Mail Filtering with DeepLearning4J
Document classification is a common use case in Natural Language Processing (NLP) and is used in many applications. This example demonstrates document classification through the use case of spam mail filtering. The results show that a deep learning model can filter out most spam mails based on context.
Implementation
Workflow
The workflow of spam filtering is shown in Fig 1. It starts with data cleansing and restructuring to prepare the data in a format ready for training.
The original dataset can be retrieved from here. In the uncompressed folder, the whole dataset is in SMSSpamCollection.txt. Each entry in the file is labeled with the word “spam” or “ham”, which correspond to spam and non-spam data respectively.
The distribution of the dataset is displayed in the chart above. The dataset is unbalanced, with more non-spam data points (4827 samples) than spam data points (747 samples). In this example, the unbalanced data is modelled directly with a classification model. The caveat is that the network might adapt better to the patterns of non-spam mail because there is more of it. Performance may improve with a balanced dataset.
The data processing step starts by reading in the original text file, SMSSpamCollection.txt. The file contains multiple lines of text, where each line holds the content of a single mail. Each line is retrieved and saved into its own text file. Figure 3 provides a visualisation of the end result of this separation. The dataset is then split into training and testing sets, stored in separate subfolders.
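To put the imbalance into numbers, the short sketch below computes the share of each class from the counts quoted above. The class and method names are hypothetical and are not taken from the repository. It also prints the accuracy of a trivial classifier that always answers "non-spam", which is a useful reference point when reading accuracy figures on this dataset.

```java
final class ClassBalance {
    /** Fraction of samples belonging to a class, given its count and the total. */
    static double fraction(int classCount, int total) {
        return (double) classCount / total;
    }

    public static void main(String[] args) {
        int ham = 4827;  // non-spam samples, as quoted in the article
        int spam = 747;  // spam samples, as quoted in the article
        int total = ham + spam;

        // A classifier that always predicts the larger class reaches this accuracy
        // without learning anything, so plain accuracy should be read against it.
        double majorityBaseline = fraction(Math.max(ham, spam), total);

        System.out.printf("total=%d, spam share=%.1f%%, majority baseline=%.1f%%%n",
                total, 100 * fraction(spam, total), 100 * majorityBaseline);
    }
}
```

With these counts, spam makes up roughly 13.4% of the 5574 samples, so the always-non-spam baseline sits at roughly 86.6% accuracy.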
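The separation step can be sketched in plain Java as follows. This is a minimal illustration, not the repository's code: it assumes each line has the form label, tab, message, and the class name, the output layout (`train|test/<label>/<index>.txt`) and the 0.8 training fraction are all hypothetical choices made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class SmsSplitter {
    /** Splits one raw line into {label, message}; assumes the form "label<TAB>message". */
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in line: " + line);
        }
        return new String[] { line.substring(0, tab).trim(), line.substring(tab + 1).trim() };
    }

    /** Writes each message to outputRoot/{train|test}/{label}/{index}.txt. */
    static void split(List<String> lines, Path outputRoot, double trainFraction) throws IOException {
        // The first trainCount lines go to "train", the remainder to "test".
        int trainCount = (int) Math.round(lines.size() * trainFraction);
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = parseLine(lines.get(i));
            String subset = i < trainCount ? "train" : "test";
            Path dir = outputRoot.resolve(subset).resolve(parts[0]);
            Files.createDirectories(dir);
            Files.write(dir.resolve(i + ".txt"), parts[1].getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in lines; in practice they would be read from SMSSpamCollection.txt,
        // for example with Files.readAllLines.
        List<String> lines = Arrays.asList(
                "ham\tSee you at lunch",
                "spam\tWin a free prize, call now",
                "ham\tRunning late, sorry",
                "ham\tOk",
                "spam\tClaim your cash reward");
        Path outputRoot = Files.createTempDirectory("spam-mail-data");
        split(lines, outputRoot, 0.8);
        System.out.println("Wrote files under " + outputRoot);
    }
}
```

The sketch splits the lines in file order to stay short. Shuffling them first (with a fixed seed for reproducibility) would usually give a more representative test set.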
Get the Codebase
The program for this example is stored in the GitHub repository below.
https://github.com/codenamewei/nlp-with-use-case
Refer to the directory SpamMailFiltering for this use case example. The codebase is explained further below.
The program is based on DeepLearning4J (DL4J), an open source Java-based deep learning framework. If you are new to DL4J, you can refer to another article of mine here for an introduction and installation guide.
Loading of Pretrained Word2Vec Model
Before running the main program by executing the file SpamMailFiltering.java, you will need to download the pretrained embeddings for text. To train neural networks on text input, the text has to be converted into embeddings. In this example, a pretrained model is used to convert text to embeddings. If you want to understand more about Word2Vec, here’s a good link for it.
The example starts by loading a pretrained Word2Vec model trained on the Google News corpus. This model was trained on 3 billion running words and outputs 300-dimension English word vectors. Download it from here and change WORD_VECTORS_PATH in SpamMailFiltering.java to point to the location of the downloaded file.
To Run the Code
After setting WORD_VECTORS_PATH, run the code by executing SpamMailFiltering.java. While the neural network is training, open http://localhost:9000 to visualise the training progress.
The program might take quite some time to run on the CPU backend, and loading the large Google News model is time consuming. Alternatively, switching to the CUDA backend shortens the execution time. Doing so is as simple as changing a line in pom.xml. Meanwhile, you can let the program run, take a break and go grab a cup of coffee ☕.
The description below provides a more detailed walkthrough of the process.
Data Vectorization
The training and testing data are stored in directories with the structure illustrated below. There are train and test folders, each with spam and non-spam subdirectories.
The data is vectorized through a customized SpamMailDataSetIterator, as illustrated in Figure 7. This process includes reading in each text file, tokenizing the text and truncating it to a fixed length, and separating the data samples into batches of the preferred batch size.
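The tokenize, truncate and batch steps can be illustrated without any DL4J dependency. The sketch below is a simplified stand-in for what an iterator such as SpamMailDataSetIterator has to do, not its actual implementation: the class name is hypothetical, tokens are split on whitespace only, and the lookup of each token's word vector is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TextBatcher {
    /** Splits the text on whitespace and keeps at most maxLength tokens. */
    static List<String> tokenizeAndTruncate(String text, int maxLength) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        List<String> tokens = Arrays.asList(trimmed.split("\\s+"));
        // Texts longer than maxLength are cut off; shorter ones are kept whole.
        return new ArrayList<>(tokens.subList(0, Math.min(tokens.size(), maxLength)));
    }

    /** Groups samples into consecutive batches; the last batch may be smaller. */
    static <T> List<List<T>> toBatches(List<T> samples, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < samples.size(); start += batchSize) {
            int end = Math.min(start + batchSize, samples.size());
            batches.add(new ArrayList<>(samples.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<String>> samples = new ArrayList<>();
        samples.add(tokenizeAndTruncate("Call now to claim your guaranteed prize", 4));
        samples.add(tokenizeAndTruncate("See you at lunch", 4));
        samples.add(tokenizeAndTruncate("Ok", 4));
        System.out.println(toBatches(samples, 2));
    }
}
```

In the real pipeline, each kept token would then be replaced by its 300-dimension Word2Vec vector, and shorter sequences in a batch would typically be padded and masked so that all sequences in the batch share one length.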
Network Architecture
Next, a Long Short-Term Memory (LSTM) model is configured to model the data. LSTM is commonly used to model sequential data due to its ability to capture long-term dependencies. Check out this link to learn more about LSTM.
As shown in Figure 8, the network starts with an LSTM layer of 300 units, which is the dimension of the pretrained word embeddings. Each mail text is truncated to the predefined length if it is longer than that. The network ends with an output layer of 2 classes for the spam and non-spam labels.
Evaluation Results
Evaluation on the testing dataset after 1 epoch shows a promising result, as illustrated in Figure 6. Among the 150 spam samples, 103 are correctly identified as spam while 47 are wrongly labeled as non-spam (false negatives).
I also tested the model on a sample spam mail to evaluate the output. For the text “Congratulations! Call FREEFONE 08006344447 to claim your guaranteed £2000 CASH or £5000 gift. Redeem it now!”, which is clearly a prize scam, the model output a spam probability of 98%. This suggests that the model is able to distinguish spam from non-spam mail.
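The spam counts above translate directly into recall for the spam class, that is, the share of actual spam mails the model caught. The helper below is a hypothetical illustration and is not taken from the repository. Precision is not computed because the number of non-spam mails wrongly flagged as spam is not quoted here.

```java
final class SpamRecall {
    /** Recall = true positives / (true positives + false negatives). */
    static double recall(int truePositives, int falseNegatives) {
        int actualPositives = truePositives + falseNegatives;
        if (actualPositives == 0) {
            throw new IllegalArgumentException("No positive samples to evaluate");
        }
        return (double) truePositives / actualPositives;
    }

    public static void main(String[] args) {
        int truePositives = 103;  // spam mails correctly identified as spam
        int falseNegatives = 47;  // spam mails wrongly labeled as non-spam
        System.out.printf("spam recall = %.1f%%%n", 100 * recall(truePositives, falseNegatives));
    }
}
```

With 103 of 150 spam mails caught, the spam recall after one epoch is roughly 68.7%.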
What’s Next
I’ll post more articles in the category of Natural Language Processing with source code provided for a practical walkthrough. Stay tuned!