Spam-Ham Classification Using LSTM in PyTorch
This guide shows how to build and train an LSTM model in PyTorch and use it to classify emails as spam or ham.
The GitHub repo for this guide is here; you can find the Jupyter notebook in the repo. My recommendation is to download the notebook, follow along with this walkthrough, and play around.
I. Enron Spam Datasets
Researchers V. Metsis, I. Androutsopoulos, and G. Paliouras classified over 30,000 emails from the Enron corpus into spam/ham datasets and made them publicly available.
- Go to the website
- Find “Enron-Spam in pre-processed form” on the site
- Download Enron1, Enron2, Enron3, Enron4, Enron5, and Enron6
- Extract each tar.gz file
- The resulting directories (enron1, enron2, …, enron6) should sit in the same directory as the Jupyter notebook
II. Processing data
The data will be
- loaded from files
- used to build a vocabulary dictionary
- tokenized and vectorized
Let’s dive into each step
II-1. Load data from files
You need to download file_reader.py into the same folder. I will briefly introduce the code in file_reader.py that I wrote.
First, the spam and ham sets are loaded into spam and ham, respectively. Second, ham and spam are merged into data. Third, labels are generated: 1 for spam and 0 for ham. Lastly, the function returns the data and labels:
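The snippet below is only a rough sketch of that logic, not the exact code in the repo; the helper name read_files, the directory layout, and the max_per_class parameter (which plays the role of the author's max argument) are assumptions of mine.

```python
import os, glob

def read_files(root_dirs=('enron1', 'enron2', 'enron3', 'enron4', 'enron5', 'enron6'),
               max_per_class=3000):
    # Load spam and ham emails from the pre-processed Enron directories
    spam, ham = [], []
    for root in root_dirs:
        for label, bucket in (('spam', spam), ('ham', ham)):
            for path in glob.glob(os.path.join(root, label, '*.txt')):
                if max_per_class and len(bucket) >= max_per_class:
                    break
                with open(path, encoding='latin-1') as f:
                    bucket.append(f.read())
    data = ham + spam                            # merge ham and spam
    labels = [0] * len(ham) + [1] * len(spam)    # ham -> 0, spam -> 1
    return data, labels
```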
The loaded data consists of 3,000 hams and 3,000 spams, 6,000 in total.
If you set max = 0, you can load all of the data from the files, but for this tutorial 6,000 samples are enough.
II-2. Build Vocab dictionary
The vocabulary dictionary maps words (keys) to integers (values), e.g. {‘the’: 2, ‘to’: 3}
How do we choose the integer for each word?
Imagine the list of words drawn from the 6,000 emails. Common words like “the”, “to”, and “and” are likely to appear many times. We count the number of occurrences of each word and order the words by their counts, so the most frequent words get the smallest indices.
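As a minimal sketch (not the notebook's exact code), the dictionary can be built with collections.Counter; I am assuming here that index 0 is reserved for padding and index 1 for unknown words, which would explain why the example above starts at 2.

```python
from collections import Counter

def build_vocab(data):
    # Count how often each word occurs across all emails
    counts = Counter(word for email in data for word in email.split())
    # Most frequent words get the smallest indices;
    # 0 is reserved for padding and (in this sketch) 1 for unknown words
    vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=2)}
    return vocab
```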
II-3. Tokenize & Vectorize data
Let’s go through an example first.
Tokenization here means converting the data into lists of words. For example, let’s say we have data like below:
"operations is digging out 2000 feet of pipe to begin the hydro test"
Tokenization will produce a list of words like below
['operations', 'is', 'digging', ...
Vectorization here means converting the words into integer numbers using the vocab dictionary built in II-2:
[424, 11, 14683, ...
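A minimal sketch of these two steps, assuming simple whitespace tokenization and the vocab dictionary from the sketch in II-2 (the notebook's implementation may differ):

```python
def tokenize(text):
    # Split an email into a list of words
    return text.lower().split()

def vectorize(tokens, vocab):
    # Map each word to its integer index; 1 is used here for unknown words
    return [vocab.get(word, 1) for word in tokens]

tokens = tokenize("operations is digging out 2000 feet of pipe to begin the hydro test")
vector = vectorize(tokens, vocab)   # e.g. [424, 11, 14683, ...]
```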
Now we can proceed with our datasets
III. Build Data loaders
So far, the data has been processed into vectorized form. Now it is time to build data loaders that will feed batches of data into our model. To do so,
- A custom data loader class is needed
- Three data loaders are needed: for train, validation, and test
III-1. Custom Data Loader
A sequence here means a vectorized list of words in an email.
Since we prepared 6,000 emails, we have 6,000 sequences.
Because the sequences have different lengths, the length of each sequence must be passed into our model so that the model is not trained on dummy values (the 0s used for padding).
Consequently, we need custom data loaders that return the length of each sequence along with the sequences and labels.
In addition, the data loader should sort each batch by sequence length and return the longest sequence first, so that torch’s pack_padded_sequence() can be used (you will see this function later).
I built the iterable data loader class using torch’s sampler.
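The actual class is in the notebook; as a rough, functionally similar sketch, the same behaviour can be obtained with a standard Dataset plus a custom collate function that pads, sorts, and returns lengths (the names EmailDataset and collate are mine, not the repo's).

```python
import torch
from torch.utils.data import Dataset

class EmailDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def collate(batch):
    # Sort the batch by sequence length, longest first (needed for pack_padded_sequence)
    batch.sort(key=lambda pair: len(pair[0]), reverse=True)
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Pad every sequence with 0s up to the length of the longest one in the batch
    padded = torch.zeros(len(sequences), lengths.max().item(), dtype=torch.long)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = torch.tensor(seq)
    return padded, torch.tensor(labels), lengths
```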
III-2. Instantiate 3 Data Loaders
The model will be trained on the training set, validated on the validation set, and finally tested on the test set:
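A sketch of how the three loaders could be instantiated, reusing the EmailDataset and collate helpers from the sketch above; the split ratios and batch size are illustrative assumptions, and vectorized_data / labels stand for the processed output of section II.

```python
from torch.utils.data import DataLoader, random_split

dataset = EmailDataset(vectorized_data, labels)   # placeholders for the data from section II
n_train = int(0.8 * len(dataset))
n_valid = int(0.1 * len(dataset))
n_test = len(dataset) - n_train - n_valid
train_set, valid_set, test_set = random_split(dataset, [n_train, n_valid, n_test])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False, collate_fn=collate)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False, collate_fn=collate)
```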
IV. Structure the model
The model consists of the following stages (a minimal sketch of the whole module follows this list):
- Embedding
- Pack the sequences (remove the padding)
- LSTM
- Unpack the sequences (restore the padding)
- Fully Connected Layer
- Sigmoid Activation
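Here is a minimal sketch of such a module; the layer sizes, the class name SpamHamLSTM, and the use of the last hidden state for classification are my assumptions, not necessarily the notebook's exact architecture.

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class SpamHamLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, lengths):
        embedded = self.embedding(x)                       # (batch, seq_len, embed_dim)
        packed = pack_padded_sequence(embedded, lengths, batch_first=True)
        packed_out, (h_n, c_n) = self.lstm(packed)         # (h_0, c_0) default to zeros
        unpacked, _ = pad_packed_sequence(packed_out, batch_first=True)  # padding restored
        # Classify from the final hidden state of the (single) LSTM layer
        return self.sigmoid(self.fc(h_n[-1]))
```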
IV-1. Embedding
According to PyTorch.org’s documentation, “word embeddings are a representation of the semantics of a word.”
To learn more about word embeddings, I would recommend reading the PyTorch documentation.
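As a toy illustration of mine (not from the notebook), nn.Embedding simply maps integer word indices to dense vectors:

```python
import torch
import torch.nn as nn

# A toy embedding layer: a 10-word vocab, each word mapped to a 3-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
words = torch.tensor([2, 3, 0])        # e.g. 'the', 'to', and a padding index
print(embedding(words).shape)          # torch.Size([3, 3])
```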
IV-2. Use of pack_padded_sequence()
Recall that we added padding (0s) to the sequences. Because sequences have different lengths, shorter sequences must be padded so that all sequences in a tensor share the same dimensions. The problem is that the model should not be trained on those padding values. pack_padded_sequence() removes the padding from a batch of data and reorganizes it.
For example,
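the toy snippet below (an illustration of mine, not the notebook's code) shows that packing removes the padding zeros and records how many sequences are still active at each time step:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two sequences padded to length 4 (0 = padding), sorted longest first
batch = torch.tensor([[1, 2, 3, 4],
                      [5, 6, 0, 0]])
lengths = torch.tensor([4, 2])
packed = pack_padded_sequence(batch, lengths, batch_first=True)
print(packed.data)          # tensor([1, 5, 2, 6, 3, 4]) -- the padding 0s are gone
print(packed.batch_sizes)   # tensor([2, 2, 1, 1])
```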
To understand pack_padded_sequence() in more depth, I would recommend reading layog's Stack Overflow post and HarshTrivedi's tutorial.
IV-3. LSTM
LSTM stands for “Long Short-Term Memory”, a kind of RNN architecture. Note that if (h_0, c_0) is not provided, both h_0 and c_0 default to zeros, according to the PyTorch documentation.
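A toy call of mine (not the notebook's code) showing an LSTM run without an explicit (h_0, c_0):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
x = torch.randn(2, 4, 3)        # batch of 2 sequences, 4 time steps, 3 features each
output, (h_n, c_n) = lstm(x)    # no (h_0, c_0) given -> both start as zeros
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([2, 4, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 5])
```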
To learn more about LSTMs, I would recommend reading colah's blog.
V. Train, Validate and Test
V-1. Train and Validate
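The full loop is in the notebook; a minimal sketch might look like the following, where the choice of BCELoss, Adam, the learning rate, and the number of epochs are assumptions of mine, and the model, vocab, and loaders come from the sketches above.

```python
import torch
import torch.nn as nn

model = SpamHamLSTM(vocab_size=len(vocab) + 2)   # +2 for the reserved padding/unknown indices
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(4):
    model.train()
    for seqs, labels, lengths in train_loader:
        optimizer.zero_grad()
        preds = model(seqs, lengths).squeeze(1)
        loss = criterion(preds, labels.float())
        loss.backward()
        optimizer.step()

    # Validation: no gradient updates, just measure accuracy
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for seqs, labels, lengths in valid_loader:
            preds = model(seqs, lengths).squeeze(1)
            correct += ((preds > 0.5).long() == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```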
V-2. Test
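Testing mirrors the validation pass, this time on test_loader (again only a sketch, reusing the pieces above):

```python
# Evaluate on the held-out test set
model.eval()
correct = total = 0
with torch.no_grad():
    for seqs, labels, lengths in test_loader:
        preds = model(seqs, lengths).squeeze(1)
        correct += ((preds > 0.5).long() == labels).sum().item()
        total += labels.size(0)
print(f"test accuracy: {correct / total:.3f}")
```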
VI. Predict
This is part of an advertisement for English lessons that I received recently. It seems a bit tricky to classify as spam, doesn’t it?
Have you been really busy this week? Then you'll definitely want to make time for this lesson. Have a wonderful week, learn something new, and practice some English!
Let’s feed this into the model and see if the result is “Spam”:
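A sketch of the prediction step, reusing the tokenize/vectorize helpers and the model sketched earlier (the notebook's own prediction code may differ):

```python
text = ("Have you been really busy this week? Then you'll definitely want to make "
        "time for this lesson. Have a wonderful week, learn something new, and "
        "practice some English!")

# Tokenize and vectorize with the helpers from II-3, then run the model
seq = torch.tensor([vectorize(tokenize(text), vocab)])
length = torch.tensor([seq.size(1)])

model.eval()
with torch.no_grad():
    prob = model(seq, length).item()
print("Spam" if prob > 0.5 else "Ham", prob)
```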
The model counts it as Spam!
Thank you for reading!
I never expected to write a guide, since I still see myself as a beginner in deep learning. If you find something wrong, please email me or leave a comment; it would be appreciated.
Email: shijoonlee@gmail.com
Github: github.com/sijoonlee