Spam-Ham Classification Using LSTM in PyTorch
This guide shows how to build and train an LSTM model in PyTorch and use it to classify emails as spam or ham.
The GitHub repo for this guide is here; you can find the Jupyter notebook in the repo. My recommendation is to download the notebook, follow along with this walkthrough, and play around.
I. Enron Spam Datasets
Researchers V. Metsis, I. Androutsopoulos, and G. Paliouras classified over 30,000 emails from the Enron corpus into spam/ham datasets and made them publicly available.
- Go to the website
- Find “Enron-Spam in pre-processed form” on the site
- Download Enron1, Enron2, Enron3, Enron4, Enron5, and Enron6
- Extract each tar.gz file
- The resulting directories (enron1, enron2, …, enron6) should sit in the same directory as the Jupyter notebook
II. Processing data
The data will be
- loaded from files
- used to build a vocabulary dictionary
- tokenized and vectorized
Let’s dive into each step
II-1. Load data from files
You need to download file_reader.py into the same folder. I will briefly introduce the code in file_reader.py that I wrote.
First, the spam and ham sets are loaded into spam and ham, respectively. Second, ham and spam are merged into data. Third, labels are generated: 1 for spam and 0 for ham. Lastly, the function returns the data and labels:
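The snippet below is only a rough sketch of that logic, not the exact code in the repo; the helper name read_files, the directory layout, and the max_per_class parameter (which plays the role of the author's max argument) are assumptions of mine.

```python
import os, glob

def read_files(root_dirs=('enron1', 'enron2', 'enron3', 'enron4', 'enron5', 'enron6'),
               max_per_class=3000):
    # Load spam and ham emails from the pre-processed Enron directories
    spam, ham = [], []
    for root in root_dirs:
        for label, bucket in (('spam', spam), ('ham', ham)):
            for path in glob.glob(os.path.join(root, label, '*.txt')):
                if max_per_class and len(bucket) >= max_per_class:
                    break
                with open(path, encoding='latin-1') as f:
                    bucket.append(f.read())
    data = ham + spam                            # merge ham and spam
    labels = [0] * len(ham) + [1] * len(spam)    # ham -> 0, spam -> 1
    return data, labels
```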
The loaded data consists of 3,000 hams and 3,000 spams, 6,000 in total.
If you set max = 0, you can load all of the data from the files, but for this tutorial 6,000 samples are enough.
II-2. Build Vocab dictionary
The vocabulary dictionary maps words (keys) to integers (values), e.g. {‘the’: 2, ‘to’: 3}
How do we choose the integer for each word?
Imagine the list of words drawn from the 6,000 emails. Common words like “the”, “to”, and “and” are likely to appear many times. We count the number of occurrences of each word and order the words by their counts, so the most frequent words get the smallest indices.
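As a minimal sketch (not the notebook's exact code), the dictionary can be built with collections.Counter; I am assuming here that index 0 is reserved for padding and index 1 for unknown words, which would explain why the example above starts at 2.

```python
from collections import Counter

def build_vocab(data):
    # Count how often each word occurs across all emails
    counts = Counter(word for email in data for word in email.split())
    # Most frequent words get the smallest indices;
    # 0 is reserved for padding and (in this sketch) 1 for unknown words
    vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=2)}
    return vocab
```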
II-3. Tokenize & Vectorize data
Let’s go through an example first.
Tokenization here means converting the data into lists of words. For example, let’s say we have data like below:
"operations is digging out 2000 feet of pipe to begin the hydro test"
Tokenization will produce a list of words like below
['operations', 'is', 'digging', ...
Vectorization here means converting the words into integer numbers using the vocab dictionary built in II-2:
[424, 11, 14683, ...
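A minimal sketch of these two steps, assuming simple whitespace tokenization and the vocab dictionary from the sketch in II-2 (the notebook's implementation may differ):

```python
def tokenize(text):
    # Split an email into a list of words
    return text.lower().split()

def vectorize(tokens, vocab):
    # Map each word to its integer index; 1 is used here for unknown words
    return [vocab.get(word, 1) for word in tokens]

tokens = tokenize("operations is digging out 2000 feet of pipe to begin the hydro test")
vector = vectorize(tokens, vocab)   # e.g. [424, 11, 14683, ...]
```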
Now we can proceed with our datasets
III. Build Data loaders
So far, the data has been processed into vectorized form. Now it is time to build data loaders that will feed batches of data into our model. To do so,
- A custom data loader class is needed
- Three data loaders are needed: for train, validation, and test
III-1. Custom Data Loader
A sequence here means a vectorized list of words in an email.
Since we prepared 6,000 emails, we have 6,000 sequences.
Because the sequences have different lengths, the length of each sequence must be passed into our model so that the model is not trained on dummy values (the 0s used for padding).
Consequently, we need custom data loaders that return the length of each sequence along with the sequences and labels.
In addition, the data loader should sort each batch by sequence length and return the longest sequence first, so that torch’s pack_padded_sequence() can be used (you will see this function later).
I built the iterable data loader class using torch’s sampler.
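The actual class is in the notebook; as a rough, functionally similar sketch, the same behaviour can be obtained with a standard Dataset plus a custom collate function that pads, sorts, and returns lengths (the names EmailDataset and collate are mine, not the repo's).

```python
import torch
from torch.utils.data import Dataset

class EmailDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def collate(batch):
    # Sort the batch by sequence length, longest first (needed for pack_padded_sequence)
    batch.sort(key=lambda pair: len(pair[0]), reverse=True)
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Pad every sequence with 0s up to the length of the longest one in the batch
    padded = torch.zeros(len(sequences), lengths.max().item(), dtype=torch.long)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = torch.tensor(seq)
    return padded, torch.tensor(labels), lengths
```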
III-2. Instantiate 3 Data Loaders
The model will be trained on the training set, validated on the validation set, and finally tested on the test set:
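A sketch of how the three loaders could be instantiated, reusing the EmailDataset and collate helpers from the sketch above; the split ratios and batch size are illustrative assumptions, and vectorized_data / labels stand for the processed output of section II.

```python
from torch.utils.data import DataLoader, random_split

dataset = EmailDataset(vectorized_data, labels)   # placeholders for the data from section II
n_train = int(0.8 * len(dataset))
n_valid = int(0.1 * len(dataset))
n_test = len(dataset) - n_train - n_valid
train_set, valid_set, test_set = random_split(dataset, [n_train, n_valid, n_test])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False, collate_fn=collate)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False, collate_fn=collate)
```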
IV. Structure the model
The model consists of the following stages (a minimal sketch of the whole module follows this list):
- Embedding
- Pack the sequences (remove the padding)
- LSTM
- Unpack the sequences (restore the padding)
- Fully Connected Layer
- Sigmoid Activation
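Here is a minimal sketch of such a module; the layer sizes, the class name SpamHamLSTM, and the use of the last hidden state for classification are my assumptions, not necessarily the notebook's exact architecture.

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class SpamHamLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, lengths):
        embedded = self.embedding(x)                       # (batch, seq_len, embed_dim)
        packed = pack_padded_sequence(embedded, lengths, batch_first=True)
        packed_out, (h_n, c_n) = self.lstm(packed)         # (h_0, c_0) default to zeros
        unpacked, _ = pad_packed_sequence(packed_out, batch_first=True)  # padding restored
        # Classify from the final hidden state of the (single) LSTM layer
        return self.sigmoid(self.fc(h_n[-1]))
```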
IV-1. Embedding
According to PyTorch.org’s documentation, “word embeddings are a representation of the semantics of a word.”
To learn more about word embeddings, I would recommend reading the PyTorch documentation.
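As a toy illustration of mine (not from the notebook), nn.Embedding simply maps integer word indices to dense vectors:

```python
import torch
import torch.nn as nn

# A toy embedding layer: a 10-word vocab, each word mapped to a 3-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
words = torch.tensor([2, 3, 0])        # e.g. 'the', 'to', and a padding index
print(embedding(words).shape)          # torch.Size([3, 3])
```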
IV-2. Use of pack_padded_sequence()
Recall that we added padding (0s) to the sequences. Because sequences have different lengths, shorter sequences must be padded so that all sequences in a tensor share the same dimensions. The problem is that the model should not be trained on those padding values. pack_padded_sequence() removes the padding from a batch of data and reorganizes it.
For example,
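the toy snippet below (an illustration of mine, not the notebook's code) shows that packing removes the padding zeros and records how many sequences are still active at each time step:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two sequences padded to length 4 (0 = padding), sorted longest first
batch = torch.tensor([[1, 2, 3, 4],
                      [5, 6, 0, 0]])
lengths = torch.tensor([4, 2])
packed = pack_padded_sequence(batch, lengths, batch_first=True)
print(packed.data)          # tensor([1, 5, 2, 6, 3, 4]) -- the padding 0s are gone
print(packed.batch_sizes)   # tensor([2, 2, 1, 1])
```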
To understand pack_padded_sequence() in more depth, I would recommend reading layog's Stack Overflow post and HarshTrivedi's tutorial.
IV-3. LSTM
LSTM stands for “Long Short-Term Memory”, a kind of RNN architecture. Note that if (h_0, c_0) is not provided, both h_0 and c_0 default to zeros, according to the PyTorch documentation.
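A toy call of mine (not the notebook's code) showing an LSTM run without an explicit (h_0, c_0):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
x = torch.randn(2, 4, 3)        # batch of 2 sequences, 4 time steps, 3 features each
output, (h_n, c_n) = lstm(x)    # no (h_0, c_0) given -> both start as zeros
print(output.shape, h_n.shape, c_n.shape)
# torch.Size([2, 4, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 5])
```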
To learn more about LSTMs, I would recommend reading colah's blog.
V. Train, Validate and Test
V-1. Train and Validate
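The full loop is in the notebook; a minimal sketch might look like the following, where the choice of BCELoss, Adam, the learning rate, and the number of epochs are assumptions of mine, and the model, vocab, and loaders come from the sketches above.

```python
import torch
import torch.nn as nn

model = SpamHamLSTM(vocab_size=len(vocab) + 2)   # +2 for the reserved padding/unknown indices
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(4):
    model.train()
    for seqs, labels, lengths in train_loader:
        optimizer.zero_grad()
        preds = model(seqs, lengths).squeeze(1)
        loss = criterion(preds, labels.float())
        loss.backward()
        optimizer.step()

    # Validation: no gradient updates, just measure accuracy
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for seqs, labels, lengths in valid_loader:
            preds = model(seqs, lengths).squeeze(1)
            correct += ((preds > 0.5).long() == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```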
V-2. Test
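Testing mirrors the validation pass, this time on test_loader (again only a sketch, reusing the pieces above):

```python
# Evaluate on the held-out test set
model.eval()
correct = total = 0
with torch.no_grad():
    for seqs, labels, lengths in test_loader:
        preds = model(seqs, lengths).squeeze(1)
        correct += ((preds > 0.5).long() == labels).sum().item()
        total += labels.size(0)
print(f"test accuracy: {correct / total:.3f}")
```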
VI. Predict
This is part of an advertisement for English lessons that I received recently. It seems a bit tricky to classify as spam, doesn’t it?
Have you been really busy this week? Then you'll definitely want to make time for this lesson. Have a wonderful week, learn something new, and practice some English!
Let’s feed this into the model and see if the result is “Spam”:
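A sketch of the prediction step, reusing the tokenize/vectorize helpers and the model sketched earlier (the notebook's own prediction code may differ):

```python
text = ("Have you been really busy this week? Then you'll definitely want to make "
        "time for this lesson. Have a wonderful week, learn something new, and "
        "practice some English!")

# Tokenize and vectorize with the helpers from II-3, then run the model
seq = torch.tensor([vectorize(tokenize(text), vocab)])
length = torch.tensor([seq.size(1)])

model.eval()
with torch.no_grad():
    prob = model(seq, length).item()
print("Spam" if prob > 0.5 else "Ham", prob)
```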
The model counts it as Spam!
Thank you for reading!
I never expected to write a guide, since I still see myself as a beginner in deep learning. If you find something wrong, please email me or leave a comment; it would be appreciated.
Email: shijoonlee@gmail.com
Github: github.com/sijoonlee