Seeing words: A Deep-Learning Spam Classifier that can crunch Unicode and weird YouTube comments

Highlights:

  • Minimalist VGG-style convnet for classifying Unicode character sequences as ham or spam
  • Best classification accuracy 89% (trained and tested on the YouTube Spam dataset)
  • No need for tokenizing or word embedding
  • Faster training than LSTMs
  • Suitable for transfer learning, e.g. to classifying tweets or other social media comments

Intro:

Being on extended vacation post-graduation means having a lot of time to think about things. One of the things I’ve been thinking about recently is how to do natural language processing (NLP) effectively with deep neural networks on real-world language examples. An example would be to classify the YouTube comment

mix psy gangnam style 강남스타일 mv psy gangnam style 강남스타일 mv

as ham or spam without ignoring the Korean Unicode characters, without using a pre-trained word embedder (such as word2vec), and while keeping the raw character sequence instead of tokenizing it. While using an embedder is nice, it adds to network complexity, and pre-trained embedders can miss words and Unicode characters that may still carry information. By the way, according to YouTube the above comment is not actually spam. Go figure.

Thus the idea was to use a VGG-style convnet to classify an image representation of the character sequence making up a comment. Why would this work, you ask? While it is true that written language is essentially a time-dependent sequence of characters, practical sequence learning with LSTMs is not easy at all, especially with multivariate and sparse sequences, and it is quite slow compared to feedforward nets like the convnet. Instead, I wanted to treat the entire character sequence as a binary 140 x 32 image and use some powerful and quick-to-train convnets to classify spam. The convnet should be able to learn higher-order features of the spam and ham image representations along its successive convolution layers, much as a deep LSTM net learns higher-order temporal features of a sequence in each successive layer.

Each comment was truncated or padded to 140 characters, converted to all lowercase and pruned of punctuation. Then each character in the comment string was encoded as a 32-dimensional binary vector holding the 32-bit Unicode encoding of the character. The vectors were then stacked into a matrix. After adding a dummy dimension (the color channel dimension), the input to the first convnet layer is a 140 x 32 x 1 tensor. The image representation is really just a sequence of 32-dim binary vectors, so you could feed the same data to an LSTM layer.
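
To make the encoding concrete, here is a minimal sketch of the steps just described. It is not the code from process.py: the function names are made up for illustration, and it assumes that "punctuation" means ASCII punctuation, that short comments are padded with all-zero rows, and that the 32 bits are the character's Unicode code point written most significant bit first.

```python
import string

import numpy as np

MAX_LEN = 140  # characters per comment (from the text)
N_BITS = 32    # bits per character (from the text)


def clean(comment):
    """Lowercase and strip ASCII punctuation (the exact set is an assumption)."""
    table = str.maketrans("", "", string.punctuation)
    return comment.lower().translate(table)


def encode(comment):
    """Return a (140, 32, 1) float32 array of 0/1 values for one comment."""
    text = clean(comment)[:MAX_LEN]  # cap at 140 characters
    image = np.zeros((MAX_LEN, N_BITS), dtype=np.float32)
    for row, ch in enumerate(text):
        code = ord(ch)  # Unicode code point of the character
        for bit in range(N_BITS):
            # Most significant bit first; unused rows stay zero as padding.
            image[row, bit] = (code >> (N_BITS - 1 - bit)) & 1
    return image[..., np.newaxis]  # add the dummy color channel


x = encode("mix psy gangnam style 강남스타일 mv")
print(x.shape)  # (140, 32, 1)
```

Because every character goes through `ord`, Korean characters and other non-ASCII symbols get their own bit patterns with no vocabulary or embedding table involved.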

Implementation:

The convnet was implemented in Python 3.5 using more or less the latest versions of Keras and TensorFlow. The net uses a mean_squared_error loss and rmsprop as the optimizer. I found that a softmax activation on the single output neuron combined with a binary_crossentropy loss did not give stable results for this particular setup. (Softmax over a single unit always outputs 1, which may be part of the reason; sigmoid is the usual activation for a single output neuron.) Here’s the main.py file and the process.py file. You need both to run the code. Also, grab the dataset from http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection if you haven’t done so already.
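
For readers who want a feel for the shape of such a network before opening main.py, here is a rough sketch of a small VGG-style convnet in Keras. Only the 140 x 32 x 1 input, the mean_squared_error loss, the rmsprop optimizer and the use of dropout come from this post. The layer counts, filter numbers, dropout rates and the sigmoid output are assumptions for illustration and may well differ from the actual code, and `build_model` is a made-up name.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_shape=(140, 32, 1)):
    """Small VGG-style stack: pairs of 3x3 convs, each followed by pooling."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # Block 1 (filter counts are assumptions)
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # Block 2
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # Classifier head
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        # Single output in [0, 1]; sigmoid is an assumption, not from the post.
        layers.Dense(1, activation="sigmoid"),
    ])
    # Loss and optimizer as named in the text.
    model.compile(loss="mean_squared_error", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```

With an output between 0 and 1 and labels of 0 (ham) or 1 (spam), the mean squared error simply penalizes the squared distance between the prediction and the label.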

Training and testing:

Training ran for 4 epochs of 1759 batches, each batch holding a single comment, at approximately 2–3 batches per second. Here is the end of the training log:

1/1 [==============================] - 0s - loss: 2.5850e-04 - acc: 1.0000
Epoch 3 Iteration 1750
Epoch 1/1
1/1 [==============================] - 0s - loss: 1.4149e-05 - acc: 1.0000
Epoch 3 Iteration 1751
Epoch 1/1
1/1 [==============================] - 0s - loss: 0.0318 - acc: 1.0000
Epoch 3 Iteration 1752
Epoch 1/1
1/1 [==============================] - 0s - loss: 5.2865e-06 - acc: 1.0000
Epoch 3 Iteration 1753
Epoch 1/1
1/1 [==============================] - 0s - loss: 1.4153e-04 - acc: 1.0000
Epoch 3 Iteration 1754
Epoch 1/1
1/1 [==============================] - 0s - loss: 1.1740e-06 - acc: 1.0000
Epoch 3 Iteration 1755
Epoch 1/1
1/1 [==============================] - 0s - loss: 0.2504 - acc: 0.0000e+00
Epoch 3 Iteration 1756
Epoch 1/1
1/1 [==============================] - 0s - loss: 6.4793e-04 - acc: 1.0000
Epoch 3 Iteration 1757
Epoch 1/1
1/1 [==============================] - 0s - loss: 0.0021 - acc: 1.0000
Epoch 3 Iteration 1758
Epoch 1/1
1/1 [==============================] - 0s - loss: 3.2439e-05 - acc: 1.0000
Epoch 3 Iteration 1759

And here is the validation against the test set. Each line has the format (actual label: 0 for ham, 1 for spam)|(classifier output), followed by the cleaned comment text:

1|0.984469 look at my channel i make minecraft pe lets play  
0|0.571624 i found out this song now
0|0.00116367 2 billion
0|0.00136812 song name
1|0.996406 please like httpwwwbubblewscomnews9277547peaceandbrotherhood
1|0.887957 hey again if you guys wouldnt mind chacking out my rap give it like and il giver 3 of your vids a like
1|0.996646 httpwwwamazoncoukgpofferlistingb00ecvf93gsr82qid1415297812refolptabrefurbishedieutf8ampconditionrefurbishedampqid1415297812ampsr82
0|0.824152 the most watched video on youtube is psy’s “gangnam style” with 21 billion views psy gangnam style 강남스타일 mv
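
If you want to turn lines in this format into an accuracy number, a small sketch like the following would do. The 0.5 decision threshold and the helper names are my assumptions for illustration; the sample lines are copied from the output above.

```python
def parse_line(line):
    """Split 'label|score text' into (int label, float score, comment text)."""
    label, rest = line.split("|", 1)
    score, _, text = rest.partition(" ")
    return int(label), float(score), text.strip()


def accuracy(lines, threshold=0.5):
    """Fraction of lines where (score >= threshold) matches the actual label."""
    hits = 0
    for line in lines:
        label, score, _ = parse_line(line)
        hits += int((score >= threshold) == bool(label))
    return hits / len(lines)


sample = [
    "1|0.984469 look at my channel i make minecraft pe lets play",
    "0|0.571624 i found out this song now",
    "0|0.00116367 2 billion",
    "0|0.00136812 song name",
]
print(accuracy(sample))  # 0.75: the second line is a false positive
```

Raising the threshold trades false positives (ham flagged as spam) for false negatives, which is worth playing with given that the classifier output is a continuous score.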

Discussion:

Pretty cool, huh? The best test accuracy over a few of my runs after 4 epochs was 89%. The network has a fair bit of dropout to keep overfitting in check, but to squeeze out more accuracy you can experiment with putting punctuation back in, trying different optimizers, batch sizes and regularization, adjusting the conv filter sizes, and/or reducing the number of filters in the last two Conv2D layers (as information at that level should be more abstract). The trained weights can be saved and used for transfer learning on Twitter or Instagram comments. Anyway, I hope you had fun reading and experimenting with this code. Leave a comment if you have any questions.