Building an AI Sift — the neural network classifier

Thomas Schroeter
6 min read · Feb 12, 2016


In the previous post by Randal, we experimented with three email classifiers. The three approaches showed interesting results, but each had its limitations:

  1. Naive Bayes was fast, but its accuracy was significantly lower.
  2. Logistic Regression had high accuracy, but was significantly slower to train and required a lot of memory.
  3. TF-IDF was accurate for some labels but had low overall accuracy.

These approaches were meant to be benchmarks for the neural network classifier covered in this post.

Inspired by Andrey’s excellent post on a neural network based email classifier, we decided to test a similar implementation using the same set of tools suggested in his post: Python, scikit-learn and the Keras deep learning library. An interesting aspect of our platform is that we support Sifts implemented in multiple languages. Making use of this feature, the classifiers outlined in our previous blog post are written in JavaScript, while the neural network classifier is written in Python.

Setting Up

We jumped right into implementing the neural network since we had all the tokenised and stemmed data from our previous implementation. We added the neural network implementation as a node in the Directed Acyclic Graph (DAG) that forms the Sift.

We packaged all the library dependencies of Keras and our Python neural network node into a Docker container for easy reuse.

Here is a quick recap of the data set, presented in our previous post, used to validate the prediction accuracy of the Sift’s classifiers: a corpus of 7.4k emails across five labels, split into 5.5k emails for training and 1.8k emails for validation. The labels are distributed as follows:

  • ICE 89.3 %
  • INTJ 4.5 %
  • Tax & Accounting 3.1 %
  • Mistakes 1.8 %
  • Property 1.3 %

The Neural Network

Our neural network based classifier trains itself on the corpus of labeled emails described above. The trained model is then capable of predicting labels of future emails.

The Keras library requires the corpus in numeric form. Our Python node transforms the incoming sequences of tokens, represented as strings, into sequences of integer IDs. The model only looks at training emails when building the dictionary from token to integer ID.
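As an illustration, here is a minimal sketch of that mapping (the helper names are hypothetical, not the exact node code): the vocabulary is built from the training emails only, and tokens that were never seen during training map to a reserved ID.

def build_vocabulary(training_token_sequences):
    # Map each token seen in the training emails to a unique integer ID.
    # ID 0 is reserved for tokens that never appear in the training set.
    vocabulary = {}
    for tokens in training_token_sequences:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary

def tokens_to_ids(tokens, vocabulary):
    # Convert a sequence of string tokens into a sequence of integer IDs.
    return [vocabulary.get(token, 0) for token in tokens]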

We obtained good results by using the off-the-shelf multilayer perceptron described in the Keras text classification example. This model has surprisingly short training times and high accuracy.

The following code snippet shows the setup of the neural network we used. The configuration is identical to the text classification example of Keras.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(X_train, Y_train, ...)  # parameters omitted
score = model.evaluate(X_test, Y_test, ...)
y_predicted = model.predict_classes(X_test)
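For context, the Keras example this follows feeds the model binary bag-of-words matrices of width max_words and one-hot label vectors. A sketch of that preparation step, assuming the integer ID sequences produced earlier (variable names are illustrative; the nb_words and nb_classes arguments are spelled num_words and num_classes in newer Keras versions):

from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils

tokenizer = Tokenizer(nb_words=max_words)
# sequences of integer token IDs -> binary bag-of-words matrices
X_train = tokenizer.sequences_to_matrix(train_id_sequences, mode='binary')
X_test = tokenizer.sequences_to_matrix(test_id_sequences, mode='binary')
# integer class labels -> one-hot vectors, one column per label
Y_train = np_utils.to_categorical(train_label_ids, nb_classes)
Y_test = np_utils.to_categorical(test_label_ids, nb_classes)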

Under the Hood: Training time vs accuracy

In this section we look at the trade-off between training time and accuracy. We run the neural network several times, keeping an increasing number of tokens in the training corpus. Keeping fewer tokens speeds up training and reduces the memory footprint. Our classifier can select k tokens in two ways: either the k most common tokens, or the tokens chosen by the SelectKBest feature selector from scikit-learn, also described in Andrey’s blog post.
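A sketch of the two selection strategies using scikit-learn, assuming a term-count matrix X_counts over the training documents and their labels y (these names, and the chi-squared score function, are our assumptions rather than the exact setup used here):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

k = 1000

# "most common": keep the k tokens with the highest total count
total_counts = np.asarray(X_counts.sum(axis=0)).ravel()
most_common_idx = np.argsort(total_counts)[-k:]

# "best": keep the k tokens ranked highest by SelectKBest
selector = SelectKBest(chi2, k=k)
X_best = selector.fit_transform(X_counts, y)
best_idx = selector.get_support(indices=True)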

For a corpus of fixed size (5.5k training documents in our test data set), the training time of the neural network grows linearly with the number of tokens fed into the model, ranging from a few seconds with 500 tokens to a little more than 2 minutes when keeping all 12k tokens of the input documents.
The accuracy of predictions as a function of the number of tokens and the method used to select them. Note that the training time increases linearly with the number of tokens. The “most common” strategy selects the most common tokens, while “best” uses the SelectKBest feature selector of scikit-learn. When keeping only 500 tokens, “best” performs slightly better than “most common”. At 1k tokens, “most common” performs significantly better than “best”. At 5k tokens, “best” performs better than “most common”. When using all 12k tokens, both strategies produce the same result, as expected. The model shuffles documents during training; the error bars show the fluctuation due to training over the documents in a different order. The plot shows the average accuracy over five runs with different random number generator seeds, plus or minus one standard deviation.

Under the Hood: Token selection

When selecting 1k tokens with each method, the two sets of selected tokens have 250 tokens in common; the Jaccard similarity of the two sets is 0.143. The tokens are produced by the tokeniser and stemmer node of the Sift, which is implemented in JavaScript. Notice the popular email domain Gmail among the 10 most common tokens.

The ten highest-scoring tokens under the “most common” and “best” strategies are:

       "most common"    "best"
1. com com
2. gmail devast
3. not self
4. subscrib www
5. wil net
6. http top
7. work subatom
8. rahulp resid
9. off request
10. tim prohibit

Comparing the two lists, the best tokens seem to contain more specific words.
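For reference, the 250-token overlap mentioned above corresponds to a Jaccard similarity of 250 / (1,000 + 1,000 − 250) ≈ 0.143, which can be computed with a small helper:

def jaccard_similarity(a, b):
    # |A intersect B| / |A union B| for two sets of tokens
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))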

Under the Hood: Confusion matrices

The confusion matrices below give us an overview of the classification performance of the neural network when using 1k and 12k tokens. True labels are shown on the left, predicted labels are on top.
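The matrices themselves can be generated with scikit-learn from the predictions of the evaluation step, for example (a minimal sketch; y_test_ids is an assumed name for the true class indices of the validation emails):

from sklearn.metrics import confusion_matrix

# rows: true labels, columns: predicted labels (class indices 0 .. nb_classes - 1)
cm = confusion_matrix(y_test_ids, y_predicted)
print(cm)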

Confusion matrix when using the most common 1k tokens. The model misclassified documents of the categories “Mistakes”, “Property” and “Tax & Accounting” as “ICE”. Note that almost 90 % of the documents in the corpus belong to the category “ICE”. The model also wrongly applied the label “Mistakes” to documents of the classes “ICE” and “Property”.
Confusion matrix when using all 12k tokens. This confusion matrix shows that the classifier predicted the label “ICE” for some emails belonging to the categories “Mistakes”, “Property” and “Tax & Accounting”.

The following screenshot shows a side by side comparison of the confusion matrices of all four classification algorithms we built into our Sift.

Comparison of 4 different email classifiers. Randal presented Naive Bayes, Logistic Regression and TF-IDF in his blog post. The neural network extension to the Sift presented here reaches a high prediction accuracy (98.97 %), comparable with the Logistic Regression based model (99.4 %). The time required to train the neural network is much shorter than that of the Logistic Regression classifier: 6 seconds versus 3 hours.

Verifying our results

When running over a larger corpus of 32k emails across the same five labels, we surprisingly obtained 100% accuracy; the model did not mispredict a single label! Good news for all of you who are manually filing emails: technology is here to help you. It seemed almost too good to be true, so we investigated further. We moved more emails from the training set into the verification set, wondering when the model would start misclassifying emails. At a split of about 50/50 between training and verification emails, the model finally started to mispredict a few labels, with accuracy still at 99%. So it was working, just far better than we expected.

We hypothesised that the email corpus we were using included a lot of mailing lists and newsletters, which contained visible structures that could create a strong signal for the classifier. We wanted to see what would happen if we fed it a really noisy signal, similar to what a real-world inbox classifier might encounter. For the next test run we had a colleague manually classify approximately 5,000 emails across 14 labels over the weekend. On this data set the neural network reached an accuracy of 94%. Given that this data set is probably very hard for even a human to classify reliably, the accuracy of this off-the-shelf, untuned network continues to surprise us.

The model performs very well, with 30 seconds of training time on the corpus of 32k emails and an accuracy of 94% on the real-world data set. With this performance profile we can regularly re-train the network to classify a user’s inbox with useful levels of accuracy.

What comes next?

The test results presented here suggest that neural network based classifiers can assist users with organising their emails.

We plan to convert our experiment into an end-user Sift that regularly trains on a user’s email archive, learns about existing labels and classifies new emails based on the trained model. Users will be able to install the Sift using our Chrome Extension for Gmail. We will be adding functionality that allows users to correct misclassified labels and uses this feedback to improve the classifier the next time it trains.

The Sift will be made available to our early access users and revealed in a final post. We will open source the code and the Docker containers used in the Sift so that developers can fork and tinker with the Sift.

Intrigued?

We are handing out a new batch of invites for our early access program in the coming months and we would love to see the interesting Sifts you can come up with. If hacking your own data and potentially creating solutions that can help thousands (or millions) of other people sounds interesting to you, why not register here for early access or email us?
