Introducing One of the Best Hacks in Machine Learning: the Hashing Trick

2018 has been hailed by various outlets as the year when spam will start to die because machine learning algorithms will become almost perfect at figuring out what is real mail and what’s not. I’m not convinced that will ever happen (advances in machine learning cut both ways), but I’d like to share some general thoughts on how ML-based simple spam classifiers are built and how to overcome a significant issue, filter circumvention, using one of the best hacks in machine learning: the hashing trick. It’s useful outside spam detection, too.

Building a simple spam classifier

For document classification tasks, including spam classification, one usually starts by building what’s known as a bag-of-words (BOW) representation. Given a set of known spam and non-spam emails, each unique word is added to a vocabulary and assigned a unique index, typically starting from 0. Let’s say, for brevity’s sake, that we have a set of two short text examples, one which is spam and another that is legitimate:

i make ten thousand dollars per week just surfing the web! (spam)
are you free for a meeting early next week? (not spam)

If we scan the dataset and start building our vocabulary, we might end up with something like this:

i: 0
make: 1
ten: 2
thousand: 3
dollars: 4
per: 5
week: 6
just: 7
surfing: 8
the: 9
web: 10
are: 11
you: 12
free: 13
for: 14
a: 15
meeting: 16
early: 17
next: 18

There are 19 unique words in total, and each is assigned a unique index (note that the word week appears in both examples). The next step is to create feature vectors for our machine learning model. We start by creating a zero column vector for each example, with the same number of elements as there are words in our vocabulary (19):

i make ten thousand dollars per week just surfing the web! (spam)
-> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
are you free for a meeting early next week? (not spam)
-> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Then, for each word in each example, we perform a vocabulary lookup to get the index and increment the value at that index by one:

i make ten thousand dollars per week just surfing the web! (spam)
-> [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0]
are you free for a meeting early next week? (not spam)
-> [0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1]

The resulting feature vectors are bag-of-words representations. BOW representations typically throw out information about punctuation and word order, but for many problems, this isn’t an issue. More sophisticated BOW representations use TF-IDF weights and/or n-grams instead of raw word counts, but the basic idea is the same.

One we have our BOW feature vectors, we can train a binary classifier to build a spam filter. There are many choices concerning learning algorithms, but the most common suspects are Naïve Bayes, random forests, logistic regression and, increasingly, neural networks. Given a trained model, we can use the vocabulary to feed in a new email as a BOW vector and predict whether or not the example is spam. Note that for real-time inference, we need to keep the vocabulary in RAM to be as fast as possible.

The issue: filter circumvention

Spammers are crafty. One popular way of making sure spam doesn’t get filtered out is to mix in words not in the vocabulary used to learn the classifier. Consider, for example, the following, slightly contrived sentence:

ii mayke are you th0usands of free for a $$$s surf1ing teh webz meeting early next week

Clearly, this is not something anyone would consider a legitimate email. But what happens if we use our vocabulary to build a BOW vector for this example? The first eight words aren’t in our vocabulary at all, and won’t make it in. The rest are, resulting in the following vector:

ii mayke are you th0usands of free for a $$$s surf1ing teh webz meeting early next week
-> [0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1]

This vector is the same as the one for the legitimate example are you free for a meeting early next week? . Any classifier trained on our examples would likely think this spam is legitimate. This is a significant problem, and not as easy to solve as one might think. We could add the new words to our vocabulary, but that would mean the resulting feature vectors’ size will change, as will the vocabulary itself. Machine learning models typically learn on fixed-size training examples, so we would need to retrain our model from scratch. That takes time, and whilst we’re doing that, the old classifier will keep on accepting spam. We need a solution that a) can deal with out-of-vocabulary words, b) doesn’t require us to retrain our models from scratch every time we encounter a new word or misspelling, and c) is as accurate as possible. If we could get away without keeping a huge vocabulary in RAM, even better.

Introducing the hashing trick

Hash functions are fundamental to computer science. There are lots of different types of hash functions, but they all do the same thing: map data of arbitrary sizes to data of a fixed size. Typically, they spit out a number (known as a hash):

"John Doe" -> hash function -> 34
"Jane Doe" -> hash function -> 48

The logic by which a hash is calculated depends on the hash function itself, but all hash functions share the same common characteristics:

  • If we feed the same input to a hash function, it will always give the same output.
  • The choice of hash function determines the range of possible outputs, i.e. the range is always fixed (e.g. numbers from 0 to 1024).
  • Hash functions are one-way: given a hash, we can’t perform a reverse lookup to determine what the input was.
  • Hash functions may output the same value for different inputs (collision).

Hash functions are incredibly useful in just about any area of computer science, but how can they be used to fix the out-of-vocabulary issue of our spam classifier? The answer isn’t immediately obvious, but the first step is to get rid of our vocabulary altogether. Instead, when constructing our BOW representations, we’ll start by making a zero column vector with a huge number (say, 2²⁸) of elements for each of our training examples:

i make ten thousand dollars per week just surfing the web! (spam)
-> [0 0 0 0 ... 0 0 0 0] (2^28 elements)
are you free for a meeting early next week? (not spam)
-> [0 0 0 0 ... 0 0 0 0] (2^28 elements)

Next, we’ll choose a hash function f that eats strings and outputs values in the range [0, 2²⁸). In other words, we’re making sure that our hash function will never address an index outside our feature vectors’ dimensions.

After this initialisation, for each training example, we feed each word, one by one, through our hash function, and increment the value at the given index by one — just as before. We might end up with sparse vectors like this:

i make ten thousand dollars per week just surfing the web! (spam)
-> [0 ... 0 1 1 1 0 1 1 0 ... 0 1 1 1 1 0 1 1 0] (2^28 elements)
are you free for a meeting early next week? (not spam)
-> [0 1 0 1 0 ... 0 1 0 ... 0 1 0 ... 0 1 10 1 1 0 1] (2^28 elements)

This process is known as the hashing trick.

We now have our BOW representation, and can train a classifier on the data as before. Simple, no? We’ve foregone the use of a separate vocabulary, which means we don’t have to keep a potentially large list of words in RAM. But that is merely a nice side-effect — the real issue we want to tackle is filter circumvention using out-of-vocabulary words. So how does the hashing trick help?

Let’s say we have a spam classifier trained on a bunch of sparse 2²⁸ BOW feature vectors. Given a new piece of mail, we do as before, initialising a 2²⁸ vector and passing each word through our hash function. Unlike before, every single word ends up incrementing some feature value. Given our BOW vector, every word — even new ones — is taken into account at prediction time. New words still worsen the accuracy of our classifier, but it’s no longer possible to circumvent our spam filter entirely by making up new words. Since the BOW vectors all stay the same size, we can incrementally fit our model with new spam/non-spam examples without retraining the entire thing from scratch. It’s a form of online learning: when a user marks an email as spam, the model learn is capable of learning from that incrementally, without restarting the entire process. For a practical application like spam filtering, this is a clear benefit of feature hashing: we can react quickly to attacks by doing learning as soon as new spam/non-spam examples come in.

But what about collisions, I hear you ask? Isn’t it possible that some intentional misspelling ends up incrementing the same index as some legitimate word as it passes through the hash function? Yes, it can happen, but if you choose your vector size (make it as large as possible) and hash function carefully, the odds of this happening is negligible, and even if it does, it usually doesn’t affect learning (or accuracy) that much. The documentation for standard hash functions typically include collision probabilities, so make sure to look those up when making your own hashing trick solution.

Note that in some cases, you may even want collisions (e.g. for grouping similar legitimate words), in which case you may want to bucket those before hashing.

Some final thoughts

The hashing trick is one of those neat tricks in machine learning that doesn’t get nearly as much love as it deserves. The only real downside is the fact that reverse lookups (output to input) aren’t possible, but for many problems, that isn’t a requirement. Thinking in more general terms, the hashing trick allows you to use variable-size feature vectors with standard learning algorithms (regression, random forests, feed-forward neural networks, SVMs, matrix factorisation, etc.). That should be enough to make most machine learning practitioners at least a little bit excited.