Using Naïve Bayes to create a Spam Classifier for Gites

Giuseppe Maggiore
Hoppinger
Aug 13, 2019

by Giulia Costantini and Giuseppe Maggiore

Introduction

In this article we will explain how we built a (quite accurate!) spam classifier for the text messages posted on Gites, an online marketplace for advertising the rental and sale of holiday homes and B&Bs. A message in this context is considered spam if it contains contact information (more or less explicit: a concrete email address or telephone number, or something vaguer like “contact us through our website for a discount”) that would move the communication between user and advertiser outside of the marketplace, where it is supposed to remain. To do this we used a well-known and reasonably simple technique: the Naïve Bayes classifier. However, instead of making a call to an existing library, we decided to implement everything ourselves from scratch, in order to be able to adjust the algorithm to work better with our specific case study. Machine learning is not an easy field, and just applying a given technique as a magical “black box” can often result in disappointing outcomes. Without a thorough knowledge of the context and the possibility of tweaking the “insides” of the algorithm to better suit that context, improving the accuracy is going to be a tricky (if not impossible) task.

We will start by explaining how a basic Naïve Bayes spam classifier works, referring concretely to our implementation (written in F#). Then we will discuss the custom changes we made to the algorithm to greatly improve its accuracy in our context.

A vanilla Naïve Bayes spam classifier

Short informal description: how does it work?

Very informally, this is how the classifier works. Given a set of labelled messages (message + spam/ham[1] label), the classifier tries to understand which words are good “hints” for deciding whether a message is spam or not, by computing the probability of each word appearing in a spam/ham message. Each word is thus associated with two probabilities: the probability of that word appearing in a spam message, and the probability of that word appearing in a ham message. This step is entirely based on the dataset of labelled messages.

When we want to classify a new message (which is thus not labelled yet), we look at the words which form the message and combine the hints (i.e., the probabilities) coming from each of them. The classification is then made by comparing the “weight” of the hints pointing towards spam with that of the hints pointing towards ham: the bigger one wins. For example, if the probability of the word “website” appearing in spam messages is very high but the probability of the same word appearing in ham messages is very low, then the presence of that word in a message to classify is a strong hint that the message is spam (even though this hint has to be combined with the hints given by all the other words in the message).

The name of the technique comes from Bayes’ theorem, on which the probability computation (being spam or being ham) is based.

A more formal description, with formulas

More formally, here are the formulas we will need to be able to implement the algorithm.

We want to estimate the probabilities of the different outcomes (spam/ham) based on observed events (words in a message). Using Bayes theorem we can write:

P(outcome | events) = P(events | outcome) × P(outcome) / P(events)

Given that all outcome probabilities are normalized by the same P(events), we can simply ignore the denominator and compute a score instead:

Score(outcome | events) = P(events | outcome) × P(outcome)

If we assume a uniform probability of outcomes (because it is hard to estimate and it can fluctuate), we can also remove the common factor P(outcome):

Score(outcome | events) = P(events | outcome)

If we also assume (notice we are making a lot of assumptions… that’s why the algorithm is called naïve!) independence of the different events, then we can write:

Score(outcome | events) = P(e_1 | outcome) × P(e_2 | outcome) × …

Where e_1, e_2, … are the individual words present in the message. At this point, we only need to compute P(e_i | outcome), that is, the probability of a single word appearing in a message, given that the message has a specific outcome. To do this, we take all samples from the training set belonging to the desired outcome class and count the proportion of occurrences of this event with respect to the number of all event occurrences in that class:

P(e_i | outcome) = (# e_i in outcome) / (# all events in outcome)

Since multiplying many very low probabilities together would most likely result in a floating-point underflow[2], we can change the formula above into an equivalent one which computes the sum of the logarithms instead, and then exponentiates it:

Score(outcome | events) = e^(ln P(e_1 | outcome) + ln P(e_2 | outcome) + …)

Finally, since exponentiation is monotonic and applied to both scores, we can drop it and keep only the sum of the logarithms:

Score(outcome | events) = ln P(e_1 | outcome) + ln P(e_2 | outcome) + …

The final prediction is thus based on the comparison between two scores, Score(spam | events) and Score(ham | events), where the events are the words of the message to classify. The highest score determines the classification.

What does this look like in code?

As we have just seen, to decide whether a message is spam or ham based on the words it contains, we need to compare the scores of both spam and ham given those words and take the outcome with the highest score. In code we do it as follows:
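A minimal F# sketch of this step (the Outcome type and allOutcomes are our own illustrative definitions; score is sketched in the next snippet):

```fsharp
type Outcome = Spam | Ham

let allOutcomes = [ Spam; Ham ]

// Return the outcome (spam or ham) with the highest score for the given words.
let classify (score: Outcome -> string list -> float) (events: string list) : Outcome =
    allOutcomes
    |> Seq.maxBy (fun o -> score o events)
```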

Here we consider the sequence allOutcomes (spam and ham) and, for each outcome o, we compute its score given the list of words (the events). We then return the outcome with the highest score through the use of maxBy.

But how do we compute the score of an outcome given a list of words? As we have seen earlier, we need to sum the logarithms of the conditional probabilities of each single word, given the outcome. In code it looks as follows:
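A sketch of the score computation, building on the Outcome type above; condProb is the conditional probability function sketched right after:

```fsharp
// The score of an outcome is the sum of the logarithms of the conditional
// probabilities of each word (event) in the message, given that outcome.
let score (condProb: string -> Outcome -> float) (o: Outcome) (events: string list) : float =
    events
    |> List.sumBy (fun e -> log (condProb e o))
```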

The last step is then the computation of the conditional probability (condProb in the code above) of a specific word (e) given an outcome. This is simply the ratio between the occurrences of that word in messages with that outcome and the total word occurrences in messages with that outcome:
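A sketch of condProb, assuming the two maps described just below (eventOccurencesByOutcome and totalOutcomes) have already been built from the training set:

```fsharp
// eventOccurencesByOutcome: occurrences of each word per outcome.
// totalOutcomes: total word occurrences per outcome.
let condProb (eventOccurencesByOutcome: Map<Outcome * string, float>)
             (totalOutcomes: Map<Outcome, float>)
             (e: string) (o: Outcome) : float =
    let occurrences =
        eventOccurencesByOutcome
        |> Map.tryFind (o, e)
        |> Option.defaultValue 0.0   // a word never seen with this outcome counts 0
        // (a zero count makes log produce -infinity; smoothing, or the filtering of
        //  uninteresting words described later in the article, avoids this in practice)
    occurrences / totalOutcomes.[o]

// Putting the sketches together (illustrative):
// let prediction = classify (score (condProb eventOccurencesByOutcome totalOutcomes)) wordsOfMessage
```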

Note that eventOccurencesByOutcome and totalOutcomes are maps which are filled in while reading the dataset of labelled messages and contain, respectively:

  • Given an outcome and an event, the occurrences of such event (word) in messages with that outcome.
  • Given an outcome, the total word occurrences in messages with that outcome.

From “vanilla” to “working” in a concrete scenario

In the previous paragraphs we have shown and explained the core of the algorithm. However, to get a well-working implementation, we need to consider a few other important things.

Preprocessing

First of all, how do we store our training dataset? The simplest way is probably to create a CSV file with only two columns: in the first one we write the outcome (spam/ham) of a message, and in the second one we write the actual message. Since the comma is the character used to separate the columns of a CSV file (and when we import the data into our program we need to be able to split it unambiguously into columns), an important pre-processing step is removing the commas present in the messages.

In our program, after loading the data from this CSV file, we will need some further pre-processing to avoid confusing the algorithm with unimportant bits. For example, we should remove all uninteresting characters from the messages (!, ?, ., ;, &, _, new lines, etc.), transform the text to lowercase (otherwise “GREAT” and “great” would count as two different words!) and remove words which are too short to be interesting. In our case we only removed words made up of a single character, but depending on the context this could be changed to two or three.
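A sketch of this preprocessing; the exact character set and the minimum word length are just the illustrative choices described above:

```fsharp
open System

// Characters we consider uninteresting and replace with spaces.
let uninterestingChars = set [ '!'; '?'; '.'; ';'; '&'; '_'; ','; '\n'; '\r' ]

let preprocess (message: string) : string list =
    message.ToLowerInvariant()                                    // "GREAT" and "great" become the same word
    |> String.map (fun c -> if uninterestingChars.Contains c then ' ' else c)
    |> fun cleaned -> cleaned.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.filter (fun w -> w.Length > 1)                       // drop one-character words
    |> Array.toList
```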

Creating a test set

If we used our entire dataset to train our model, we would have no idea whatsoever of how well that model works. In supervised learning, the approach to solving this problem is to create a test set, separate from the training set. This means that, given our entire dataset of labelled messages, we split it into two parts: most of the messages form our training set, while a small subset is “kept for later” to test the accuracy of our model (by comparing the classification made by the model with the “real” classification). How many messages should we keep as test set? There is no magic number, and the choice also depends on the size of the original dataset. We want to use as much data as possible to create a good model, but we also need a decent number of messages to be able to test it. In our case, we extract 100 messages as test set from the original dataset (which contains approximately 11,000 messages). Note that these 100 messages are not extracted completely at random: we want 50% of them to be spam and 50% to be ham, so that we can test our model thoroughly on both outcomes and avoid bias.

Note that this random split between training and test messages happens every time we train a new model, so the accuracy of the trained model will be slightly different every time (depending on which messages were selected as test set[3]).
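A sketch of the stratified split and of the accuracy measurement, building on the Outcome type above (splitDataset and accuracy are illustrative names):

```fsharp
let rnd = System.Random()

// Take 50 random spam and 50 random ham messages as test set; the rest is the training set.
let splitDataset (data: (Outcome * string list) list) =
    let split outcome n =
        data
        |> List.filter (fun (o, _) -> o = outcome)
        |> List.sortBy (fun _ -> rnd.Next())     // shuffle
        |> List.splitAt n
    let spamTest, spamTrain = split Spam 50
    let hamTest,  hamTrain  = split Ham 50
    spamTest @ hamTest, spamTrain @ hamTrain

// Accuracy: percentage of test messages whose predicted label matches the real one.
let accuracy (classifier: string list -> Outcome) (testSet: (Outcome * string list) list) =
    testSet
    |> List.filter (fun (label, words) -> classifier words = label)
    |> List.length
    |> fun correct -> float correct / float testSet.Length * 100.0
```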

Performance

Given the amount of computation performed by the algorithm (we assume that the dataset will be big, otherwise the algorithm does not have enough data to learn from), it is quite obvious that performance could be an issue. This means that when implementing such an algorithm we must be careful about which data structures we choose. A suggestion is to use maps as much as possible: when we look something up (for example, the occurrences of a word in messages with a given outcome), the difference between iterating over a list and doing a look-up in a map is huge.[4]

Currently, our implementation takes under 5 seconds to train a model on a dataset of 11,000 messages. Given a trained model, classifying one message is practically instantaneous.

Extracting information to evaluate the results

We are now able to train our model and to compute its accuracy on the test set. As hinted at before, the accuracy is simply the percentage of correctly classified messages in the test set. There are many questions we can ask ourselves now: is the accuracy good enough? Which kinds of mistakes are we making (false positives or false negatives[5])? When we make a mistake, why does that happen? And finally:

Knowing the answers to all previous questions, is there something we can do to get a higher accuracy?

This is the point where, if we were using an existing library, we would most likely get stuck, since it would be difficult to get access to the inner computations of the algorithm and to modify them. In machine learning a “DIY approach” is certainly more difficult in the beginning but it becomes rewarding later on. In our case, having implemented everything from scratch, we have access to every single computation and intermediate numbers, meaning that we can inspect what’s happening and fix/adjust it where needed.

Our first accuracy was, on average, around 70% and, even though in machine learning you cannot aim for 100% and should accept the existence of mistakes, we felt that this was not good enough for our application and that it was worth investigating why (with the goal of improving, of course!). The next step was thus to create a file containing all sorts of information on the mistakes made by the algorithm. In particular, for each wrongly classified message we printed the following:

  • The entire message
  • The mistake type: false negative or false positive?

It is important to understand what a false negative/positive is and what the implications are for our concrete application. A false negative is a spam message which is classified as ham. A false positive is the opposite, that is, a ham message classified as spam. Which one is worse? In our case, when a message is classified as spam by the tool, someone is alerted to check it out (and, if the spam is confirmed, there are consequences such as deleting the message). If a message is classified as ham, nothing happens and the message is “ignored”. Since we would really like to find all spam messages (which are anyway a small subset of the entire message set), we prefer “wasting” a few seconds checking some ham messages wrongly classified as spam, rather than letting spam messages flow unidentified through the system. This is why we prefer to have false positives rather than false negatives. The same accuracy percentage could thus be acceptable (in our context) if all mistakes are false positives, but we should be careful if many mistakes are instead false negatives.

  • The scores associated with both outcomes (ham/spam)

If the two scores are very close to each other, it means that the classifier is unsure, and this could be due to a message which is actually really difficult to classify. We should of course read the message to verify this hypothesis.

If the two scores are instead quite far apart, then the classifier is really convinced of its mistake, and it is absolutely necessary to look into why that happens.

  • For each word of the message, the contribution (logarithm of the conditional probability) that the word made to the score computation for both outcomes (ham/spam), and the absolute difference of these two contributions

If the absolute difference is very close to 0, it means that the word does not help us much in distinguishing between ham and spam, since its conditional probability of appearing in ham or spam messages is almost the same. The higher the number, the more interesting the word, as there is a clear-cut difference in the probability of it appearing in spam or ham messages.

  • For each word of the message, the occurrences of such word in the training set

Whenever statistics is involved, we should always remember one of its fundamental rules: “the more data you have, the more accurate you will be”. It is very risky to rely on statistics based on too little data. This is why we decided to look at the amount of times a word was present in the training set. We had the intuition that accepting the contribution given by a word which appears just a few times in thousands of messages could reduce our precision and reliability.

The resulting analysis file contains, for each misclassified message, all the information listed above; a sketch of how the per-word values are computed is shown after the following example.

As a reference for these values, consider that the word “www” (which in our context indicates a spam message with near certainty) has the following associated numbers:

  • Contribution spam: -4.630538878 [6]
  • Contribution ham: -8.793873874
  • Absolute difference: 4.163334996
  • Occurrences: 412

As you can see, the occurrences are numerous (so we can rely on the information this word brings), and the absolute difference of the contributions is quite high (so it is a strong hint towards one of the two possible outcomes): “www” is definitely an interesting word.
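A sketch of how such a per-word diagnostic row could be computed from the quantities defined in the earlier snippets (wordDiagnostics is an illustrative name):

```fsharp
// For every word of a misclassified message: its contribution to both scores,
// the absolute difference of the two contributions, and its occurrences in the training set.
let wordDiagnostics (condProb: string -> Outcome -> float)
                    (eventOccurencesByOutcome: Map<Outcome * string, float>)
                    (events: string list) =
    events
    |> List.map (fun e ->
        let contributionSpam = log (condProb e Spam)
        let contributionHam  = log (condProb e Ham)
        let occurrences =
            allOutcomes
            |> List.sumBy (fun o ->
                eventOccurencesByOutcome
                |> Map.tryFind (o, e)
                |> Option.defaultValue 0.0)
        e, contributionSpam, contributionHam,
        abs (contributionSpam - contributionHam), occurrences)
```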

Improving the accuracy

After executing our “results analyzer” a few times[7] and looking carefully at the information it printed, we drew the following conclusions:

1. First of all, even in the first iteration of our implementation (so certainly not the most accurate one), we were able to detect a few mistakes in the labelling of the training set: that is, some messages had been wrongly classified by the humans working on the dataset. This was actually good news for us: a few of the mistakes made by the classifier were not actual mistakes!

2. The (real) mistakes made by the classifier were mostly false positives (which is a good thing in our context, as explained above). However, sometimes a false negative mistake was made, even in the presence of quite clear clues, such as the words “website” or “www” appearing in the message! After looking closer at what was happening there, we saw that the relatively big contribution of an interesting word (for example a difference in scores of 4 for “www” towards the outcome SPAM) was trumped by the sum of many smaller contributions in the other direction. Especially in long messages, a list of words which we could consider quite neutral (in the sense of a small difference in contribution between the two outcomes) could tip the balance because of their combined effect.

3. In general, we noticed that many words (which, from our perspective on the specific context, sounded rather neutral) had a very low occurrence value and/or a very low absolute difference of contributions.

4. Finally, we also observed that words written in slightly different ways but with the same meaning or in the same semantic area (for example think about “contact”, “contacting”, “contacted”… or simply singular and plural versions of the same word like “discount” and “discounts”), were considered as separate words by the system (which has no idea about language semantics) and were thus giving separate contributions to the computation. Moreover, if the same word appeared multiple times in a message, its contribution was counted for each of the appearances.

Actions taken as consequence of the analysis

The first action taken as a consequence of conclusion 1 was, clearly, a manual check[8] of the entire dataset to correct the human mistakes made in labelling the messages.

Secondly, to address conclusion 4, we decided to remove duplicate words when a message has to be classified (that is, we use a set of words instead of a list). Moreover, we implemented a stemming algorithm, which reduces derived words to their base/root form (the word stem). The stem does not need to be a word itself: for example, the Porter algorithm reduces argue/argued/argues/arguing and argus to the stem argu.
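A sketch of these two fixes; stem stands for a stemming function such as an implementation of the Porter algorithm, which is not shown here:

```fsharp
// Reduce every word to its stem and drop duplicates by going through a set.
let normalizeWords (stem: string -> string) (events: string list) : string list =
    events
    |> List.map stem
    |> Set.ofList
    |> Set.toList
```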

To address conclusions 2 and 3 (the most crucial source of mistakes) we got creative, mixing knowledge from other machine learning algorithms with a sprinkling of common sense. We decided to create our own definition of “interesting word” (based on the numbers printed by our results analyzer, as we will soon explain) and to change the algorithm so that a message to be classified is first filtered to remove the words we deem uninteresting. It is of course possible that a message becomes empty after filtering, and in this case we decided to always give ham as the classification. The reasoning is that, if there is no interesting word in a message, then it is clearly an uninteresting message, hence the ham label. If, instead, the message still contains some words after the filtering, then we apply the usual Naïve Bayes computation and compare the scores of the two outcomes (ham/spam).

The last topic we should discuss is thus: when is a word considered interesting?

Our definition of interesting is based on two considerations:

1. As mentioned earlier, we do not want to draw conclusions from clues for which there is not enough evidence. This translates into excluding words which appear only a few times in the dataset. We thus check the occurrences value of each word appearing in the message to classify and, if the word was not present enough times in the dataset, we ignore it (see the sketch after this list). For this we need to establish a threshold of acceptability. Looking at the concrete data printed by our results analyzer, we decided to start with 10.

2. Even more importantly, we would like to exclude words which are not a clear distinguishing sign between ham and spam messages. The absolute difference of the two contribution values is an indication, but we would like to base our decision on sound statistics. A well-known concept in machine learning (used, for example, in the core algorithm for building decision trees) is that of entropy. Generally, entropy refers to disorder or uncertainty and is computed from the probability distribution of a stochastic source of data as the expected negative logarithm of the probabilities. A higher value of entropy corresponds to more uncertainty/unpredictability. For example, tossing a (fair) coin has the maximum uncertainty possible since it is a uniform distribution (both outcomes have the same probability of happening, that is 0.5). The entropy in this case is 1. The opposite situation could be, for example, a coin with heads on both sides. In this case there is no uncertainty, as heads will be the result every time we toss the coin. The entropy value is 0. How do we apply this knowledge to our context? We can compute the entropy of words! The probability distribution we refer to, then, is the probability of a word appearing in spam and ham messages[9]. As in the coin example, a word which appears half of the time in ham messages and half of the time in spam messages would have the maximum entropy, 1, and would not be informative at all for us. Instead, a word which appears only in one category of messages (spam, for example) would have an entropy of 0 and would carry very useful information. Here too we need a threshold to establish whether the entropy of a word is low enough to make it interesting. After printing the entropy of many words in the dataset, we decided to start with a threshold of 0.6.
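A sketch of the resulting “interesting word” filter, combining points 1 and 2 above and building on the earlier definitions (the thresholds shown are the starting values just mentioned):

```fsharp
let isInteresting (eventOccurencesByOutcome: Map<Outcome * string, float>)
                  (occurrencesThreshold: float)
                  (entropyThreshold: float)
                  (e: string) : bool =
    let count o =
        eventOccurencesByOutcome
        |> Map.tryFind (o, e)
        |> Option.defaultValue 0.0
    let spamCount, hamCount = count Spam, count Ham
    let total = spamCount + hamCount
    if total < occurrencesThreshold then
        false                                   // point 1: not enough evidence for this word
    else
        // point 2: entropy (base 2) of the word's spam/ham distribution
        let plogp p = if p = 0.0 then 0.0 else -p * (log p / log 2.0)
        let entropy = plogp (spamCount / total) + plogp (hamCount / total)
        entropy <= entropyThreshold

// For example, with the starting thresholds mentioned in the text:
// words |> List.filter (isInteresting eventOccurencesByOutcome 10.0 0.6)
```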

Final numbers

For practical reasons, all the improvements discussed above did not happen at the same time. For example, even though we noticed the labelling mistakes immediately, it took a while before the re-checked dataset was available, and in the meantime we worked on other features.

1. The first improvement actually implemented is the last one we discussed above: filtering away uninteresting words and classifying messages only based on the remaining words. This made a big difference: we went from an accuracy of approximately 70% to slightly above 80% (using 0.6 as entropy threshold and 10 as occurrences threshold).

2. After that, we made a couple of small adjustments. Looking at the values printed out, we noticed that the occurrences threshold could be a bit larger, and setting it to 15 brought us to approximately 85% accuracy. Then we filtered away duplicate words from test messages and the average accuracy rose to slightly above 86%.

3. We noticed that many neutral words were still included in the computations, so we decreased the entropy threshold to 0.5. This had a big impact, as the accuracy went to almost 90%.

4. At that point, the dataset had finally been fully rechecked. With the updated dataset, the accuracy was above 90% most of the time (and in some executions we even saw peaks of 95/96%).

5. The stemming algorithm was also ready at that point, and this granted more stability to our classifier: the accuracy stopped oscillating between executions and was almost always above 90% (with even higher peaks: 98%!).

6. Lastly, with the intuition that the stemming algorithm should have made interesting words even “more interesting”, we reduced the entropy threshold to 0.4. The average accuracy is now 94.6% (with a peak of 99%)!

7. Before deciding to stop (an average accuracy of about 95% looks quite satisfactory), we made one last check to ensure that the number of words remaining after the entropy filtering is not too low. We made this check to be sure we are not committing to a too small “vocabulary”, which could become a problem when we classify new messages never seen in the training set. With 0.4 as entropy threshold, the number of words kept is around 3,000, which looks like a good number (also compared to the number of words kept when the threshold is, for example, 0.3 or 0.5).

Summarizing: by getting our hands dirty in the intermediate computations of the algorithm and by being able to customize the implementation for our specific context, we made a huge jump in accuracy, going from the initial 70% to 95% on average!

Conclusion

In this article we explained how we approached the problem of building a spam classifier for text messages in the context of Gites.nl. We decided to use a well-known and fairly simple technique, Naïve Bayes. We implemented it ourselves (instead of resorting to an existing library) in order to maintain full control over the whole classification process. This decision proved to be a good one, as we had to make a few custom changes to the algorithm to tailor it to our context. For example, we combined this technique with another well-known concept in the field of machine learning, namely entropy. We also made sure not to over-fit (that is, to create an algorithm which works perfectly on the training dataset but gets easily lost when receiving unknown data), for example by implementing a stemming algorithm and by making sure that our custom-made filters (excluding “uninteresting” words) would not cut too many words out of our vocabulary.

The final result is thus not only a working application which classifies correctly approximately 95% of the messages of Gites.nl, but also, more generally, a custom implementation of Naïve Bayes which we will be able to easily apply (and adapt as needed) to other contexts.

Notes

[1] The label name ham is commonly used to indicate the opposite of spam, that is a valid message.

[2] Floating-point formats cannot represent values arbitrarily close to zero. Since probabilities are all below 1, when we multiply two of them together we get a number which is lower than both inputs. If we keep multiplying by other numbers lower than 1, the result becomes smaller and smaller and after a while it will be so close to 0 that it is no longer representable as a floating-point number.

[3] If the accuracy is VERY different between executions, then this could mean that the test set is too small (and thus the random selection has a huge impact on the measured accuracy) and it should be investigated further.

[4] Most of the computations in the algorithm are based on look-ups, which means that this is the crucial point to optimize. If you are familiar with big-O complexity notation, you know that searching for an element in a list is O(n), that is, the time to find an element is proportional to the number of elements (n) in the list. On the other hand, F# maps are implemented as immutable AVL trees (self-balancing binary search trees). AVL trees are well known for their efficiency: searching for an element in these trees is O(log n), which means that the time to find an element is proportional to the logarithm of the number of elements in the tree. The base-2 logarithm of 100,000 is approximately 16: quite a noticeable difference!

[5] Depending on the context, one kind of mistake will have worse consequences than the other. This means that, even if the accuracy is not as high as you would like it to be, it could still be acceptable if the only kind of mistakes made are those with less severe consequences.

[6] Note that both contribution values (spam and ham) are negative. This happens because they are computed as the logarithm of a probability. A probability value will always be between 0 and 1, and when computing the logarithm of a value between 0 and 1 the result will always be a negative number.

[7] It’s better to execute the classifier multiple times since, as mentioned before, at every execution a different test set is extracted and thus results might vary.

[8] As a sociological note, we discovered that manually checking thousands of messages makes interns very happy!

[9] One might wonder why we did not just use a simpler metric, such as the absolute difference between the two contribution scores. Since the entropy computation is based on a probability distribution, it is also robust for applications where the labels/classes are more than two, meaning that the technique can be applied in a more general context.



Giuseppe Maggiore holds a PhD in Computer Science, is CTO of Hoppinger, and teaches programming at Rotterdam University of Applied Sciences.