EML Blog #6: Creating Value
Ever since the invention of email, there has been an arms race between spam emails and the filters designed to catch them. It is a complex problem, and one that machine learning has tried to help mitigate. Every day, 97 billion spam emails are sent, 973 million of which contain malware.
According to one study, spam email costs society $20 billion every year. It takes human effort to create the software that filters out spam emails, and to delete the ones that get through. Malware removal takes time, and recovering files from distributors of ransomware can cost both time and money. Machine learning is a key candidate for reducing those costs by improving spam detection.
The first spam messages sent via email across the Internet as we know it were, surprisingly, advertising lawyers’ services. They were a small fraction of the emails sent in 1994, but today, of every ten emails sent, eight are spam. If this torrent of spam emails were allowed to land in the inboxes of everyone with an email address, the service would be rendered unusable. Internet service providers (ISPs) tried whitelisting legitimate email senders in an attempt to stem the tide, but this was ineffective.
Modern spam filters use machine learning methods to identify spam emails. The problem is straightforward to state, if difficult to solve. An email is either spam or legitimate, so a classifier is needed to determine which it is. Many features may be extracted from an email to help determine its legitimacy, including certain message headers and how often certain words are used. Consideration must also be given to which kinds of mistakes the classifier can be permitted to make and which it cannot afford. Specifically, false negatives, in which a spam email is not identified as such, are a mere annoyance; false positives, on the other hand, can cause serious problems if an important email is marked as spam without the recipient’s knowledge. As the algorithms improve, their tendency to make these mistakes decreases.
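To make the idea concrete, here is a minimal sketch of a word-frequency classifier. It uses naive Bayes, which is only one common choice and not necessarily what any particular email provider runs. The class name `NaiveBayesSpamFilter`, the `spam_threshold` value of 0.9, and the toy training messages are all assumptions made up for illustration. The high threshold reflects the point above: the filter only calls a message spam when it is quite sure, accepting more false negatives in exchange for fewer false positives.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real filter would also use headers)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical toy filter: multinomial naive Bayes over word counts."""

    def __init__(self, spam_threshold=0.9):
        # Assumed value: a high threshold makes false positives rarer,
        # at the cost of letting more spam through.
        self.spam_threshold = spam_threshold
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labelled message; label is 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed log prior for the class.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Smoothed log likelihood; the extra +1 covers unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        diff = log_scores["ham"] - log_scores["spam"]
        if diff > 700:  # avoid overflow in exp(); probability is effectively 0
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))

    def is_spam(self, text):
        return self.spam_probability(text) >= self.spam_threshold


# Toy training data, invented for this example.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

for message in ("free money prize", "meeting on monday", "free lunch monday"):
    print(message, round(spam_filter.spam_probability(message), 3),
          spam_filter.is_spam(message))
```

The third message mixes spam-like and legitimate words. With the cautious threshold it is delivered rather than filtered, which is the trade-off described above: better to let an ambiguous message through than to hide a real one.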
Failure is an important part of how these algorithms learn. Every online email client features a button to mark a false negative as spam, or to mark a false positive as not-spam. These buttons provide feedback to the algorithm, allowing it to learn to identify new kinds of spam, or to learn to better identify old ones.
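The feedback loop can be sketched in a few lines as well. The example below uses a simple mistake-driven (perceptron-style) update, chosen only because it is short; real mail services almost certainly use something more elaborate. The class `FeedbackSpamFilter` and its `report` method are hypothetical names standing in for the "mark as spam" and "not spam" buttons.

```python
class FeedbackSpamFilter:
    """Hypothetical toy filter that learns only from user corrections."""

    def __init__(self, learning_rate=1.0):
        self.weights = {}  # per-word weight; positive means "spam-like"
        self.bias = 0.0
        self.learning_rate = learning_rate

    @staticmethod
    def _words(text):
        # Each distinct word counts once per message.
        return set(text.lower().split())

    def score(self, text):
        return self.bias + sum(self.weights.get(w, 0.0) for w in self._words(text))

    def is_spam(self, text):
        return self.score(text) > 0

    def report(self, text, actually_spam):
        """Simulate the user pressing 'mark as spam' or 'not spam'."""
        if self.is_spam(text) == actually_spam:
            return  # the filter was already right; nothing to learn
        # Nudge weights toward the correct answer for the words in this message.
        step = self.learning_rate if actually_spam else -self.learning_rate
        for w in self._words(text):
            self.weights[w] = self.weights.get(w, 0.0) + step
        self.bias += step


mailbox = FeedbackSpamFilter()

# A false negative: new spam slips through, and the user flags it.
mailbox.report("claim your free crypto prize", actually_spam=True)
print(mailbox.is_spam("free crypto giveaway"))

# A false positive: a legitimate message is caught, and the user rescues it.
mailbox.report("team lunch on friday", actually_spam=False)
print(mailbox.is_spam("team lunch on friday"))
```

Each press of a button only changes the model when the filter was wrong, which is the sense in which failure drives the learning.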
Note: The paper linked inline above looks like a good source. It may be useful in the future to anyone interested in using machine learning methods to classify text documents, or for some related task.