No Spam for You!!!

Published in TL;DR Innovation, Feb 22, 2018

Unified E-mail Filtering Technology

My e-mail to our magazine staff containing installments of this contributed column stopped being delivered a while back. There was no returned mail or out-of-office notification, just, well, nothing. It took some time to realize our line of communication had been broken, and we had to fall back on the good old handshaking technique of phone calls to make sure messages got through. After a bit of snooping, I discovered that I, or rather my exuberance, was to blame. Considering e-mail a less formal mode of communication than postal mail, I had fallen into the habit of signing off with the phrase “Thanks!!!”, three exclamation marks and all. Unfortunately, to a filter for unwanted e-mail, or “spam,” that closing closely resembles “Viagra!!!” or “Get Rich!!!” and thus my e-mail was quickly shuttled off to the nether regions of ether space.

Filtering unwanted signals or noise has long been an important topic in data acquisition and instrument design. Noise can mask the signal, making it difficult to find, while overly aggressive noise filtering can attenuate or remove the signal altogether. Commercial e-mail providers and software companies have developed or adopted differing methods of spam filtering, each with its own strengths and weaknesses. Dr. William S. Yerazunis, Senior Research Scientist with Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA, recently addressed the different filtering methods at the 2005 MIT Spam Conference. Along with his colleagues at the University of California, Riverside, Freie Universität Berlin, and Embratel, Brazil, he classifies the current methods into three primary types and proposes that they are simply special cases of a common, unified approach to e-mail filtering.

One e-mail filtering method is simply to block all e-mail from an address contained on a blacklist. After a server has been identified as a source of spam, its address is added to a local blacklist or to a larger list maintained by a third party. This filter method is 100 percent effective against spam from that server; however, in the logical endgame all servers are blocked and no mail is delivered. A second filtering method is heuristic filtering, wherein a human examines spam and non-spam e-mail and determines “likely features” that are used to trigger or mark a message as spam, much like my “If text contains (!!!), then mark as spam.” A third method is statistical filtering. As in heuristic filtering, a human classifies a group of messages as spam or not spam, but the rules of thumb are generated by an optimization algorithm, such as Bayesian classification, that performs a statistical analysis of this training set.
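The first two methods are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the host name, the rule list, and the function names `blacklist_filter` and `heuristic_filter` are all hypothetical and do not come from any real filter.

```python
import re

# Hypothetical blacklist of sending servers (illustrative host name only).
BLACKLIST = {"mail.spam-example.test"}

# Hand-written heuristic rules: (description, compiled pattern).
# A human chose these "likely features" after looking at spam.
HEURISTIC_RULES = [
    ("three or more exclamation marks", re.compile(r"!{3,}")),
    ("get-rich phrase", re.compile(r"get rich", re.IGNORECASE)),
]


def blacklist_filter(sender_host):
    """Return True if the sending server is on the blacklist."""
    return sender_host.lower() in BLACKLIST


def heuristic_filter(body):
    """Return the descriptions of every heuristic rule the body triggers."""
    return [name for name, pattern in HEURISTIC_RULES if pattern.search(body)]


if __name__ == "__main__":
    print(blacklist_filter("mail.spam-example.test"))
    print(heuristic_filter("Thanks!!!"))
```

Under these rules, a friendly “Thanks!!!” triggers the exclamation-mark rule just as readily as a genuine advertisement would, which is the weakness of a rule written by hand.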

Dr. Yerazunis suggests these filtering methods follow a proposed six-step filtering pipeline and simply employ different versions of the common components:

Initial Transformation: This first step may include forcing exotic characters into a basic character set, unpacking MIME encodings into a common representation, and HTML de-obfuscation by removing nonsense tags that are invisible to the human reader, but can be inserted to break up “spammish” key words.
Tokenization: A regular expression (regex) is used to segment the message into text strings that are converted into unique values using a look-up method.
Feature Extraction: The tokens are grouped into meaningful finite sequences (tuples) based on the words they contain or the order in which the words appear in the message.
Feature Weighting: This step draws on the prior training of the filter to rank the importance of each tuple found in the message. The weight can depend on how often the tuple has been found in spam messages, how closely the tuple resembles a known spam feature, and the size of the training set.
Weight Combination: The weights of the found features are then combined to determine the overall likelihood that the message is spam. This can be a simple linear addition of weight values or a more sophisticated nonlinear method. Examples include a method that considers the relative strengths of the sorted weights, a Bayesian combiner that considers the probability of the message being spam before and after each weight is considered, and a chi-squared method that compares the observed number of spammish tuples with an expected or acceptable number of rogue tuples.
Final Thresholding: After the weights are combined, a final “spam/not spam” decision is made based on the final value. For the statistical methods, the final threshold is often 0.5 (50 percent); however, the actual value can be adjusted by the filter designer to tune it for optimal results.
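
The six steps above can be sketched as a toy statistical filter in Python. This is a minimal sketch under several assumptions, not Dr. Yerazunis's implementation: the regular expressions, the choice of single tokens plus adjacent pairs as tuples, the add-one smoothing, the naive-Bayes-style log-odds combiner, and the four-message training set are all illustrative choices made here.

```python
import html
import math
import re

TOKEN_RE = re.compile(r"[a-z0-9$']+|!+")  # words, or runs of exclamation marks
TAG_RE = re.compile(r"<[^>]*>")           # any HTML tag


def initial_transform(raw):
    """Step 1: strip HTML tags, decode entities, and lower-case the text."""
    return html.unescape(TAG_RE.sub("", raw)).lower()


def tokenize(text):
    """Step 2: segment the text into tokens with a regular expression."""
    return TOKEN_RE.findall(text)


def extract_features(tokens):
    """Step 3: group tokens into tuples (single tokens plus adjacent pairs)."""
    return [(t,) for t in tokens] + list(zip(tokens, tokens[1:]))


def train(messages):
    """Count how many spam and non-spam messages contain each feature."""
    counts, totals = {}, {True: 0, False: 0}
    for raw, is_spam in messages:
        totals[is_spam] += 1
        for feat in set(extract_features(tokenize(initial_transform(raw)))):
            spam_n, ham_n = counts.get(feat, (0, 0))
            counts[feat] = (spam_n + is_spam, ham_n + (not is_spam))
    return counts, totals


def feature_weight(feat, counts, totals):
    """Step 4: add-one-smoothed estimate of how spammish a feature is."""
    spam_n, ham_n = counts.get(feat, (0, 0))
    p_spam = (spam_n + 1) / (totals[True] + 2)
    p_ham = (ham_n + 1) / (totals[False] + 2)
    return p_spam / (p_spam + p_ham)


def combine(weights):
    """Step 5: sum the log-odds of the weights and map back to (0, 1)."""
    log_odds = sum(math.log(w) - math.log(1.0 - w) for w in weights)
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify(raw, counts, totals, threshold=0.5):
    """Step 6: compare the combined score with a tunable threshold."""
    feats = extract_features(tokenize(initial_transform(raw)))
    score = combine([feature_weight(f, counts, totals) for f in feats])
    return score > threshold, score


# Tiny made-up training set: (message, is_spam).
TRAINING = [
    ("Get rich!!! Buy now", True),
    ("Via<b></b>gra!!! cheap", True),
    ("Meeting notes attached", False),
    ("Column draft attached, thanks", False),
]

if __name__ == "__main__":
    counts, totals = train(TRAINING)
    for text in ("Get rich now!!!", "Meeting notes attached", "Thanks!!!"):
        print(text, classify(text, counts, totals))
```

With this training set, “Thanks!!!” scores about 0.6 and is marked as spam, because the three exclamation marks appear only in the spam examples. Raising the `threshold` argument above that score would let the message through, which shows the kind of tuning the final step allows.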

In addition to unifying the description of many current spam filters, the proposed filtering pipeline closely resembles the design of a McCulloch-Pitts neural network, which connects multiple inputs (the tuple values) through weighted connections into layers of artificial neurons whose values are thresholded to compute “yes/no.” Perhaps recent developments in artificial neural networks can add a bit more intelligence to the spam filtering process and permit me to keep my exuberance!!!
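
The resemblance is easy to see in code. The sketch below is a single threshold unit in the spirit of a McCulloch-Pitts neuron, generalized to arbitrary weights. The feature names, weights, and threshold are made-up values for illustration, not trained ones.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 when the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# Hypothetical features: 1 means the feature is present in the message.
FEATURES = ["!!!", "get rich", "attached"]
WEIGHTS = [1, 2, -2]   # positive = spammish, negative = inhibitory
THRESHOLD = 2

if __name__ == "__main__":
    print(threshold_unit([1, 0, 0], WEIGHTS, THRESHOLD))  # only "!!!" present
    print(threshold_unit([1, 1, 0], WEIGHTS, THRESHOLD))  # "!!!" and "get rich"
```

Here the inputs play the role of the extracted tuples, the weights correspond to feature weighting, the sum is a linear weight combination, and the comparison is the final threshold.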

This material originally appeared as a Contributed Editorial in Scientific Computing and Instrumentation 22:10 September 2005, pg. 14.

William L. Weaver is an Associate Professor in the Department of Integrated Science, Business, and Technology at La Salle University in Philadelphia, PA, USA. He holds a B.S. degree with double majors in Chemistry and Physics and earned his Ph.D. in Analytical Chemistry with expertise in Ultrafast Laser Spectroscopy. He teaches, writes, and speaks on the application of Systems Thinking to the development of New Products and Innovation.

Written by William L. Weaver