I am a naive Bayes classifier 

arthur johnston
2 min read · Dec 22, 2013

My bank sends me a lot of mail. This comes as no surprise: monthly statements, credit card bills, notifications of ‘deals’, federally required notifications that they’re selling my personal data, etc. Since I pay online on the same day every month, I let this mail pile up, and once every few months I go through it in a batch and handle all the junk from all my vendors at once.

By ‘handle’ I mean throw most of it out without opening it. Last time I did this, though, when I got to the bottom of the pile I noticed a piece of mail from my bank that, for some reason, I opened. It was a check they had sent me to refund a fee they had mis-charged on an account that I had closed.

After I cashed it I was thinking, “Why did I open this?” It was a split-second decision, but reviewing my thinking, it went:

  • From my bank: spam
  • Weird size: not spam
  • From a different address: not spam
  • Very thin: not spam

And I do this with all of my mail, figuring out whether it’s junk or not, all in the course of a minute or two.

If you’ve been on the internet more than a few years, you remember when spam was a real issue, one that people thought might break the internet. Whole posts on Slashdot would be filled with people posting complicated solutions that wouldn’t work. Then the problem of spam just went away. People still complain about it, but it’s like comedians complaining about airline food: everyone agrees spam is annoying, but no one cares because it’s no longer a critical issue.

One of the suggested solutions that actually got implemented was Bayesian filtering. The Wikipedia article does a good job of explaining the technique, but to summarize: it looks at each word in the email and estimates the chance that the email is spam, given that it has that word in it. It then combines those per-word chances into a single score for the whole email. So an email with the word “Sexy” is more likely to be spam than an email with the word “funeral”.
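To make that summary concrete, here is a minimal sketch of the idea in Python. It is not how any particular mail filter is implemented: the function names (`train`, `spam_probability`) and the four training emails are made up for illustration, it only scores the words that actually appear in the message, and it uses add-one smoothing so an unseen word never gives a probability of zero.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each word.

    `messages` is a list of (words, is_spam) pairs (hypothetical toy format).
    """
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for words, is_spam in messages:
        class_counts[is_spam] += 1
        # set() so a word repeated in one message is counted once
        word_counts[is_spam].update(set(words))
    return word_counts, class_counts


def spam_probability(words, word_counts, class_counts):
    """Estimate P(spam | words), treating words as independent (the 'naive' part)."""
    total = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        # Start from the prior: how common this class is overall.
        log_p = math.log(class_counts[label] / total)
        for word in set(words):
            # P(word present | class), with add-one smoothing.
            p_word = (word_counts[label][word] + 1) / (class_counts[label] + 2)
            log_p += math.log(p_word)
        log_score[label] = log_p
    # Turn the two log scores back into a probability that sums to 1.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


# Made-up training data: two spam emails, two legitimate ones.
training = [
    (["sexy", "singles", "click"], True),
    (["sexy", "deal", "click"], True),
    (["funeral", "service", "friday"], False),
    (["lunch", "friday"], False),
]
word_counts, class_counts = train(training)

p_sexy = spam_probability(["sexy"], word_counts, class_counts)
p_funeral = spam_probability(["funeral"], word_counts, class_counts)
print(p_sexy, p_funeral)
```

With this toy data, the email containing “sexy” scores above one half and the one containing “funeral” scores below it. The same arithmetic works if you swap words for envelope clues like sender, size, or thickness.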

When I looked at the physical mail in that split second, what my brain was doing would best be called naive Bayesian filtering.
