How do spammers harvest your e-mail address?

26 Mar 2014

This is an update on the research I did under the guidance of Jeff Huang (now assistant professor at Brown University), titled “How do spammers harvest your e-mail address?”

As an experienced ex-blogger and an avid internet marketer, I can firmly say that every Internet user with an email address has received spam more than once. Spam is not just unsolicited, unapproved contact by a stranger; sometimes spam can lead to loss of money and even theft of identity. As spammers get more sophisticated, it becomes difficult for anyone to differentiate spam emails from genuine emails. Companies all across the world, irrespective of their market, location or size, lose resources, staff time and money to spam. Tech companies, like Google and Yahoo, use about 30 billion watts of electricity (1), enough to power 3 million houses for a year. It’s amazing just to think about how much energy and money these companies would save if there were no spam.

But it’s not just the companies that suffer losses from email spam. People like you and me spend a lot of time reviewing spam, and even with strong spam filters we still have to check our spam folders for misclassified email. Almost everyone has had an important email land in the spam folder.

Back in August 2012, I stumbled upon this amazing research opportunity presented by the Information School at the UW: to be a part of the research to find out how spammers harvest people’s email addresses, and how it can be prevented. I instantly related to this research since I have been at the receiving (and sadly, sending) side of spam emails, and I understand what a spam email really costs: nothing to the sender, a lot to the receiver.

I was super excited to get started with this research project. We were super lucky to be funded early on Experiment.com and I worked on this for about a year. We were looking at the root of the problem — how spammers get your email addresses in the first place. To make a medical analogy, there is treating the illness after it’s already happened, and there is learning more about the root cause of the illness so it can be prevented completely.

Now, there have been many such studies in the past (see Related work below), so how is this study different from the others? For one, this is new, raw data we collected. The Internet space moves so fast that to stay ahead of spammers and scammers, you have to find out how they operate, and how they operate changes all the time. Secondly, most other (newer) studies are not as comprehensive as this one. They usually just cover social network spam, or sites selling email lists. We tried to cover as much ground as possible, and collect as much data as possible to back us up.

So, here are the results we got after one year of waiting and after posting email addresses to numerous different types of websites and online services.

First, a cloud.

This is a simple cloud of the words spammers use the most in their email subject lines. The most used words, as seen, are “free”, “credit”, “score”, “home”, etc. I expected “viagra” to be one of the top words, but I guess they are keeping it out of subject lines to bypass spam filters.
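For the curious, a word cloud like this boils down to counting words across subject lines. Here is a minimal sketch of that counting step. The function name, the stop-word list, and the sample subjects are all made up for illustration; this is not the script used in the study.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would likely use a longer one.
STOP_WORDS = {"a", "an", "the", "to", "for", "of", "and", "your", "you", "in", "on", "is"}


def top_subject_words(subjects, n=10):
    """Return the n most common non-stop-words across the subject lines."""
    counts = Counter()
    for subject in subjects:
        # Lowercase and keep alphabetic runs only, so "FREE!!!" counts as "free".
        words = re.findall(r"[a-z]+", subject.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up subject lines, for illustration only.
sample = [
    "Get your FREE credit score today",
    "Free home insurance quote",
    "Your credit score is ready",
]
print(top_subject_words(sample, 3))
```

The word sizes in a cloud are then just these counts scaled to font sizes.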

And now, the meat of the study. Or should I say the spam. ;-)

First, where is our data coming from?

Email addresses were posted to a variety of different platforms in different quantities. Each address was posted only once on the web to make sure we got very pristine data.

A human-readable list of the platforms with the number of emails posted there:

  • App store reviews (Apple, Chrome, Firefox etc): 4

Since we wanted to study how to prevent spam, I also tested a lot of different email obfuscation techniques, and the graph below shows the distribution of them.

(See what they mean)

And a human-readable list of obfuscations with the number of emails that were used for each one:

  • splitting: 6

I also tried some different ways of spreading email, like making email anchored vs plain text or clickable vs bare text.

So what’s an anchored email link? This. And what’s not? This -> karan@goel.im. In some cases, for example when emails were embedded in images, the distinction does not apply. Those are marked “invalid”.

The way we classified emails as “clickable” or not follows almost the same rules: this is clickable but karan@goel.im is not.

How much spam did we get, and when?

At the time of producing these charts, we had received about 18,000 spam emails across the roughly 1,000 email addresses posted.

As seen in the graph, the number of emails per week peaked after 20 weeks. This 20-week period could be the time it took search engines to crawl, index and rank the pages where these emails were posted.

Also see the gallery of charts for emails per week by different platforms.

As seen in the timeline above, there’s an influx of weekly spam in late October (that’s 2 months after the study began), and during the holiday season. These were the times when email posting activity was at its peak, so seeing more spam is expected.

The weekly activity of spammers seems pretty consistent and not surprising. Early weekdays are busy, but activity settles down a bit on the weekends.

Not many people check their emails on weekends, so no point in wasting bandwidth, right?
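The per-week and per-weekday counts discussed above come down to bucketing each email’s received timestamp. A minimal sketch of that bucketing follows. The function names, the study start date, and the timestamps are assumptions made for the example, not the study’s actual data or code.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def emails_per_week(received, study_start):
    """Count emails per week, where week 0 begins at study_start."""
    counts = Counter((ts - study_start).days // 7 for ts in received)
    return dict(sorted(counts.items()))


def emails_per_weekday(received):
    """Count emails per day of the week, skipping days with no email."""
    counts = Counter(WEEKDAYS[ts.weekday()] for ts in received)
    return {day: counts[day] for day in WEEKDAYS if counts[day]}


# Assumed start date and made-up timestamps, for illustration only.
start = datetime(2012, 8, 1)
received = [
    datetime(2012, 8, 3, 9, 30),
    datetime(2012, 8, 9, 14, 0),
    datetime(2012, 8, 10, 8, 15),
    datetime(2012, 12, 20, 11, 45),
]
print(emails_per_week(received, start))
print(emails_per_weekday(received))
```

Counting from the study start rather than by calendar week keeps “week 20” meaning the same thing for every address.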

Platforms

As seen in the box plot below, almost all platforms I posted to sent spam. Whether they were publicly indexable by search engines or not, every website has the potential to be scraped by spammers.

Notably, spammy mailing lists send the most spam. These mailing lists include sites that promise you free credit scores, insurance quotes, free iPads, etc. These sites stink of spam, but people still continue to give them their email addresses.

Surprisingly, emails in whois details of a domain also sent a ton of spam, even though most whois services hide emails as an image. Why is that? Because the whois sites that do not obfuscate email addresses give spammers enough to harvest from.

Some platforms that sent no spam at all:

  • App Store (Email posted in reviews)

Obfuscation

(I can finally spell this word!)

I used a ton of obfuscation techniques to test as many of them as possible.

Simple name mangling (email [at] irchiver.com etc.) no longer works, sadly. Some more modern ways do work: ROT13, ASCII characters, etc. seem to work the best. To use them, though, you need scripts or software that might not always be available.
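To get a feel for why simple mangling is so easy to defeat, here is a minimal sketch of what a harvester could do. The patterns and the harvest function are hypothetical; the study did not inspect what real spammers run, so treat this as one plausible approach.

```python
import re

# Hypothetical normalisation rules a scraper might apply before matching.
# The exact patterns real harvesters use are not known from the study.
AT_PATTERN = re.compile(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(text):
    """Undo common bracketed manglings, then extract anything email-shaped."""
    normalised = AT_PATTERN.sub("@", text)
    normalised = DOT_PATTERN.sub(".", normalised)
    return EMAIL_PATTERN.findall(normalised)


print(harvest("Contact: email [at] irchiver.com"))
print(harvest("jane (at) example [dot] org"))
```

Three short regular expressions are enough to undo the mangling, which fits with these addresses still receiving spam.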

Obfuscation strategies that worked the best and received no spam at all:

  • <span> splitting (In HTML, parts of email in different <span> tags)
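As a rough illustration of span splitting, the sketch below breaks an address into small chunks and wraps each one in its own span tag. The function name and the chunk size of 3 are assumptions for the example; the study does not say exactly where the addresses were split.

```python
from html import escape


def span_split(address, chunk_size=3):
    """Wrap consecutive chunks of the address in separate <span> tags."""
    chunks = [address[i:i + chunk_size] for i in range(0, len(address), chunk_size)]
    return "".join(f"<span>{escape(chunk)}</span>" for chunk in chunks)


markup = span_split("user@example.com")
print(markup)
```

A browser renders the spans side by side, so a visitor sees the full address, while the page source never contains it as one unbroken string.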

So how should I share my emails online?

I’m glad you asked. Our study indicates that spammers are getting more and more sophisticated, and getting access to even private pages and databases to harvest emails from. Their parsers work better than ever, and simple email mangling is just not as effective now.

So is there no way to beat spammers? Of course there is.

  • If you’re posting an email address on a website whose source code you can control, use a cipher or ASCII or Unicode encoding. This makes your email look like normal text, but the source is nothing like normal text.
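Here is a minimal sketch of two such encodings, ROT13 and HTML character references. The function names and the sample address are made up for illustration, and this is not necessarily how the addresses in the study were encoded.

```python
import codecs
from html import unescape


def rot13(address):
    """Apply ROT13: letters rotate by 13 places, other characters are unchanged."""
    return codecs.encode(address, "rot_13")


def html_entities(address):
    """Encode every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)


address = "user@example.com"
print(rot13(address))
print(html_entities(address))

# Both encodings are reversible: ROT13 is its own inverse, and browsers
# (or html.unescape here) turn character references back into text.
print(rot13(rot13(address)) == address)
print(unescape(html_entities(address)) == address)
```

Character references are decoded by the browser itself, so visitors see the normal address with no extra work. ROT13 needs a small script on the page to decode it before display, which is the “scripts or software” caveat.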

At the end of last summer, when I was crunching the data set for interesting findings, I discovered that for some reason, some of the data was tainted by Gmail. Jeff had set a filter at the beginning of the study to redirect all email (including and especially spam) directly to the inbox. However, Gmail randomly decided to not do that, and so we lost some part of the data (spam is deleted by Gmail every month). We do not know how much data we lost (I estimate about 5–10%), but it is enough to prevent us from publishing a paper. ☹

I have made all scripts used to analyse the data available on Github. This code was written when I was just starting to learn Python, so please excuse the code quality.

What do you think of our research? Reach out to me via Twitter (@KaranGoel) or via email: karan@goel.io

References:

(1) James Glanz, “Power, Pollution and the Internet” (NYTimes), 22 September 2012 http://www.nytimes.com/2012/09/23/technology/data-centers-waste-vast-amounts-of-energy-belying-industry-image.html

Related work

Kyumin Lee, James Caverlee, Steve Webb, “The Social Honeypot Project: Protecting Online Communities from Spammers”, 2010

Craig A. Shue, Minaxi Gupta, Chin Hua Kong, John T. Lubia, Asim S. Yuksel, “Spamology: A Study of Spam Origins”, 2009

Center For Democracy & Technology, “Why Am I Getting All This Spam?”, 2003


Originally published at karan.github.io.

Karan Goel

Karan builds and ships software at Google and writes about…

karan goel 🚶🏽

Written by

Engineering the best ☁ @Google • CS @UW ’16 • Interned @Mattermark (W16), @Google (S15, S14), @MadronaVL (Sp15) • Founded @DubHacks
