Are you losing genuine Analytics data, thinking you’re filtering spam?

This article was first published on Moz.com.

“Don’t throw the baby out with the bathwater.” The saying dates from 1512, yet it is still not out of date. In fact, it has become very relevant to the whole discussion around solving spam in Google Analytics: frustration caused by the spam issue has led to the loss of genuine data.

“We’re losing genuine data, just to get rid of the spam.”

Spam in our Analytics could be the single most irritating thing in online marketing right now. Numerous blogs have written about it, and one solution keeps surfacing as the fastest way to get rid of it: “Only 2 filters and you’re free forever.” This strategy is mainly based on including only valid hostnames in order to filter out ghost spammers, the most aggressive type of spam.

Even though it is a valid option, it is also the option most likely to filter out genuine traffic, costing you valuable data and insights into your marketing performance and conversions.

Why is this valid option risky?

Using the ‘two filters option’ is risky, because it uses inclusion instead of exclusion and because it marks an unset hostname as spam.

Inclusion versus exclusion

  • Inclusion: Only allow data from known genuine sources
  • Exclusion: Only filter data from known spam sources
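The difference between the two can be sketched in a few lines of Python. This is a minimal illustration, not how Google Analytics implements its filters: the hit records, the `VALID_HOSTNAMES` allow-list and the `KNOWN_SPAM_SOURCES` block-list (including the `spam-referrer.example` and `new-spam.example` names) are all hypothetical.

```python
# Hypothetical allow-list and block-list; neither comes from a real account.
VALID_HOSTNAMES = {"www.mydomain.com", "mydomain.com"}
KNOWN_SPAM_SOURCES = {"spam-referrer.example"}

def inclusion_filter(hits):
    """Keep only hits whose hostname is on the allow-list."""
    return [h for h in hits if h["hostname"] in VALID_HOSTNAMES]

def exclusion_filter(hits):
    """Drop only hits whose source is on the block-list."""
    return [h for h in hits if h["source"] not in KNOWN_SPAM_SOURCES]

hits = [
    # Genuine visit through a known hostname.
    {"hostname": "www.mydomain.com", "source": "google", "genuine": True},
    # Genuine visit through a hostname nobody added to the allow-list.
    {"hostname": "blog.mydomain.com", "source": "newsletter", "genuine": True},
    # Genuine visit where a bug left the hostname unset.
    {"hostname": "(not set)", "source": "google", "genuine": True},
    # Spammer that is already on the block-list.
    {"hostname": "(not set)", "source": "spam-referrer.example", "genuine": False},
    # New spammer that is not on the block-list yet.
    {"hostname": "(not set)", "source": "new-spam.example", "genuine": False},
]

def summarize(kept):
    """Count genuine hits lost and spam hits kept by a filter."""
    genuine_total = sum(h["genuine"] for h in hits)
    genuine_kept = sum(h["genuine"] for h in kept)
    spam_kept = sum(not h["genuine"] for h in kept)
    return {"genuine_lost": genuine_total - genuine_kept, "spam_kept": spam_kept}

print("inclusion:", summarize(inclusion_filter(hits)))
print("exclusion:", summarize(exclusion_filter(hits)))
```

In this toy data set the inclusion filter lets no spam through but silently drops two genuine hits, while the exclusion filter keeps every genuine hit and lets one not-yet-listed spammer in.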

What’s a hostname?

The hostname always tells you through which domain your website was visited.

This can be any (sub)domain you claimed, like www.mydomain.com, mydomain.com, blog.mydomain.com or mydomain.co.uk. However, the hostname could also be the domain of translate services, cache services or shopping services (like translate.googleusercontent.com or paypal.com).
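A hostname check of this kind is usually expressed as one regular expression. The sketch below is an assumption for illustration only: the pattern is built from the example hostnames above, and the helper name `is_valid_hostname` is made up. A real allow-list would have to cover every hostname your own setup produces.

```python
import re

# Illustrative pattern built from the example hostnames in the text.
VALID_HOSTNAME = re.compile(
    r"(?:(?:www|blog)\.)?mydomain\.com"    # mydomain.com, www. and blog. subdomains
    r"|mydomain\.co\.uk"                   # country-specific domain
    r"|translate\.googleusercontent\.com"  # translate service
    r"|paypal\.com"                        # payment service
)

def is_valid_hostname(hostname: str) -> bool:
    """Return True only if the whole hostname matches the allow-list pattern."""
    return VALID_HOSTNAME.fullmatch(hostname) is not None

for name in ["www.mydomain.com", "mydomain.co.uk", "shop.mydomain.com", "(not set)"]:
    print(name, is_valid_hostname(name))
```

Note that a subdomain added later, such as the hypothetical `shop.mydomain.com`, fails the check until someone remembers to update the pattern.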

This risky strategy is perfect in a vacuum, but we have seen too many cases where this strategy could have gone terribly wrong:

  1. Over a span of months/years, you work with multiple people and agencies. They don’t always know what the others have set up. Do you?
  2. The internet and your business will evolve and more genuine sources will appear. Who will make sure they are always included from day 1?
  3. Plus, a minor technical error in your code may cause your hostname to be ‘not set’. This would make your genuine data ‘look’ like spam and fail the inclusion filter, without you even realising it. Do you want a minor bug to corrupt genuine data like that?

“Inclusion is perfect in a vacuum, but we have seen too many cases where this strategy could have gone terribly wrong.”

Real-life data needs a real-life solution

With the inclusion strategy, any of the above real-life scenarios causes you to lose genuine data.

In fact, one of our clients would have deleted all of his conversion data if he had used the two-filter solution, solely because of a third-party plug-in that was implemented by another agency.

The plug-in created a new ‘session’ without hostname data, and that session, instead of the real one, held the conversion data.

“One of our clients would have deleted all of his conversion data if he had used the two-filter solution.”

So what is the alternative?

The alternative is working with exclusion. Only filter spam when you’re 100% sure it’s spam.

Working with exclusion also has its downsides of course:

  1. You have to make sure your exclusion filters are always up-to-date with the latest spammers.
  2. It also means that some spam can get through, for instance visits with an unset hostname that actually come from spammers.

Based on the data in our clients’ accounts, these spammers account for 0.4% of all traffic on average (based on our 5 biggest accounts, ranging from 0.1% to 1.3%).

This means that your Analytics would, on average, keep an accuracy of 99.6% without the risk of losing genuine traffic.
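The arithmetic behind that figure is simple: accuracy is taken here as 100% minus the share of traffic that is spam. The short sketch below applies it to the reported average and to both ends of the reported range. The function name `accuracy_pct` is only illustrative.

```python
def accuracy_pct(spam_share_pct: float) -> float:
    """Share of traffic that is genuine, given the spam share in percent."""
    return round(100.0 - spam_share_pct, 1)

# Spam shares reported in the text: lowest account, average, highest account.
for label, spam in [("lowest", 0.1), ("average", 0.4), ("highest", 1.3)]:
    print(f"{label}: {spam}% spam -> {accuracy_pct(spam)}% accurate")
```

Even for the account with the most spam, the data would stay 98.7% accurate under this definition.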

Additionally, this type of spam hasn’t been observed in any of our clients’ accounts since the end of July, so the ‘corruption’ of data in our reports will diminish.

“Using exclusion would keep an accuracy of 99.6%, without the risk of losing genuine traffic.”

Now back to you

So, what’s your take?

  • Would you rather filter all spam, knowing that the filter could also catch genuine traffic or corrupt your data in the longer term?
  • Or would you rather filter only confirmed spam, knowing that a little bit of spam could get through?

This research was done for our development of www.referrerspamblocker.com, a free tool that automatically filters all known spam out of all your Google Analytics accounts, using an exclusion strategy based on source / medium data.