Google Analytics modifies your data

Timothy Carbone
Oct 16, 2017
Google Analytics on mobile by Edho Pratama

Google Analytics is the leading analytics tool out there, at least by the number of companies and websites using it.

It’s a good tool for getting your first insights quickly, whether you’re working alone or with a really small team. It will help you make first-level marketing decisions, and you’ll be able to monitor part of their impact.

If you’re a data person, you might consider moving away from it. The reason is really simple: the data philosophy behind Google Analytics is wrong, terribly wrong.

Hey! What’s a data philosophy?

Your data philosophy is the sum of all the decisions you make when you evaluate your data. Here is a subset of questions whose answers would likely describe how you feel about data.

  • Do you care about statistical significance? Or is high volume enough?
  • Do you care about bias in 1%, 5%, 10% of your data?
  • Do you care about 1%, 5%, 10% erroneous data in your analysis?
  • Do you care about that day you were missing data for?
  • Would you rather make a decision or say “I don’t know” until you’re 100% sure?
  • How certain do you need to be before saying “I know”? 90%, 99%, 100%?

There’s no right or wrong answer. Depending on your activity and how much accuracy matters to it, your choices may vary. One thing is for sure: you’ll be the one making the choices.

With Google Analytics, guess what? You’re not.

Google Analytics makes a lot of decisions about your data, and they’re not conservative at all.

The only time you see “I don’t know” in GA is literally when the tool has no clue what’s happening. As soon as there’s a hint somewhere, GA will use it to infer what happened. The fact is, Google Analytics would rather give you wrong facts than admit it doesn’t know. And if you’re not aware of that, you might just consider the data accurate, because GA will never tell you which decisions it made.

If we get back to the data philosophy, this is certainly a very special one. I’m pretty sure that making arbitrary decisions that affect the data itself, then using that inferred data to display insights as facts, is not worthy of a decent data analyst.

There are already countless articles scattered around the web explaining why some data concepts are handled quite terribly by GA (sampling, session duration …) and I don’t want this to be another one. Instead, I’ll use one single example to illustrate its data philosophy.

GA decides what acquisition channel the user came from.

Yes. You don’t believe me? Look at this chart from the Google Analytics documentation.

Google Analytics processing flow chart for user acquisition

When GA doesn’t know where a user came from, it tries to find a previous acquisition source for the same user. If the user came from a known source within the campaign timeout period (which defaults to 6 months), GA attributes the new session to that same source again. Let’s use an example to show how this works:

  • John searches for something in Google and finds your website
  • John really appreciates your website and bookmarks it
  • Every day, John returns to your website directly

Well, you guessed it. GA can’t tell where John is coming from every day because John is coming directly. So GA searches for a previous acquisition source for John and finds… Google Search. From there, GA assumes that every time John arrived on your website directly, he was actually coming from Google Search. GA doesn’t tell you the whole story: it shows you that your organic traffic is through the roof because John supposedly comes from Google Search every day. It will also tell you that your direct retention is not that good… when it actually is!
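To make John’s case concrete, here is a minimal sketch of the rule described above. It is not GA’s actual implementation: the `Session` class, the `attribute` function, the source labels, and the approximation of 6 months as 183 days are all assumptions made for illustration, and the timeout is simply measured from the user’s last non-direct session.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumption: the default 6-month campaign timeout, approximated as 183 days.
CAMPAIGN_TIMEOUT = timedelta(days=183)


@dataclass
class Session:
    user_id: str
    started_at: datetime
    raw_source: str  # what was actually observed, e.g. "google / organic" or "direct"


def attribute(sessions: List[Session]) -> List[str]:
    """Return the *reported* source of each session, in chronological order."""
    # user_id -> (source, timestamp) of that user's last non-direct session
    last_known: Dict[str, Tuple[str, datetime]] = {}
    reported: List[str] = []
    for s in sorted(sessions, key=lambda x: x.started_at):
        if s.raw_source != "direct":
            # A known source is reported as-is and remembered for later.
            last_known[s.user_id] = (s.raw_source, s.started_at)
            reported.append(s.raw_source)
            continue
        previous = last_known.get(s.user_id)
        if previous and s.started_at - previous[1] <= CAMPAIGN_TIMEOUT:
            # Direct session overwritten with the earlier source.
            reported.append(previous[0])
        else:
            # Nothing to infer from: only now is the session reported as direct.
            reported.append("direct")
    return reported


# John finds the site through a search once, then comes back directly for 3 days.
day0 = datetime(2017, 10, 1)
john = [Session("john", day0, "google / organic")] + [
    Session("john", day0 + timedelta(days=d), "direct") for d in range(1, 4)
]
print(attribute(john))
```

Under these assumptions, all four sessions are reported as "google / organic", even though three of them were direct.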

NB: I’ve been using Google Search as an example but there’s no conspiracy from Google there. I could have used Bing and the result would be the same.

And by the way, this also happens with paid traffic. The paid acquisition source will be attributed to all the following direct sessions for the same user. So this impacts all your marketing numbers as well.

Negligible or not, the issue is conceptual

Yes, I know… Some of you might say: “direct traffic is a really small % of our acquisition so it’s negligible”. First, where do you get that from? Because if it’s from Google Analytics, I’ve just shown you that it’s probably not reliable information, since “direct” traffic gets overwritten whenever possible.

Plus, GA defines “direct” traffic as traffic with a missing referrer header, which is incorrect. More and more web browsers strip the referrer header from their requests (mostly on HTTPS traffic) or simply let their users opt into doing so. GA treats all that traffic as “direct” and tries to overwrite it. So yes, GA is inferring more and more data, to the point where you can ask whether all that inferred data means anything at all… A user could arrive from a referring website that doesn’t transmit the referrer header (blocked by the website or by the client), and Google would attribute that visit to paid traffic because the same user recently clicked a paid advertisement.
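The two steps in that last scenario can be sketched separately: first a hit is classified from what it carries, then a “direct” result gets overwritten. Again, this is only an illustration under stated assumptions. The function names, the channel labels, and the tiny `SEARCH_HOSTS` list are hypothetical and far simpler than whatever GA really does.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical, heavily simplified list of search engine hosts.
SEARCH_HOSTS = {"www.google.com", "www.bing.com"}


def classify_raw(referrer: Optional[str], paid_click: bool = False) -> str:
    """Classify a single visit using only the information it carries."""
    if paid_click:
        return "paid"
    if not referrer:
        # No referrer header: maybe a bookmark, maybe a stripped referrer.
        # The two cases are indistinguishable here, and both become "direct".
        return "direct"
    host = urlparse(referrer).netloc
    return "organic" if host in SEARCH_HOSTS else "referral"


def reported_channel(raw: str, last_non_direct: Optional[str]) -> str:
    """Overwrite "direct" with the last known non-direct channel, if there is one."""
    if raw == "direct" and last_non_direct:
        return last_non_direct
    return raw


# The user clicked an ad last week; today they arrive from a website
# that does not transmit the referrer header.
raw = classify_raw(referrer=None)
print(raw, "->", reported_channel(raw, last_non_direct="paid"))  # direct -> paid
```

The visit really was a referral, was classified as direct, and ends up reported as paid: two inferences stacked on top of each other, neither of them visible in the report.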

I believe this exposes a very simple thing about Google Analytics: it would rather give you wrong data/insights/facts than say that it doesn’t know. And since it knows less and less as time passes, it invents more and more. Where’s the limit?

Is it OK to let Google Analytics make such decisions? For some, they might make sense; for others, it’s inconceivable. It goes back to whether you’re a data person or not and your personal data philosophy.

Conclusion is …

I really have mixed feelings about GA. I honestly think that it’s a really great tool for people who need basic insights into what’s happening on their website, but as soon as you want or need data accuracy, it’s over.

The issue is that it became an industry standard, with everyone using it and knowing how to use it without really understanding what it does behind the scenes. This is literally the definition of danger when it comes to data: believing without understanding.
