Understand the GA4 BigQuery raw data with consent mode activated

Guillaume Wagner
May 5, 2024 · 6 min read

The impact of GA4’s consent mode on BigQuery is sometimes difficult to assess, because the data collected through “cookieless pings” is, by design, very hard to stitch or analyze.

Should you partially or totally ignore it? Should you embrace it? Or should you just go on a rampage and destroy everything? I will help you navigate this and propose potential solutions to prevent your brain from melting.

Just breathe, it’s gonna be alright

What’s all the fuss about?

When you operate in a country that has privacy regulations, you are legally bound to request user consent before collecting their data. GA4 is strongly impacted by this loss of data. That’s why Google introduced “cookieless pings”. T̶h̶e̶ c̶o̶n̶c̶e̶p̶t̶ i̶s̶ s̶i̶m̶p̶l̶e̶:̶ p̶e̶o̶p̶l̶e̶ d̶o̶ n̶o̶t̶ g̶i̶v̶e̶ t̶h̶e̶i̶r̶ c̶o̶n̶s̶e̶n̶t̶ b̶u̶t̶ y̶o̶u̶’r̶e̶ c̶o̶l̶l̶e̶c̶t̶i̶n̶g̶ t̶h̶e̶i̶r̶ d̶a̶t̶a̶ a̶n̶y̶w̶a̶y̶. When people do not give their consent, you can still collect anonymous events that help GA4 model the reality of your website activity. Let’s review some of the impacts this can have on your data in BigQuery:

No User Pseudo ID

▶ Those events will have no user_pseudo_id, which is what you usually use to count users based on their cookie. If you have a user_id, you should probably put rules in place not to collect it for non-consenting users, but I’ll let your legal team decide that.

No Session ID

▶ Those events will have no ga_session_id parameter either. Nice, huh?
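You can measure how much of your traffic is affected by both gaps with a quick query against the raw export. This is a sketch: the table name is a placeholder for your own `analytics_XXXXXX` export dataset, and the date range is arbitrary.

```sql
-- Share of cookieless events: no user_pseudo_id and no ga_session_id.
-- Replace `project.analytics_123456.events_*` with your own export tables.
SELECT
  COUNTIF(user_pseudo_id IS NULL) AS cookieless_events,
  COUNTIF((SELECT value.int_value FROM UNNEST(event_params)
           WHERE key = 'ga_session_id') IS NULL) AS events_without_session,
  COUNT(*) AS total_events
FROM `project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240401' AND '20240430';
```

The ratio of cookieless to total events is a useful first number to share with stakeholders before any methodology discussion.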

Unreliable session_start and first_visit

▶ Each page reload triggers a new first_visit and a new session_start, which means you can’t count new users or sessions with those events.

The Single Page App case

… except in the case of Single Page Apps, where, technically, the page context is not reloaded when navigating. However, when people refresh the page manually, or follow an internal link that opens a new tab, it still happens. TL;DR: still unreliable, “same same but different”.
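One way to see this inflation in your own data is to compare the raw count of first_visit events to the number of distinct consented users. In a fully consented world these would be close; with consent mode active, the event count balloons. Again, the table name is a placeholder.

```sql
-- first_visit should fire once per user, but cookieless pings re-fire it
-- on every page load, so the event count far exceeds distinct users.
SELECT
  COUNTIF(event_name = 'first_visit') AS first_visit_events,
  COUNT(DISTINCT user_pseudo_id) AS distinct_consented_users
FROM `project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240401' AND '20240430';
```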

What happens when you compare visit metrics?

Consider a simple query run in both systems, over the same period, for a property that does not collect any user_id. GA4 is inferring sessions, users and new users, and is obviously not using the raw data alone.
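The raw-data side of that comparison can be sketched as follows; session counting uses the usual user_pseudo_id + ga_session_id concatenation, and the table name is a placeholder for your export dataset.

```sql
-- Raw-data equivalents of users, new users and sessions. Cookieless events
-- carry no user_pseudo_id, so the DISTINCT counts cover consented traffic only.
SELECT
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNTIF(event_name = 'first_visit') AS new_users,
  COUNT(DISTINCT CONCAT(user_pseudo_id, '-', CAST((
    SELECT value.int_value FROM UNNEST(event_params)
    WHERE key = 'ga_session_id') AS STRING))) AS sessions
FROM `project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240401' AND '20240430';
```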

The comparison shows a clear pattern. For metrics that GA4 calculates from events (new_users and sessions), GA4 lands in the middle: it knows the event count is no longer real. For metrics based on distinct user keys in the database, it inflates them, since it knows data is missing.

What happens when you compare specific events?

Miracle: it all matches perfectly. You will probably think this is excellent news. It is, but you need to be careful. If you calculate, for example, pageviews per session, your pageview count is accurate, but your session count is heavily impacted by consent mode, which means your pageviews-per-session ratio is skewed. Same for conversion rates 🥵

One could argue that analytics is not about absolute numbers but about trends. True, yes. Perfect. However:

  • If you’re reading this article, you are probably not the data end user, and end users might not treat the numbers with the same caution
  • If some of your regions have a consent management platform (Quebec? California? Europe?) and others have no CMP at all, you could very quickly be confronted with biased comparisons between regions.

What are your solutions?

Cautious use of event vs. visit metrics in BQ
The first golden rule is probably to stop thinking in terms of users and sessions in BigQuery. If you do, be careful and make sure to only look at trends, not absolute numbers. Even trends should be taken with caution, though, because a change in your cookie consent rate will shift them. For example, if you have a banner and switch to a popup, your consent rate will very probably increase, which will show up as more visitors.
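A simple way to keep an eye on this is to track a daily consent-rate proxy alongside your traffic metrics, so a CMP change doesn’t get misread as a traffic trend. A sketch, with the usual placeholder table name:

```sql
-- Daily consent-rate proxy: share of events carrying a user_pseudo_id.
-- A banner-to-popup switch shows up as a step in this curve.
SELECT
  event_date,
  ROUND(COUNTIF(user_pseudo_id IS NOT NULL) / COUNT(*), 3) AS consent_rate
FROM `project.analytics_123456.events_*`
GROUP BY event_date
ORDER BY event_date;
```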

Filter non-consenting people out of your reports
The second possibility is to only analyze the people who consent to your event collection. That’s the easiest solution to implement (hi, “where user_pseudo_id is not null”, bye). However, understand that consent is not uniformly distributed across your audience. Studies show that consent is higher among people who already know your brand or website, which gives you a new bias to consider. Also, if your end users care about “getting the same numbers as the GA4 interface” (they should not, but believe me, that’s not easy for everyone to accept), this technique will not work on its own.
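Concretely, the filter is one WHERE clause on top of the session count from earlier; table name is a placeholder:

```sql
-- Consented-only sessions: simply drop cookieless events.
SELECT
  COUNT(DISTINCT CONCAT(user_pseudo_id, '-', CAST((
    SELECT value.int_value FROM UNNEST(event_params)
    WHERE key = 'ga_session_id') AS STRING))) AS consented_sessions
FROM `project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240401' AND '20240430'
  AND user_pseudo_id IS NOT NULL;
```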

Use the API for some metrics
Keep using the raw data for user-flow analysis and most ecommerce reports, and use the API to extract reports like traffic acquisition, or any other report that could be biased in your context. This usually gives business users a lot of confidence, since the numbers almost (see below) match the interface. Tools like Airbyte or Meltano do this very well with very little effort, but it adds another pipeline to maintain… and who likes new pipelines to maintain?

Also, the API (and the interface) has one drawback you’ll be very happy I told you about before you lose your mind: the sum of the values of all rows in a report is always higher than the total the report shows. Yes, this is awful.

In this example, a difference of 5%. The gap depends heavily on the number of dimensions and their cardinality.

Contact me if you want to know more. I’ll probably prepare an article about this eventually.

Reproduce GA4 modeling in BigQuery
The most exciting yet most complicated of your solutions. Why wouldn’t you be able to do what Google does? You have the data too; it should not be complicated, right?

Well, I don’t know: I’ve never tried it yet, but I fear it as much as I want it. I already know the story: you get a first interesting result in a reasonable time, you’re excited and confident, you feel close to being able to use it, but the last 20% of the effort needed to get it production-ready turns out to be 80% of the project’s time. The devil lies in the details.

Actual photo of your business users waiting for their model to get in production

Another point: we do not really know whether Google only uses your own property’s data in its model. Google has access to data from the majority of the planet’s websites. Maybe it was used to train the model? It’s a black box.

I’ve seen POCs in LinkedIn posts lately where people do basic statistical calculations to model those metrics. For example, you take the number of events per session and per user in consented data and use it to infer the metrics of non-consented data. To be statistically acceptable, this requires consented and non-consented visitors to behave the same way, which, as discussed above, studies show they do not. So, what’s the value? Hard to say ¯\_(ツ)_/¯
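For what it’s worth, the naive version of that inference fits in one query. This is a sketch of the flawed approach, not an endorsement: it bakes in the very assumption (identical behavior across consent groups) that the studies above contradict. Table name is a placeholder.

```sql
-- Naive inference: estimate "missing" sessions by dividing cookieless
-- events by the average events-per-session of consented traffic.
-- Assumes non-consented visitors behave like consented ones (they don't).
WITH consented AS (
  SELECT
    COUNT(*) AS events,
    COUNT(DISTINCT CONCAT(user_pseudo_id, '-', CAST((
      SELECT value.int_value FROM UNNEST(event_params)
      WHERE key = 'ga_session_id') AS STRING))) AS sessions
  FROM `project.analytics_123456.events_*`
  WHERE user_pseudo_id IS NOT NULL
),
cookieless AS (
  SELECT COUNT(*) AS events
  FROM `project.analytics_123456.events_*`
  WHERE user_pseudo_id IS NULL
)
SELECT
  consented.sessions
    + CAST(cookieless.events / (consented.events / consented.sessions) AS INT64)
    AS estimated_total_sessions
FROM consented, cookieless;
```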

Conclusion

I understand this can be disappointing, as there is no perfect way of tackling this, but this article gives you a good overview of my humble experience on the subject.

The solutions highlighted in this article should probably all be used together, depending on the context and the maturity of your team. Once again, don’t overthink things: it’s just web analytics. You will be much more efficient if you accept the data as it is, share this knowledge with business users, and take its biases into account when producing analyses.


Guillaume Wagner

Data Collection and Engineering at Adviso, Montreal. Certified GCP Professional Data Engineer. All opinions are my own.