How to transform and export GA4 event data from BigQuery

Epsilon Engineering Blog
Feb 29, 2024

By Khushalie Nagpal and Prateek Shekhar

If you are setting up a BigQuery backend to pull GA4 data and don’t know where to start, you are at the right destination. In this article, we plan to cover everything you need to know to pull data from BigQuery: from setting up BigQuery for GA4, to transforming raw event data, to final data validation before generating business insights from the data.

In summary, we plan to cover the following:

1. Use cases explaining why you need BigQuery for your GA4 data pull

2. How to set up a BigQuery backend for your data pull

3. QC and validation of the data pulled via BigQuery

Why did we feel the need for BigQuery with GA4?

Ever since Google launched GA4, there have been discussions around why one should use BigQuery, detailing the wide range of benefits of BigQuery for GA4 over the GA4 UI alone. In this article, we will begin by sharing a set of actual business use cases from some of our pharma clients, which led us to utilize BigQuery.

1. Unavailability of some KPIs in the GA4 UI

Even though clients are transitioning from GA UA to GA4 (due to the sunset of GA UA), they are still used to, and want to keep reporting, the same (or similar) KPIs that existed while they were reporting in GA UA. One such KPI is ‘unique events’, a metric that existed in GA UA but is not present in GA4. GA4 only has ‘event count’, which is closer to ‘total events’ in GA UA.

1.1 How are Unique Events calculated in BigQuery using raw data export from GA4?

In Google Analytics, a unique event is defined as:

“Unique events in Google Analytics are acts that are only counted once per session, no matter how many times they happen within a session.”

To calculate the unique event count, Analytics increments the unique events metric by 1 the first time during a session that it receives an event with a unique combination of category, action, and label and ignores subsequent events in the session that have the same combination of ‘category’, ‘action’, and ‘label’.

Following similar lines in GA4, unique events can be calculated by taking a distinct count of the concatenation of “user_pseudo_id” and “ga_session_id” against unique combinations of the event name and corresponding event parameters.

Shared below is a piece of code written to calculate unique events for an event called video (event_name = "video") with associated event parameters "video_title" and "video_percent".
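A minimal sketch of that query, assuming a standard GA4 export table (the project and dataset names below are placeholders; substitute your own):

    SELECT
      event_name,
      (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'video_title') AS video_title,
      (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'video_percent') AS video_percent,
      -- unique events: count each user_pseudo_id + ga_session_id pair once
      COUNT(DISTINCT CONCAT(
        user_pseudo_id, '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING)
      )) AS unique_events
    FROM `your-project.analytics_123456.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
      AND event_name = 'video'
    GROUP BY event_name, video_title, video_percent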

Here is the output we got when we ran the above query for one of our clients for the month of January 2024:

Below is a screenshot of the report for the same metric during the same period:

Remember, GA4 and GA UA are two different platforms with very different measurement models. So, a marginal difference is always a possibility.

If your business or client needs to report on ‘Unique Events’ (or any such KPI that you report, or were reporting in the past, that doesn’t exist in GA4), then you know BigQuery is your way forward: BigQuery for GA4 provides raw data that you can manipulate, play around with, and use to create the logic for the required metrics.

2. GA4 UI segment limitations

GA4 UI segments have some limitations: they are unavailable in standard reports (they work only in exploration reports), and segments cannot be shared across exploration reports. These restrictions make segments cumbersome to use. If your business or client needs a report that requires a segment to be applied every time before analysis or reporting, it makes sense to use BigQuery. For instance, one of our client website data sets includes bot traffic, so we apply the bot conditions to the BigQuery backend data before it reaches the reporting tables. This way, all our client reporting tables exclude known bots.

2.1 How to remove bot traffic from your GA4 data within BigQuery?

The principle here is to check whether the bot conditions specific to your website hold true for each session and to mark those sessions. Once marked, we can simply filter these sessions out of the final output. Let us see how.
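A minimal sketch of that approach (the bot conditions inside the IF are placeholders; your own, site-specific conditions go there):

    WITH events AS (
      SELECT
        CONCAT(user_pseudo_id, '-',
          CAST((SELECT value.int_value FROM UNNEST(event_params)
                WHERE key = 'ga_session_id') AS STRING)) AS session_id,
        device.operating_system AS os,
        geo.country AS country
      FROM `your-project.analytics_123456.events_*`
      WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
    ),
    flagged AS (
      SELECT
        session_id,
        -- mark the session as bot if ANY of its events matches the bot conditions
        MAX(IF(os = 'Linux' AND country = '(not set)', 1, 0)) AS is_bot  -- placeholder conditions
      FROM events
      GROUP BY session_id
    )
    -- keep only the sessions that were not marked
    SELECT COUNT(*) AS clean_sessions
    FROM flagged
    WHERE is_bot = 0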

3. Low occurrence KPIs

If your business or client has conversion events that are low in number (a common use case, as many businesses with a large ticket size or value have low conversion volume), there is a possibility that these conversions are not reported in the GA4 UI because of data thresholding. Some of our clients have these low-occurrence events, so we chose to use BigQuery (as BigQuery has no such limitation).

4. Detailed analysis requiring information at the user and timestamp level

At times, businesses require a detailed, in-depth analysis where we need to drill a dimension down further by multiple other dimensions; this is where BigQuery comes in handy. For instance, user-level behavior by ‘campaign’ and ‘landing page’ for each session, with an ‘event timestamp’, to identify bot traffic. Such an analysis would not only be tedious, it is simply not supported by the GA4 UI; it requires raw data from BigQuery.

For instance, we carried out an exercise to identify bot traffic coming from Bing/CPC traffic for one of our clients. To do so, we analyzed certain fields, such as the event name along with its timestamp, source, medium, device, and operating system, at the session ID level.

Refer to the code below:
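The snippet below is a sketch of such a pull (collected_traffic_source is the event-scoped acquisition field available in newer GA4 exports; table names are placeholders):

    SELECT
      CONCAT(user_pseudo_id, '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING)) AS session_id,
      event_name,
      TIMESTAMP_MICROS(event_timestamp) AS event_time,
      collected_traffic_source.manual_source AS source,
      collected_traffic_source.manual_medium AS medium,
      device.category AS device_category,
      device.operating_system AS operating_system
    FROM `your-project.analytics_123456.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
      AND collected_traffic_source.manual_source = 'bing'
      AND collected_traffic_source.manual_medium = 'cpc'
    ORDER BY session_id, event_time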

Upon analyzing the data, we discovered that a specific conversion event on the site was performed 90 times within a single session, which was a clear indicator of bot behavior that we were successfully able to identify for our client.

How to get started with pulling data from BigQuery?

Step #1 — List down KPIs for the BigQuery data pull

Once you know that your business or client needs BigQuery, the next step is to list down data analysis or reporting use cases that you would want to solve via BigQuery.

  1. Begin by reviewing existing client/business regular and ad hoc reports. Make a list of all dimensions and metrics reported.
  2. Once you have these KPIs, you need to identify which of them are available in the BigQuery backend versus what needs to be calculated and created.

e.g., ‘source’ and ‘medium’ are available dimensions, but ‘channel grouping’ must be created based on one’s business requirements.
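As an illustration, a simplified channel grouping can be built with a CASE expression over session source and medium; the rules below are hypothetical, and the ga4_sessions table is a placeholder for wherever you store one row per session:

    SELECT
      session_source,
      session_medium,
      CASE
        WHEN session_source = '(direct)' AND session_medium IN ('(none)', '(not set)') THEN 'Direct'
        WHEN REGEXP_CONTAINS(session_source, r'google|bing|yahoo') AND session_medium = 'organic' THEN 'Organic Search'
        WHEN session_medium IN ('cpc', 'ppc', 'paidsearch') THEN 'Paid Search'
        WHEN session_medium = 'email' THEN 'Email'
        ELSE 'Other'
      END AS channel_grouping
    FROM `your-project.reporting.ga4_sessions`  -- hypothetical session-level table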

Existing fields within the GA4 BQ export that capture session acquisition, such as session source and session medium, are based on UTM values.

In GA4, the source or medium can differ within the same session; therefore, we select the first source or medium of a session, as sketched below.
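A minimal sketch of that first-value logic (again assuming the event-scoped collected_traffic_source field; names are placeholders):

    SELECT
      CONCAT(user_pseudo_id, '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING)) AS session_id,
      -- first non-null source/medium seen in the session, ordered by event time
      ARRAY_AGG(collected_traffic_source.manual_source IGNORE NULLS
                ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS session_source,
      ARRAY_AGG(collected_traffic_source.manual_medium IGNORE NULLS
                ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS session_medium
    FROM `your-project.analytics_123456.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
    GROUP BY session_id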

Step #2: Visualize the number of backend reporting tables that you would need for the BigQuery backend setup.

For instance, if your business regularly needs to have traffic and conversions by channel, campaign, device, or landing page, you may choose to create three separate tables:

  1. Traffic table — which has channel (source and medium), landing page, campaign, device for engagement metrics like session, engaged sessions, average engagement time on site, bounce rate, etc.
  2. Conversion table — which has the same dimensions as above (in the traffic table) but for conversions.
  3. Page table — putting the ‘page’ dimension in the traffic and conversion tables (which we created above) would increase the table size considerably, since the existing dimensions would be drilled down for each page (and the same page repeated many times due to query string parameters). This could slow down query processing. Also, as these fields are not used frequently in our reports, we kept them in a separate table for any analysis needed.
Snippet showing three reporting tables to transform and store GA4 data across three major groups — Conversions, Pages, Traffic
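As a sketch, the traffic table could be materialized along these lines (dataset and table names are placeholders, and only the sessions metric is shown; engaged sessions, engagement time, etc. follow the same pattern):

    CREATE OR REPLACE TABLE `your-project.reporting.ga4_traffic` AS
    SELECT
      PARSE_DATE('%Y%m%d', event_date) AS date,
      device.category AS device_category,
      collected_traffic_source.manual_source AS source,
      collected_traffic_source.manual_medium AS medium,
      -- sessions = distinct user_pseudo_id + ga_session_id combinations
      COUNT(DISTINCT CONCAT(user_pseudo_id, '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING))) AS sessions
    FROM `your-project.analytics_123456.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
    GROUP BY date, device_category, source, medium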

Step #3 — Data pull

Once your BigQuery data backend is ready, you can connect it to Looker Studio or Google Sheets to get your data out. Both turn out to be good ways to get website performance data for our monthly reports. Depending on your comfort with the platform, you may choose either Looker Studio or Google Sheets.

We prefer Looker Studio as:

  1. So far, we have not come across any limit in Looker Studio on the number of rows pulled from BigQuery (we have pulled around 75K rows a month), whereas Google Sheets has a limit of 50K rows.
  2. Once you set up a table or chart in Looker Studio, you just need to refresh to see the most up-to-date data. Google Sheets is similar, with just 1–2 more steps for a data refresh.
  3. Setting up calculated metrics (like bounce rate, engagement rate, etc.) is a one-time effort, while with Google Sheets pivots, you must be careful to avoid any mistakes.
  4. If you download a Google Sheets report as a CSV, the date format gets messed up, but Excel works fine (though Excel has other issues with some workarounds).

Step #4 — Data validation and QC

Once you have the data available from BigQuery, one last step that you would need to perform is to ensure that the data is good to report (error-free and accurate). You can use the following checklist:

  1. Document and review all the logic that you applied in calculating and creating the various BigQuery dimensions and metrics, e.g., channel grouping logic, landing page logic, session count, unique events, and the average engagement time on site calculation. Review this with the business SPOC to make sure that the business understanding is aligned with the logic.
  2. Verify all segment conditions are applied correctly.
  3. Have a reference point against which to match your data, e.g., the GA UA and GA4 UIs; both can directionally help you identify (if not confirm) data variance. Dimensions where you see a difference greater than 1–2% call for a logic review:

e.g., a breakdown of traffic and unique events (conversions) by source and medium alone helped us identify quite a few backend logics that we could further improve.

Once you have reviewed and validated your data pulled from BigQuery, you are good to report it and analyze it.
