How to work with Google Analytics data in BigQuery

Jimmy Pang
Published in The Startup
9 min read · Jun 29, 2020

Preface

Google Analytics is widely used across industries for tracking user activities in web and mobile applications. For most users, the default reporting provided by GA is sufficient, as long as the tracking of the application is in place (if not, go for custom reporting! :p ).

Yet, once the business grows to a certain scale (20–50 employees or even bigger), there comes a need to have the GA data available in a Data Warehouse for further in-depth analysis and even Machine Learning purposes.

Setup

The GA public dataset from BigQuery is used for illustration:

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`

And a preview of the data looks like this:

The blanks come from the ARRAY columns, which we will discuss later

And for ease of demonstration, we will also assume that the Data Warehouse is on BigQuery rather than another technology (e.g. Redshift or Snowflake).

It would be very nice if you have a GCP (Google Cloud Platform) project up and running already, so you can follow along with this guide and gain a better understanding.

Getting Started — Build a View in the Data Layer

The official documentation from Google can be found here: GA BigQuery Export schema.

The highlights are the visitStartTime column and the name of the table bigquery-public-data.google_analytics_sample.ga_sessions_20170801. In short, the naming pattern of the BigQuery tables is {GCP project name}.{Dataset name}.{Table name}.

As for the BigQuery tables from the native GA export, they are actually Sharded Tables. Sharded tables are separate table objects that all follow the exact same naming convention.

In this context, it would be a series of tables like this:

  • bigquery-public-data.google_analytics_sample.ga_sessions_20170801
  • bigquery-public-data.google_analytics_sample.ga_sessions_20170731
  • bigquery-public-data.google_analytics_sample.ga_sessions_20170730

etc.

How sharded tables look in the BigQuery Web Console

To query one of the tables, the code is pretty simple:

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`

Interesting question: What would you do if you wanna query more than 1 table with the exact same schema?

Answer: UNION ALL

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
UNION ALL
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170731`

The query above allows us to query data from 31 July to 1 August 2017. But obviously it doesn’t help a lot with actual business use cases. We would need MUCH more data than 2 days.

To tackle this, BigQuery provides the solution of Wildcard Tables. It means we could do something like this:

SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

Yup, the magic is the * after the ga_sessions_.
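
The wildcard also exposes a pseudo-column called _TABLE_SUFFIX, which holds the part of the table name matched by the *. That means the scan can be restricted to a date range without writing out every table, as a quick sketch against the same public dataset:

```sql
-- Query only 31 July to 1 August 2017 by filtering on the table suffix
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170731' AND '20170801'
```

Filtering on _TABLE_SUFFIX also prunes the tables BigQuery has to read, so it is cheaper than scanning the whole wildcard and filtering afterwards.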

A general business user would think this is good enough already. But for a data professional (especially a data engineer), it isn’t.

To make it more maintainable and more readable by humans, we need 2 more columns here: created_at & created_date.

Why? Because the schema of the GA tables doesn't provide the date in DATE format: the date column in the tables is a STRING. To align with the partitioning of the other tables in the Data Warehouse, the created_date and created_at columns are necessary.

Here is a sample query of running against the GA tables:

WITH ga_data AS (
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS created_date
, TIMESTAMP_SECONDS(visitStartTime) AS created_at
, *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
)
SELECT *
FROM ga_data
WHERE 1=1
AND created_date = '2017-08-01'

The basic idea here is to leverage the non-materialized nature of CTEs in BigQuery, making it easier to build queries on top of them.

PARSE_DATE is a function that takes a STRING value and returns a DATE value. In this case, we take the suffix of the sharded tables and parse it into a DATE. By doing so, we can easily control which date's data to fetch and play with.
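
As a minimal illustration of the function on its own:

```sql
-- '%Y%m%d' matches the 8-digit suffix of the sharded tables
SELECT PARSE_DATE('%Y%m%d', '20170801') AS created_date
-- returns DATE '2017-08-01'
```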

Assuming we have a dataset named dl (as in Data Layer) in the GCP project, we can create the View for the GA data with DDL:

CREATE OR REPLACE VIEW `dl.ga_sessions`
AS
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS created_date
, TIMESTAMP_SECONDS(visitStartTime) AS created_at
, *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

To consume data from the view, simply run this query:

SELECT *
FROM `dl.ga_sessions`
WHERE 1=1
AND created_date = '2017-08-01'

Note that the project name is not a must in the query. If it is omitted, BigQuery falls back to the GCP project you are running the query in and looks for the View there.
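
For reference, the fully qualified version would look like this, where my-gcp-project is a hypothetical project name standing in for your own:

```sql
-- Fully qualified: {GCP project name}.{Dataset name}.{View name}
SELECT *
FROM `my-gcp-project.dl.ga_sessions`
WHERE 1=1
AND created_date = '2017-08-01'
```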

Bonus — a little trick of SQL coding

If you are wondering what 1=1 does: it is just a personal coding signature, like Leonardo da Vinci's signature on his artworks ;)

Jokes aside, some people use TRUE instead:

WITH ga_data AS (
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS created_date
, TIMESTAMP_SECONDS(visitStartTime) AS created_at
, *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
)
SELECT *
FROM ga_data
WHERE TRUE
AND created_date = '2017-08-01'

The idea is the same: push the created_date = '2017-08-01' condition onto its own line, so it is much easier to comment it out when playing around with the data:

WITH ga_data AS (
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS created_date
, TIMESTAMP_SECONDS(visitStartTime) AS created_at
, *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
)
SELECT *
FROM ga_data
WHERE 1=1
-- AND created_date = '2017-08-01'
AND created_date = '2017-07-31'

This technique is pretty handy in the EDA (exploratory data analysis) phase, especially when you do not know what to expect in the data. Personally I prefer 1=1 over TRUE, as it is shorter and consists of two 1s, which makes it even faster to type.

Working with Custom Dimensions

If the company is getting sophisticated with its GA tracking, the default columns from BigQuery will not be sufficient: we need more than that. And this is where Custom Dimensions & Metrics come in.

A typical case would be userId. GA does provide an identifier to "identify a user" by default: clientId. Yet clientId is pretty unstable by nature and has its limitations. For example, the same person using Chrome and Firefox will have two different clientId values. In this case, it makes more sense to assign that person an identifier instead. Ideally, it would be a hashed login credential, for GDPR compliance.

Another typical use case would be geo-location. The geo tracking of GA is affected by many factors, such as VPNs. For example, if a user from Germany uses a VPN routed through Sweden, GA records the session as coming from Sweden. In this case, it makes more sense to track other information that tells where the user is from, e.g. the URL of the site, the domain name of the user's email (if available), etc.

Assuming we already have a Custom Dimension tracked, and its index is 1, we can fetch it with a scalar subquery.

SELECT *
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 1) AS user_id
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

This scalar subquery works because the UNNEST clause flattens the ARRAY of customDimensions and returns a table, which makes the querying possible. In layman's terms: the subquery is used as a column and can return ONLY 1 value. It will break if it returns more than that.
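
If there is any chance the ARRAY holds more than one value for the same index, one way to keep the subquery scalar is to aggregate. A sketch (MAX is an arbitrary tie-breaker here; pick whatever suits your data):

```sql
-- MAX collapses multiple matches into a single value,
-- so the subquery can never return more than one row
SELECT *
, (SELECT MAX(value) FROM UNNEST(customDimensions) WHERE index = 1) AS user_id
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
```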

And if the Custom Dimensions grow and you need to cover a lot of them in the same query, a UDF (user-defined function) solution is suggested here.

In code, it would be something like this:

CREATE TEMP FUNCTION extract_custom_dimensions(_index INT64, _properties ANY TYPE) AS (
(SELECT value FROM UNNEST(_properties) WHERE index = _index)
);
SELECT *
, extract_custom_dimensions(1, customDimensions) AS user_id
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

Assume we have a new Custom Dimension named country_code in Session scope with index 2; the code would look like this:

CREATE TEMP FUNCTION extract_custom_dimensions(_index INT64, _properties ANY TYPE) AS (
(SELECT value FROM UNNEST(_properties) WHERE index = _index)
);
SELECT *
, extract_custom_dimensions(1, customDimensions) AS user_id
, extract_custom_dimensions(2, customDimensions) AS country_code
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

It may look redundant at the moment, but once the Custom Dimensions grow to 30+, the UDF makes a lot of sense. Otherwise, you are gonna face queries like this:

SELECT *
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 1) AS user_id
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 2) AS cd_2
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 3) AS cd_3
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 4) AS cd_4
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 5) AS cd_5
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 6) AS cd_6
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 7) AS cd_7
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 8) AS cd_8
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 9) AS cd_9
, (SELECT value FROM UNNEST(customDimensions) WHERE index = 10) AS cd_10
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

This is a lot of duplicated code, and the business logic is not very clear to laymen. It can be improved with a UDF:

CREATE TEMP FUNCTION extract_custom_dimensions(_index INT64, _properties ANY TYPE) AS (
(SELECT value FROM UNNEST(_properties) WHERE index = _index)
);
SELECT *
, extract_custom_dimensions(1, customDimensions) AS user_id
, extract_custom_dimensions(2, customDimensions) AS cd_2
, extract_custom_dimensions(3, customDimensions) AS cd_3
, extract_custom_dimensions(4, customDimensions) AS cd_4
, extract_custom_dimensions(5, customDimensions) AS cd_5
, extract_custom_dimensions(6, customDimensions) AS cd_6
, extract_custom_dimensions(7, customDimensions) AS cd_7
, extract_custom_dimensions(8, customDimensions) AS cd_8
, extract_custom_dimensions(9, customDimensions) AS cd_9
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`

The query could be further shortened with a shorter UDF name, but I would keep it this way for better readability.

Hits in BigQuery

According to the official doc from Google, a Hit is:

An interaction that results in data being sent to Analytics. Common hit types include page tracking hits, event tracking hits, and ecommerce hits.

In business terms, there will be many Hits in a Session. To reflect this logic, Hits are recorded as an ARRAY in the BigQuery tables.

Depending on the use case, there are multiple ways to work with the arrays:

  • Flatten the table from Sessions level to Hits level
  • Fetch data from only the first hit
  • Get the numbers of Hits that fit certain condition(s)

Flattening the table is actually pretty straightforward:

SELECT s.*
, h.* EXCEPT(customDimensions)
, h.customDimensions AS hits_customDimensions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` s
CROSS JOIN UNNEST(hits) h

UNNEST returns a table, which is assigned the alias h. The EXCEPT part avoids duplicated column names, since there is also a customDimensions column at the Session level; hence we rename the Hit-level column to keep both of them.

To fetch data from the first Hit, it is also simple:

SELECT *
, hits[OFFSET(0)] AS first_hit
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`

OFFSET is an ARRAY subscript operator: it returns the element at the given (zero-based) index.
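
One caveat: OFFSET raises an error if the index is out of range (e.g. on an empty array). BigQuery also provides SAFE_OFFSET, which returns NULL instead of failing. A sketch that also drills into a field of the first Hit:

```sql
-- SAFE_OFFSET returns NULL for sessions whose hits array is empty
SELECT *
, hits[SAFE_OFFSET(0)].page.pagePath AS first_page_path
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
```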

And we can also run aggregations over the ARRAY with scalar subqueries. Here is an example counting the number of Hits that are interactions, per session:

SELECT *
, (SELECT COUNT(*) FROM UNNEST(hits) WHERE isInteraction) AS hits_is_interaction
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`

Conclusion

GA data is a lot of fun to work with (IF you ignore the common tracking issues caused by connectivity, VPNs, etc.), and the non-relational nature of BigQuery allows the GA data to be stored as ARRAYs to avoid excessive duplication. All in all, there are a lot of neat tricks that can be done with ARRAY, such as CROSS JOIN + UNNEST and scalar subqueries. There are also many functions in BigQuery, like ARRAY_AGG and SPLIT, that bring more flexibility and possibility to different kinds of data tasks. So it is pretty handy to get familiar with the ways of working with ARRAY in BigQuery, especially in the context of GA data.
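
As a parting sketch of that flexibility, ARRAY_AGG can roll the flattened Hits back up into one row per session (fullVisitorId and visitId together identify a session in the export schema):

```sql
-- Collect the page path of every Hit into an ARRAY per session
SELECT s.fullVisitorId
, s.visitId
, ARRAY_AGG(h.page.pagePath) AS page_paths
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` s
CROSS JOIN UNNEST(hits) h
GROUP BY s.fullVisitorId, s.visitId
```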

About the Author

Jimmy Pang is a Business Intelligence Analyst specializing in data visualization, currently pursuing the journey of Data Science.


Data Leader, Evangelist and Educator, dedicated to the data journey. Interested in tech and classy stuffs: art, whiskey, coffee, tea, spirituality, history etc.