Google Analytics 4 in Snowflake: SQL Cookbook

How to export GA4 from BigQuery into Snowflake, and translate the cookbook sample queries.

Felipe Hoffa
Nov 15
[Image: picture generated by AI (VQGAN+CLIP)]

Google Analytics 4 exports event data at the individual user level into BigQuery for free. You might then be wondering “How do I move this data into Snowflake, to get all the benefits of the Data Cloud?” and “Is there a quick way to translate the sample BigQuery queries into Snowflake SQL?”. That’s the goal of this post.

Export GA4 from BigQuery to Snowflake

Let’s start by exporting the GA4 sample e-commerce dataset.

Snowflake can read files directly from GCS — even if your Snowflake is running on AWS. To bring data from BigQuery to Snowflake you only need to ask BigQuery to export these tables into GCS:

Exporting GA4 sample data in the BigQuery web UI

To export manually, create a bucket in GCS to receive the exported tables. Exporting in Parquet format with snappy compression works well:

bq extract --destination_format=PARQUET --compression=SNAPPY bigquery-public-data:ga4_obfuscated_sample_ecommerce.events_20201202  gs://your-bucket/yourprefix/ga4sample-20201202-*
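GA4 exports one events_YYYYMMDD table per day, so in practice you end up running one `bq extract` per daily table. As a rough sketch, here's one way to generate those commands for a date range in Python — the `extract_commands` helper name is mine, and the bucket and prefix are the same placeholders as above:

```python
from datetime import date, timedelta

def extract_commands(start: date, end: date,
                     bucket: str = "your-bucket", prefix: str = "yourprefix"):
    """Yield one `bq extract` command per daily GA4 table in [start, end]."""
    day = start
    while day <= end:
        suffix = day.strftime("%Y%m%d")
        yield (
            "bq extract --destination_format=PARQUET --compression=SNAPPY "
            f"bigquery-public-data:ga4_obfuscated_sample_ecommerce.events_{suffix} "
            f"gs://{bucket}/{prefix}/ga4sample-{suffix}-*"
        )
        day += timedelta(days=1)

for cmd in extract_commands(date(2020, 12, 1), date(2020, 12, 2)):
    print(cmd)
```

You could pipe the generated commands into a shell, or run them with `subprocess` once you've verified the destinations look right.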

Setting up Snowflake to read securely from Google Cloud Storage is straightforward; just follow these steps:

list @fh_gcp_stage; -- check files exist

create or replace table ga4_variant(v variant);

copy into ga4_variant
from @fh_gcp_stage/yourprefix/
pattern='yourprefix/ga4sample-.*'
file_format = (type='PARQUET');

Note that the script above loads all files in that folder, skipping any it has already loaded into that table. See the LOAD_UNCERTAIN_FILES copy option for more details.

You can repeat this process for each new day with new data, but you could also set up Snowflake to automatically load each new file that shows up in GCS — check the docs for Snowpipe.

Getting ready to query

The data is now ready to query — we created a table with a single column of the VARIANT type. This Snowflake type natively understands and optimizes semi-structured data, without the need to define a schema:

select distinct v:device.category
, v:traffic_source.medium
, v:user_ltv.currency
, v:geo.country
from ga4_variant

Not only was the query easy to write, it also ran in only 147 ms. Even better, the new Snowflake web UI automatically produces a summary of the results, as seen in the gray column on the right.

Even though VARIANTs in Snowflake are powerful, we might want to create a view over them to make further querying easier.

This view gives aliases to the VARIANT columns, defines types, and parses timestamps and dates:

create or replace view ga4 as
select to_timestamp(v:event_timestamp::int, 6) event_timestamp
, v:event_dimensions event_dimensions
, to_date(v:event_date::string, 'yyyymmdd') event_date
, v:event_name::string event_name
, v:user_id::string user_id
, v:privacy_info privacy_info
, v:stream_id::int stream_id
, v:event_server_timestamp_offset::int event_server_timestamp_offset
, parse_ga4_objarray(v:user_properties) user_properties
, v:device device
, v:platform::string platform
, parse_ga4_objarray(v:event_params) event_params
, v:event_previous_timestamp::int event_previous_timestamp
, v:geo geo
, v:traffic_source traffic_source
, v:ecommerce ecommerce
, v:items items
, v:event_bundle_sequence_id::int event_bundle_sequence_id
, v:user_ltv user_ltv
, v:event_value_in_usd::float event_value_in_usd
, v:user_pseudo_id::string user_pseudo_id
, to_timestamp(v:user_first_touch_timestamp::int, 6) user_first_touch_timestamp
, v:app_info app_info
from ga4_variant
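GA4 stores `event_timestamp` as microseconds since the Unix epoch and `event_date` as a `yyyymmdd` string; the `to_timestamp(..., 6)` and `to_date(..., 'yyyymmdd')` calls in the view undo that encoding. A quick sketch of the same two conversions in Python, using a made-up timestamp value:

```python
from datetime import datetime, timezone

# Values shaped like the GA4 export (the timestamp here is hypothetical):
event_timestamp = 1606780800000000   # microseconds since the Unix epoch
event_date = "20201201"              # yyyymmdd string

# to_timestamp(v, 6): interpret the integer as microseconds
ts = datetime.fromtimestamp(event_timestamp / 1_000_000, tz=timezone.utc)
# to_date(v, 'yyyymmdd'): parse the compact date string
d = datetime.strptime(event_date, "%Y%m%d").date()

print(ts.isoformat())  # 2020-12-01T00:00:00+00:00
print(d.isoformat())   # 2020-12-01
```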

Using this view, our previous query becomes:

select distinct device:category
, traffic_source:medium
, user_ltv:currency
, geo:country
from ga4;

You might notice that each of these specific columns still has VARIANTs inside, which you navigate in the same way as before.

But there’s another element of this view we need to dig into: the parse_ga4_objarray() UDF.

GA4 has columns with key-value mappings. In BigQuery these get a complicated nested schema, which makes queries over them verbose.

For example, this is a query that unnests the key-values in BigQuery:

SELECT event_timestamp,
(
SELECT COALESCE(value.int_value, value.float_value, value.double_value)
FROM UNNEST(event_params)
WHERE key = 'value'
) AS event_value
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = 'purchase'
AND _TABLE_SUFFIX BETWEEN '20201201' AND '20201202';

This is how the same query looks in Snowflake, after we set up the view with the UDF:

SELECT event_timestamp, event_params:value event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

To make this happen, our UDF takes the key-values in the BigQuery schema and converts them back to a simple object:

create or replace function parse_ga4_objarray(V variant)
returns variant
language javascript
AS
$$
  // Collapse the array of {key, value} pairs into a plain object,
  // taking the one populated typed field inside each value.
  var result = {};
  for (const x of V) {
    result[x.key] = x.value ? Object.values(x.value)[0] : null;
  }
  return result;
$$;
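To see what the UDF does, here's the same transform sketched in Python, run over a made-up `event_params` array shaped like the BigQuery export:

```python
# event_params as BigQuery exports them: an array of {key, value} pairs,
# where each value is an object with one populated typed field.
# (These particular params are hypothetical examples.)
event_params = [
    {"key": "ga_session_id", "value": {"int_value": 123}},
    {"key": "page_title", "value": {"string_value": "Home"}},
    {"key": "value", "value": {"double_value": 64.0}},
]

def parse_ga4_objarray(params):
    """Collapse the key/typed-value array into a plain {key: value} object,
    mirroring the JavaScript UDF above."""
    result = {}
    for x in params:
        values = list(x["value"].values()) if x.get("value") else []
        result[x["key"]] = values[0] if values else None
    return result

print(parse_ga4_objarray(event_params))
# {'ga_session_id': 123, 'page_title': 'Home', 'value': 64.0}
```

With this shape, `event_params:ga_session_id` in Snowflake replaces a whole UNNEST subquery in BigQuery.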

You can continue to use this view seamlessly, or you can materialize its results if you have huge datasets that you want to optimize for faster queries.

Translating the sample queries: Basic

Now let’s translate the sample queries from BigQuery into Snowflake. Note that we won’t use a separate table for each day; instead we have a single table spanning multiple dates.

Let’s start with the sample basic queries:

WITH
UserInfo AS (
SELECT
user_pseudo_id,
MAX(IFF(event_name IN ('first_visit', 'first_open'), 1, 0)) AS is_new_user
-- IFF instead of IF
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1
)
SELECT
COUNT(*) AS user_count,
SUM(is_new_user) AS new_user_count
FROM UserInfo;

SELECT COUNT(*) / COUNT(DISTINCT user_pseudo_id) AS avg_transaction_per_purchaser
FROM ga4
WHERE event_name IN ('in_app_purchase', 'purchase')
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

SELECT event_timestamp, event_params:value::float event_value
-- much easier with the object parser in the view
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

SELECT SUM(event_params:value) event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

SELECT x.value:item_id::string item_id
, x.value:item_name::string item_name
, COUNT(DISTINCT user_pseudo_id) AS user_count
FROM ga4, table(flatten(items)) x
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
AND event_name IN ('add_to_cart')
GROUP BY 1, 2
ORDER BY user_count DESC
LIMIT 10;

WITH
UserInfo AS (
SELECT
user_pseudo_id,
COUNT_IF(event_name = 'page_view') AS page_view_count,
-- COUNT_IF instead of COUNTIF
COUNT_IF(event_name IN ('in_app_purchase', 'purchase')) AS purchase_event_count
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1
)
SELECT
(purchase_event_count > 0) AS purchaser,
COUNT(*) AS user_count,
SUM(page_view_count) AS total_page_views,
SUM(page_view_count) / COUNT(*) AS avg_page_views
FROM UserInfo
GROUP BY 1;

SELECT
user_pseudo_id,
event_timestamp,
event_params:ga_session_id AS ga_session_id,
event_params:page_location::string AS page_location,
event_params:page_title::string AS page_title
FROM ga4
WHERE event_name = 'page_view'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
ORDER BY user_pseudo_id, ga_session_id, event_timestamp ASC;

Translating the sample queries: Advanced

Let’s continue with the advanced sample queries:

WITH
Params AS (
-- Replace with selected item_name or item_id.
SELECT 'Google Navy Speckled Tee' AS selected_product
),
PurchaseEvents AS (
SELECT user_pseudo_id, items
FROM ga4
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
AND event_name = 'purchase'
),
ProductABuyers AS (
SELECT DISTINCT user_pseudo_id
FROM Params, PurchaseEvents
, table(flatten(items)) items
-- flatten instead of unnest
WHERE items.value:item_name = selected_product
-- item.item_id can be used instead of items.item_name.
)
SELECT items.value:item_name::string AS item_name
, SUM(items.value:quantity) AS item_quantity
FROM
Params
, PurchaseEvents
, table(flatten(items)) AS items
WHERE
user_pseudo_id IN (SELECT user_pseudo_id FROM ProductABuyers)
-- item.item_id can be used instead of items.item_name
AND items.value:item_name != selected_product
GROUP BY 1
ORDER BY item_quantity DESC;

SELECT user_pseudo_id
, COUNT(DISTINCT event_params:ga_session_id) AS session_count
, AVG(event_params:value::float) AS avg_spend_per_session_by_user
-- much simpler with our simplified schema
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1;
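For intuition, the same per-user aggregation can be sketched in Python over a few made-up purchase events (field names follow the view above; the user ids and values are hypothetical):

```python
from collections import defaultdict

# Purchase events already flattened by the view + UDF above:
events = [
    {"user_pseudo_id": "u1", "event_params": {"ga_session_id": 1, "value": 20.0}},
    {"user_pseudo_id": "u1", "event_params": {"ga_session_id": 2, "value": 10.0}},
    {"user_pseudo_id": "u2", "event_params": {"ga_session_id": 7, "value": 5.0}},
]

sessions = defaultdict(set)   # distinct ga_session_id per user
spend = defaultdict(float)    # total purchase value per user
counts = defaultdict(int)     # number of purchase events per user

for e in events:
    u = e["user_pseudo_id"]
    sessions[u].add(e["event_params"]["ga_session_id"])
    spend[u] += e["event_params"]["value"]
    counts[u] += 1

for u in sessions:
    print(u, len(sessions[u]), spend[u] / counts[u])
# u1 2 15.0
# u2 1 5.0
```

Note that, like the SQL above, the average is taken over purchase events rather than over sessions.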

The results of this query will depend on how you want to manage timezones:

-- Replace timezone. List at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones.
set REPORTING_TIMEZONE = 'America/Los_Angeles';
-- Replace list of user_pseudo_id's with ones you want to query.
set USER_PSEUDO_ID_LIST = '1005326.4012506369,1009730.0442567259,1029388.2450501039';
// TODO: Incorporate timezone adjustments to BQ sample
CREATE FUNCTION GetDateSuffix(date_shift int, timezone STRING)
returns string
as
$$
(SELECT '') --FORMAT_DATE('%Y%m%d', DATE_ADD(CURRENT_DATE(timezone), INTERVAL date_shift DAY)))
$$;
SELECT DISTINCT
user_pseudo_id,
FIRST_VALUE(event_params:ga_session_id) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_id,
FIRST_VALUE(event_params:ga_session_number) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_number
FROM ga4
WHERE array_contains(user_pseudo_id::variant, split($USER_PSEUDO_ID_LIST, ','))
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'

Thanks to

I’ve been working with Minhaz Kazi, Developer Advocate for Google Analytics at Google, to write this post. We’d love to keep working together to help you, so let us know if you have any further ideas or needs.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.

Snowflake

Articles for engineers, by engineers.


Felipe Hoffa

Written by

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.
