Google Analytics 4 in Snowflake: SQL Cookbook

How to export GA4 from BigQuery into Snowflake, and translate the cookbook sample queries.


Google Analytics 4 exports event data at the individual user level into BigQuery for free. You might then wonder "How do I move this data into Snowflake, to get all the benefits of the Data Cloud?" and "Is there a quick way to translate the sample BigQuery queries into Snowflake SQL?". That's the goal of this post.

Watch on YouTube

Export GA4 from BigQuery to Snowflake

Export from BigQuery to GCS

Let’s start by exporting the GA4 sample e-commerce dataset.

Snowflake can read files directly from GCS — even if your Snowflake is running on AWS. To bring data from BigQuery to Snowflake you only need to ask BigQuery to export these tables into GCS:

Exporting GA4 sample data in the BigQuery web UI

To export manually, create a bucket in GCS to receive the exported tables. Exporting in Parquet format with snappy compression works well:

bq extract --destination_format=PARQUET --compression=SNAPPY bigquery-public-data:ga4_obfuscated_sample_ecommerce.events_20201202  gs://your-bucket/yourprefix/ga4sample-20201202-*

Prepare your Snowflake account to read from GCS

Setting up Snowflake to read securely from Google Cloud Storage is straightforward: create a storage integration, grant its service account access to the bucket in GCP, and create an external stage that uses it.
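As a sketch, the setup looks like this (the integration name and bucket path are placeholders; the stage name matches the one used below):

```sql
-- Sketch: a storage integration plus an external stage over the bucket
create storage integration fh_gcp_int
  type = external_stage
  storage_provider = gcs
  enabled = true
  storage_allowed_locations = ('gcs://your-bucket/yourprefix/');

-- describe integration fh_gcp_int; -- shows the service account to authorize in GCP

create or replace stage fh_gcp_stage
  url = 'gcs://your-bucket/yourprefix/'
  storage_integration = fh_gcp_int;
```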

Create a table in Snowflake, read the exported Parquet files

list @fh_gcp_stage; -- check files exist

create or replace table ga4_variant(v variant);

copy into ga4_variant
from @fh_gcp_stage/yourprefix/
file_format = (type='PARQUET');

Note that the script above will load all files in that specific folder, while skipping the ones it has already loaded into that specific table. See the LOAD_UNCERTAIN_FILES option for more details.
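If you need Snowflake to also consider files whose load status it can no longer determine (load metadata expires after 64 days), you can set that option directly on the COPY statement:

```sql
copy into ga4_variant
from @fh_gcp_stage/yourprefix/
file_format = (type='PARQUET')
load_uncertain_files = true; -- also load files with expired load metadata
```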

Automate with Snowpipe

You can repeat this process for each new day with new data, but you could also set up Snowflake to automatically load each new file that shows up in GCS — check the docs for Snowpipe.
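A Snowpipe setup might look like this sketch. Note the notification integration name is hypothetical, and auto-ingest from GCS requires a Pub/Sub subscription wired to the bucket's notifications first:

```sql
create pipe ga4_pipe
  auto_ingest = true
  integration = 'GCS_NOTIFICATION_INT' -- hypothetical notification integration
as
copy into ga4_variant
from @fh_gcp_stage/yourprefix/
file_format = (type='PARQUET');
```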

Getting ready to query

Semi-structured data in Snowflake with VARIANT

Data is ready to query now — we created a table with a single column of the VARIANT type. This type in Snowflake natively understands and optimizes semi-structured data, without the need to define a schema:

select distinct v:device.category
, v:traffic_source.medium
, v:user_ltv.currency
from ga4_variant

Not only was it easy to write the query, it also ran in only 147ms. Even better, the new Snowflake web UI automatically produces a summary of the results (shown in a gray column to the right of the result set).

Even though VARIANTs in Snowflake are powerful, we might want to create a view over them to make further querying easier.

Creating a view for the raw GA4 data

This view gives aliases to the VARIANT columns, defines types, and parses timestamps and dates:

create or replace view ga4 as
select to_timestamp(v:event_timestamp::int, 6) event_timestamp
, v:event_dimensions event_dimensions
, to_date(v:event_date::string, 'yyyymmdd') event_date
, v:event_name::string event_name
, v:user_id::string user_id
, v:privacy_info privacy_info
, v:stream_id::int stream_id
, v:event_server_timestamp_offset::int event_server_timestamp_offset
, parse_ga4_objarray(v:user_properties) user_properties
, v:device device
, v:platform::string platform
, parse_ga4_objarray(v:event_params) event_params
, v:event_previous_timestamp::int event_previous_timestamp
, v:geo geo
, v:traffic_source traffic_source
, v:ecommerce ecommerce
, v:items items
, v:event_bundle_sequence_id::int event_bundle_sequence_id
, v:user_ltv user_ltv
, v:event_value_in_usd::float event_value_in_usd
, v:user_pseudo_id::string user_pseudo_id
, to_timestamp(v:user_first_touch_timestamp::int, 6) user_first_touch_timestamp
, v:app_info app_info
from ga4_variant

Using this view, our previous query becomes:

select distinct device:category
, traffic_source:medium
, user_ltv:currency
, geo:country
from ga4;

You might notice that each of these specific columns still has VARIANTs inside, which you navigate in the same way as before.

But there’s another element of this view that we need to get deeper into, the parse_ga4_objarray() UDF.

Saner key-value schemas with a UDF

GA4 has columns with key-value mappings. In BigQuery these were flattened into an awkward repeated-record schema, which makes queries over them verbose.

For example, this is a query that unnests the key-values in BigQuery:

SELECT event_timestamp,
  (SELECT COALESCE(value.int_value, value.float_value, value.double_value)
   FROM UNNEST(event_params)
   WHERE key = 'value'
  ) AS event_value
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = 'purchase'
AND _TABLE_SUFFIX BETWEEN '20201201' AND '20201202';

This is how the same query looks in Snowflake, after we set up the view with the UDF:

SELECT event_timestamp, event_params:value event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

To make this happen our UDF takes the key-values in the BigQuery schema, and brings them back to a simple object schema:

create or replace function parse_ga4_objarray(V variant)
returns variant
language javascript
as
$$
  var result = {};
  for (const x of V) {
    // each element looks like {key: ..., value: {int_value|string_value|...: ...}}
    result[x.key] = x.value ? Object.values(x.value)[0] : null;
  }
  return result;
$$;
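To sanity-check the UDF, you can call it on a small literal array that mimics the GA4 `event_params` shape (hypothetical input):

```sql
select parse_ga4_objarray(parse_json(
  '[{"key":"ga_session_id","value":{"int_value":123}},
    {"key":"page_title","value":{"string_value":"Home"}}]'
)) as params;
-- should return an object like {"ga_session_id": 123, "page_title": "Home"}
```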

2022–03–15 — Updated UDF:

Thanks to Gak Ta for this improved UDF:

create or replace function parse_ga4_objarray_fixed(V variant)
returns variant
language javascript
as
$$
  var result = {};
  for (const x of V) {
    if (x.value) {
      for (const [key, value] of Object.entries(x.value)) {
        // skip the metadata entry, keep the actual typed value
        if (key != 'set_timestamp_micros') {
          result[x.key] = value;
        }
      }
    } else {
      result[x.key] = null;
    }
  }
  return result;
$$;

You can continue to use this view seamlessly, or you can materialize its results if you have huge datasets that you want to optimize for faster queries.
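If you do want to materialize the view's results, a minimal sketch (the table name is hypothetical) is:

```sql
create or replace table ga4_materialized as
select * from ga4;
```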

Translating the sample queries: Basic

With that said, let's now translate the sample queries from BigQuery into Snowflake. Note that instead of a separate table for each day, we have a single table containing multiple dates.
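Since all dates live in one table, BigQuery's `_TABLE_SUFFIX` filters translate into ordinary date predicates. For example, counting events per day in the sample range:

```sql
select event_date, count(*) as events
from ga4
where event_date between '2020-12-01' and '2020-12-02'
group by 1;
```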

Let’s start with the sample basic queries:

Example: Query a specific date range for selected events

WITH UserInfo AS (
  SELECT user_pseudo_id,
    MAX(IFF(event_name IN ('first_visit', 'first_open'), 1, 0)) AS is_new_user
    -- IFF instead of IF
  FROM ga4
  WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
  GROUP BY 1
)
SELECT COUNT(*) AS user_count,
  SUM(is_new_user) AS new_user_count
FROM UserInfo;

Example: Average number of transactions per purchaser

SELECT COUNT(*) / COUNT(DISTINCT user_pseudo_id) AS avg_transaction_per_purchaser
FROM ga4
WHERE event_name IN ('in_app_purchase', 'purchase')
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'

Example: Query values for a specific event name

SELECT event_timestamp, event_params:value::float event_value
-- much easier with the object parser in the view
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

Example: Query total value for a specific event name.

SELECT SUM(event_params:value) event_value
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'

Example: Top 10 items added to cart by most users

SELECT x.value:item_id::string item_id
, x.value:item_name::string item_name
, COUNT(DISTINCT user_pseudo_id) AS user_count
FROM ga4, table(flatten(items)) x
WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
AND event_name IN ('add_to_cart')
GROUP BY 1, 2
ORDER BY user_count DESC
LIMIT 10;

Example: Average number of pageviews by purchaser type

WITH UserInfo AS (
  SELECT user_pseudo_id,
    COUNT_IF(event_name = 'page_view') AS page_view_count,
    -- COUNT_IF instead of COUNTIF
    COUNT_IF(event_name IN ('in_app_purchase', 'purchase')) AS purchase_event_count
  FROM ga4
  WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
  GROUP BY 1
)
SELECT (purchase_event_count > 0) AS purchaser,
  COUNT(*) AS user_count,
  SUM(page_view_count) AS total_page_views,
  SUM(page_view_count) / COUNT(*) AS avg_page_views
FROM UserInfo
GROUP BY 1;

Example: Sequence of pageviews

SELECT user_pseudo_id, event_timestamp,
  event_params:ga_session_id AS ga_session_id,
  event_params:page_location::string AS page_location,
  event_params:page_title::string AS page_title
FROM ga4
WHERE event_name = 'page_view'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
ORDER BY user_pseudo_id, ga_session_id, event_timestamp ASC;

Translating the sample queries: Advanced

Let’s continue with the advanced sample queries:

Example: Products purchased by customers who purchased a specific product

WITH Params AS (
  -- Replace with selected item_name or item_id.
  SELECT 'Google Navy Speckled Tee' AS selected_product
),
PurchaseEvents AS (
  SELECT user_pseudo_id, items
  FROM ga4
  WHERE event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
  AND event_name = 'purchase'
),
ProductABuyers AS (
  SELECT DISTINCT user_pseudo_id
  FROM Params, PurchaseEvents
  , table(flatten(items)) items
  -- flatten instead of unnest
  WHERE items.value:item_name = selected_product
  -- items.value:item_id can be used instead of items.value:item_name.
)
SELECT items.value:item_name::string AS item_name
, SUM(items.value:quantity) AS item_quantity
FROM Params
, PurchaseEvents
, table(flatten(items)) AS items
WHERE user_pseudo_id IN (SELECT user_pseudo_id FROM ProductABuyers)
-- items.value:item_id can be used instead of items.value:item_name
AND items.value:item_name != selected_product
GROUP BY 1
ORDER BY item_quantity DESC;

Example: Average amount of money spent per purchase session by user

SELECT user_pseudo_id
, COUNT(DISTINCT(event_params:ga_session_id)) AS session_count
, AVG((event_params:value)) AS avg_spend_per_session_by_user
-- much simpler with our simplified schema
FROM ga4
WHERE event_name = 'purchase'
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02'
GROUP BY 1;

Example: Get the latest ga_session_id and ga_session_number for specific users during the last 4 days

The results of this query will depend on how you want to manage timezones:

-- Replace timezone. List at
set REPORTING_TIMEZONE = 'America/Los_Angeles';
-- Replace list of user_pseudo_id's with ones you want to query.
set USER_PSEUDO_ID_LIST = '1005326.4012506369,1009730.0442567259,1029388.2450501039';

-- TODO: Incorporate timezone adjustments to BQ sample
CREATE FUNCTION GetDateSuffix(date_shift int, timezone STRING)
returns string
as
$$
(SELECT '') --FORMAT_DATE('%Y%m%d', DATE_ADD(CURRENT_DATE(timezone), INTERVAL date_shift DAY)))
$$;

SELECT DISTINCT user_pseudo_id,
  FIRST_VALUE(event_params:ga_session_id) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_id,
  FIRST_VALUE(event_params:ga_session_number) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS ga_session_number
FROM ga4
WHERE array_contains(user_pseudo_id::variant, split($USER_PSEUDO_ID_LIST, ','))
AND event_timestamp BETWEEN '2020-12-01' AND '2020-12-02';

Next steps

Thanks to

I’ve been working with Minhaz Kazi, Developer Advocate at Google for Google Analytics, to write this post. We’d love to keep working together to help you, so let us know if you have any further ideas or needs.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check for the most interesting Snowflake news.



