Unlocking Insights: How to Decode Nested Google Analytics Data in BigQuery with SQL
In the realm of digital analytics, Google Analytics and Google BigQuery have emerged as premier tools, empowering brands to glean actionable insights from vast troves of user data. However, to truly tap into this goldmine, one must master the art of SQL, the language of databases. Let’s dive deep into a particularly potent SQL query tailored for Google Analytics datasets within BigQuery.
The Importance of Merging Google Analytics with BigQuery
Google Analytics has cemented its place as the go-to for understanding website user behavior. Meanwhile, Google BigQuery is revolutionizing how businesses handle massive datasets. When you pair Google Analytics’ user behavior tracking capabilities with BigQuery’s big data prowess, you’re setting yourself up for a rich, insightful analytics journey. Combine the two, and you get a powerhouse solution to store, query, and analyze expansive Google Analytics data. But the magic truly happens when you weave in SQL, as showcased by our spotlight query.
Understanding Nested Tables in Google Analytics Data
Google Analytics records a vast array of data points, often in the form of nested tables. This hierarchical structure allows for multi-level data storage, providing a wealth of details about user actions, device properties, geo-locations, and so on. Navigating these nested tables requires a specialized approach in SQL. Today, let’s break down a specialized SQL query designed for this purpose. But first, here is the Google Analytics 4 Database and Table structure. Let’s take a closer look into this -
Once you have finished integrating Google Analytics and Bigquery, you will come across a database with a name similar to this analytics_******
. Within the database, there will be two tables - events_****
is the main table, which stores all data, and events_intraday_*
is the other which is a temporary table for each day. Today, our focus will be on the main table. Within the table schema, you will notice nested columns such as event_params
separated by key and value. Additionally, there will be geo information and item information within nested columns. The details table structure is shown in the below images.
[Note: The events shown will not be the same in your case. It totally depends on the events set up in your Google Analytics Projects.]
Let's take a closer look at how the data is stored in the table. In the image below, there are keys such as ga_session_id
, page_location
, and page_title
, all associated with the event name add_to_cart
. And the values for these keys are stored as either string_value
or int_value
, depending on the data type. The same applies to user_properties
, as it is nested under various keys.
[Note: Each event has a unique key. Before extracting data, you must determine which event and key to track.]
However, the device and geo data are not nested under different keys but remain consistent across all events.
Unfurling the Layers of Our SQL Query
WITH analytics as
(
SELECT
PARSE_DATE('%Y%m%d' ,event_date) event_date,
event_name,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_path,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'percent_scrolled') AS percent_scrolled,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'action_source') AS action_source,
event_value_in_usd,
user_id,
user_pseudo_id,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'name') AS name,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'phone') AS phone,
device.category,
device.mobile_brand_name,
device.operating_system,
device.operating_system_version,
geo.country,
geo.city,
traffic_source.medium,
traffic_source.source,
platform,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'transaction_id') transaction_id,
(SELECT item_id FROM UNNEST(items) ) AS item_id,
(SELECT item_name FROM UNNEST(items) ) AS item_name,
(SELECT item_category FROM UNNEST(items) ) AS item_category,
(SELECT price FROM UNNEST(items) ) AS price,
(SELECT quantity FROM UNNEST(items) ) AS quantity,
(SELECT coupon FROM UNNEST(items) ) AS coupon,
(SELECT affiliation FROM UNNEST(items) ) AS affiliation,
(SELECT item_list_id FROM UNNEST(items) ) AS item_list_id,
(SELECT item_list_name FROM UNNEST(items) ) AS item_list_name
FROM `tenms-userdb.analytics_25********.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230815' AND '20230820'
)
SELECT *
FROM analytics
1. The Power of Common Table Expressions (CTE):
With the WITH
keyword, we define a Common Table Expression named analytics
, acting as a temporary table. CTEs make complex queries more readable and maintainable.
2. Dealing with Nested Data Using UNNEST:
Here comes the key part of our query. Google Analytics data in BigQuery often uses a nested structure. Functions like UNNEST
are crucial for "flattening" these nested tables into something SQL can easily interpret.
For example, UNNEST(event_params)
converts the nested event_params
table into a series of rows that can be individually inspected.
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'name') AS name,
This line takes the nested user_properties
field, unnests it into separate rows, and filters out the row where the key is name
. This is how we extract specific information from the nested tables.
device.operating_system,
geo.country,
The device.operating_system
or geo.country
and similar columns are direct references to a nested field within the device
or geo
structure in the Google Analytics data in BigQuery. It doesn’t require any additional function like UNNEST
because it’s a singular attribute and not an array.
Another type of UNNEST
(SELECT item_id FROM UNNEST(items) ) AS item_id,
- Array vs. Nested Fields: The earlier example with
event_params
was dealing with an array of key-value pairs (nested fields). In that case, each element of the array had akey
and its associatedvalue
. So, to extract a specific piece of data, we had to filter on thekey
. - Array of Structures: In the case of
items
, it's an array of structures, but these structures aren't key-value pairs. Instead, each structure has multiple defined fields likeitem_id
,item_name
, and so on. There's no need to filter on akey
because the field names (likeitem_id
) are explicit and known.
By using UNNEST
on items
, we're breaking down the array into individual rows. After that, we can directly select a specific field (item_id
in this instance) from those rows.
In short, the UNNEST
function is used to handle the array nature of the items
field, while the absence of a key
filter is due to the direct and defined structure of each item within that array.
3. Accurate Date Parsing:
The function PARSE_DATE
converts date strings into date data types, facilitating easier date-based filtering and sorting.
PARSE_DATE('%Y%m%d' ,event_date) event_date,
4. Masterful Data Filtering:
We use WHERE _TABLE_SUFFIX BETWEEN '20230815' AND '20230820'
for filtering specific tables based on their suffixes. This is crucial for date-wise data partitioning in BigQuery.
5. Unleashing the Data:
The last part of the SQL query is responsible for actually pulling all the carefully curated data from our temporary analytics
table.
Turning SQL Queries into Real-World Insights:
The true power of decoding Google Analytics data in BigQuery using SQL extends far beyond just understanding the data’s structure. By adeptly leveraging this procedure, businesses can unearth a wealth of actionable insights tailored to their unique needs. For instance:
- E-commerce Tracking: Pinpoint users who added products to their cart but did not finalize the purchase. Understanding this helps to identify potential obstacles or pain points in the checkout process.
- Geo Analytics: By analyzing geo-locations, one can determine regions or countries generating the highest orders, paving the way for targeted marketing strategies or logistics optimizations.
- User Behavior Analysis: Delve into metrics like session time to unveil insights about user engagement. If a significant number of users have brief session times and don’t proceed past the add-to-cart stage, this might indicate potential website UI/UX issues or other deterrents.
- Customer Retention and Churn: Differentiate between returning customers and those who churn after a few interactions. Analyzing the behaviors of these two groups can foster strategies for customer retention and enhancing user experience.
- A/B Testing Analysis: With the capability to segment user behavior based on different versions of a webpage or feature, businesses can evaluate the effectiveness of A/B tests, driving optimization based on real user reactions.
- Feature Engagement Tracking: Go beyond generic user metrics and hone in on specific feature interactions. By doing so, one can determine which product features are resonating with users and which may require reevaluation or enhancement.
Final Thoughts:
Navigating Google Analytics data within BigQuery through SQL isn’t just a technical exercise — it’s a doorway to transformative business insights. As the digital landscape grows, mastering this trio can set you ahead in the data-driven decision-making game. Happy querying!
Go from SELECT * to interview-worthy project. Get our free 5-page guide.