Learning SQL

Focused on helping you learn SQL

Unlocking Insights: How to Decode Nested Google Analytics Data in BigQuery with SQL

--

In the realm of digital analytics, Google Analytics and Google BigQuery have emerged as premier tools, empowering brands to glean actionable insights from vast troves of user data. However, to truly tap into this goldmine, one must master the art of SQL, the language of databases. Let’s dive deep into a particularly potent SQL query tailored for Google Analytics datasets within BigQuery.

Image by Author

The Importance of Merging Google Analytics with BigQuery

Google Analytics has cemented its place as the go-to for understanding website user behavior. Meanwhile, Google BigQuery is revolutionizing how businesses handle massive datasets. When you pair Google Analytics’ user behavior tracking capabilities with BigQuery’s big data prowess, you’re setting yourself up for a rich, insightful analytics journey. Combine the two, and you get a powerhouse solution to store, query, and analyze expansive Google Analytics data. But the magic truly happens when you weave in SQL, as showcased by our spotlight query.

Understanding Nested Tables in Google Analytics Data

Google Analytics records a vast array of data points, often in the form of nested tables. This hierarchical structure allows for multi-level data storage, providing a wealth of details about user actions, device properties, geo-locations, and so on. Navigating these nested tables requires a specialized approach in SQL. Today, let’s break down a specialized SQL query designed for this purpose. But first, here is the Google Analytics 4 Database and Table structure. Let’s take a closer look into this -

Once you have finished integrating Google Analytics and Bigquery, you will come across a database with a name similar to this analytics_******. Within the database, there will be two tables - events_**** is the main table, which stores all data, and events_intraday_* is the other which is a temporary table for each day. Today, our focus will be on the main table. Within the table schema, you will notice nested columns such as event_params separated by key and value. Additionally, there will be geo information and item information within nested columns. The details table structure is shown in the below images.

[Note: The events shown will not be the same in your case. It totally depends on the events set up in your Google Analytics Projects.]

Google Analytics 4 BigQuery Database structure.
Google Analytics 4 BigQuery Database
Nested Schema of Google Analytics Table

Let's take a closer look at how the data is stored in the table. In the image below, there are keys such as ga_session_id, page_location, and page_title, all associated with the event name add_to_cart. And the values for these keys are stored as either string_value or int_value, depending on the data type. The same applies to user_properties, as it is nested under various keys.

[Note: Each event has a unique key. Before extracting data, you must determine which event and key to track.]

The nested data structure of Google Analytics Table

However, the device and geo data are not nested under different keys but remain consistent across all events.

The nested data structure of Google Analytics Table -2

Unfurling the Layers of Our SQL Query

WITH analytics as 
(
SELECT
PARSE_DATE('%Y%m%d' ,event_date) event_date,
event_name,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_path,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'percent_scrolled') AS percent_scrolled,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'action_source') AS action_source,
event_value_in_usd,
user_id,
user_pseudo_id,

(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'name') AS name,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'phone') AS phone,
device.category,
device.mobile_brand_name,
device.operating_system,
device.operating_system_version,
geo.country,
geo.city,

traffic_source.medium,
traffic_source.source,
platform,

(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'transaction_id') transaction_id,

(SELECT item_id FROM UNNEST(items) ) AS item_id,
(SELECT item_name FROM UNNEST(items) ) AS item_name,
(SELECT item_category FROM UNNEST(items) ) AS item_category,
(SELECT price FROM UNNEST(items) ) AS price,
(SELECT quantity FROM UNNEST(items) ) AS quantity,
(SELECT coupon FROM UNNEST(items) ) AS coupon,
(SELECT affiliation FROM UNNEST(items) ) AS affiliation,
(SELECT item_list_id FROM UNNEST(items) ) AS item_list_id,
(SELECT item_list_name FROM UNNEST(items) ) AS item_list_name

FROM `tenms-userdb.analytics_25********.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230815' AND '20230820'
)

SELECT *
FROM analytics

1. The Power of Common Table Expressions (CTE):

With the WITH keyword, we define a Common Table Expression named analytics, acting as a temporary table. CTEs make complex queries more readable and maintainable.

2. Dealing with Nested Data Using UNNEST:

Here comes the key part of our query. Google Analytics data in BigQuery often uses a nested structure. Functions like UNNEST are crucial for "flattening" these nested tables into something SQL can easily interpret.

For example, UNNEST(event_params) converts the nested event_params table into a series of rows that can be individually inspected.

(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'name') AS name,

This line takes the nested user_properties field, unnests it into separate rows, and filters out the row where the key is name. This is how we extract specific information from the nested tables.

device.operating_system,
geo.country,

The device.operating_system or geo.country and similar columns are direct references to a nested field within the device or geostructure in the Google Analytics data in BigQuery. It doesn’t require any additional function like UNNEST because it’s a singular attribute and not an array.

Another type of UNNEST

(SELECT item_id FROM UNNEST(items) ) AS item_id,
  • Array vs. Nested Fields: The earlier example with event_params was dealing with an array of key-value pairs (nested fields). In that case, each element of the array had a key and its associated value. So, to extract a specific piece of data, we had to filter on the key.
  • Array of Structures: In the case of items, it's an array of structures, but these structures aren't key-value pairs. Instead, each structure has multiple defined fields like item_id, item_name, and so on. There's no need to filter on a key because the field names (like item_id) are explicit and known.

By using UNNEST on items, we're breaking down the array into individual rows. After that, we can directly select a specific field (item_id in this instance) from those rows.

In short, the UNNEST function is used to handle the array nature of the items field, while the absence of a key filter is due to the direct and defined structure of each item within that array.

3. Accurate Date Parsing:

The function PARSE_DATE converts date strings into date data types, facilitating easier date-based filtering and sorting.

PARSE_DATE('%Y%m%d' ,event_date) event_date,

4. Masterful Data Filtering:

We use WHERE _TABLE_SUFFIX BETWEEN '20230815' AND '20230820' for filtering specific tables based on their suffixes. This is crucial for date-wise data partitioning in BigQuery.

5. Unleashing the Data:

The last part of the SQL query is responsible for actually pulling all the carefully curated data from our temporary analytics table.

Turning SQL Queries into Real-World Insights:

The true power of decoding Google Analytics data in BigQuery using SQL extends far beyond just understanding the data’s structure. By adeptly leveraging this procedure, businesses can unearth a wealth of actionable insights tailored to their unique needs. For instance:

  • E-commerce Tracking: Pinpoint users who added products to their cart but did not finalize the purchase. Understanding this helps to identify potential obstacles or pain points in the checkout process.
  • Geo Analytics: By analyzing geo-locations, one can determine regions or countries generating the highest orders, paving the way for targeted marketing strategies or logistics optimizations.
  • User Behavior Analysis: Delve into metrics like session time to unveil insights about user engagement. If a significant number of users have brief session times and don’t proceed past the add-to-cart stage, this might indicate potential website UI/UX issues or other deterrents.
  • Customer Retention and Churn: Differentiate between returning customers and those who churn after a few interactions. Analyzing the behaviors of these two groups can foster strategies for customer retention and enhancing user experience.
  • A/B Testing Analysis: With the capability to segment user behavior based on different versions of a webpage or feature, businesses can evaluate the effectiveness of A/B tests, driving optimization based on real user reactions.
  • Feature Engagement Tracking: Go beyond generic user metrics and hone in on specific feature interactions. By doing so, one can determine which product features are resonating with users and which may require reevaluation or enhancement.

Final Thoughts:

Navigating Google Analytics data within BigQuery through SQL isn’t just a technical exercise — it’s a doorway to transformative business insights. As the digital landscape grows, mastering this trio can set you ahead in the data-driven decision-making game. Happy querying!

[Feel free to contact me through my Website or LinkedIn, or leave a message below for any questions, critiques, or compliments.] 😄

Go from SELECT * to interview-worthy project. Get our free 5-page guide.

--

--

Niloy Biswas
Niloy Biswas

Written by Niloy Biswas

A Data Science Enthusiast. Working as a Data Analyst at 10 Minute School.

Responses (1)