How to use SELECT FROM UNNEST to analyze multiple parameters in BigQuery for Analytics
(Man, I need shorter titles for my blog posts)
Hey there, BigQuery-for-Google-Analytics-for-Firebase developers!
In our previous blog post, I showed you how to use the UNNEST function in BigQuery to analyze event parameters in your Google Analytics for Firebase data. You need to use this function because normally, event parameters are stored as a repeated record, which you can think of as something like an array. By UNNESTing them, you’re able to “break apart” this array, place each individual event parameter into a new row, and then duplicate the original row for each of these individual parameters. Much like in this animation:
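By the way, if you want to see those mechanics in isolation, here’s a little toy query you can paste straight into BigQuery. It uses a hand-built literal array instead of the real analytics schema, so the names here are made up purely for illustration:

```sql
-- A single hand-built row with a repeated field. UNNEST turns the
-- two-element array into two rows, and the event_name value gets
-- duplicated alongside each one.
SELECT event_name, param
FROM (SELECT "level_start" AS event_name,
             ["firebase_screen", "firebase_event_origin"] AS event_params),
     UNNEST(event_params) AS param
```

You should get back two rows, both with the same event_name, each carrying one element of the array.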
Now this is really useful when you want to analyze one event parameter in your analytics data. But what happens if you want to analyze two at once?
For example, let’s take a look at our sample game of Flood-it — here’s the public data set if you want to try it yourself.
Now if you were to take a look at our level_complete_quickplay event, there are two parameters that are of interest to us: the value parameter, which tells us the user’s final score (i.e. how many moves it took them to clear the board), and the board parameter, which represents the board size.
Since your score could vary quite a bit based on the board you’re playing, it probably makes sense to group our events by the board parameter before we start calculating any stats like our users’ average score.
If we were to simply UNNEST these event parameters like so:
SELECT event_name, event_timestamp, user_pseudo_id, param
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")
You can see that we have all the information we need, but it’s spread across separate rows, which makes analyzing it kinda tough.
So one solution is to simply re-group these events by looking at which rows have the same user_pseudo_id and event_timestamp. If two rows have the same values in these two columns, we can be pretty sure they belong to the exact same event. So we just need to do a little work to consolidate them together…
SELECT
MAX(if(param.key = "value", param.value.int_value, NULL)) AS score,
MAX(if(param.key = "board", param.value.string_value, NULL)) AS board_type
FROM (
SELECT event_name, event_timestamp, user_pseudo_id, param
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")
)
GROUP BY user_pseudo_id, event_timestamp
…and then we go ahead and analyze these two parameters however we’d like.
But this feels unnecessarily complicated to me. Are we really going to be breaking up all of our parameters into separate rows only to combine them together again a few moments later? It kinda bothers me on a philosophical level, and I’m guessing it’s also pretty inefficient.
SELECT FROM UNNEST to the rescue!
Luckily, there’s another way to fix this by using SELECT FROM UNNEST. When you use the SELECT FROM UNNEST technique, you’re basically saying, “Hey, I want to UNNEST this repeated record into its own little temporary table. Once you’ve done that, then go ahead and select one row from it, and place that into our results as if it were any other value.”
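In miniature, the pattern looks something like this toy query, which uses a hand-built array of key/value structs rather than the real event_params schema (so the values here are made up for illustration):

```sql
-- A scalar subquery: UNNEST the array into a temporary table, keep
-- the one element whose key matches, and return its value as an
-- ordinary column in the outer query.
SELECT
  (SELECT val
   FROM UNNEST([STRUCT("board" AS key, "S" AS val),
                STRUCT("value" AS key, "17" AS val)])
   WHERE key = "board") AS board
```

This returns a single row with a single board column containing “S”.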
I know that’s a little hard to grok at first, so let’s take a look at our example.
First thing I’m going to do is remove our original UNNEST statement and simplify our query to look at all our level complete events.
SELECT event_name, event_timestamp, user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
Next, I’m going to ask BigQuery to SELECT the value.int_value column from our UNNESTed event_params array, where the key of the event parameter equals "value".
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
What ends up happening is something a little like this animation, where we’re expanding the event_params array into its own little temporary table, filtering for the one entry where key = "value", then grabbing the value.int_value from there.
And you get some results that look a little like this.
Once again, we have our data from our original event, and then we have the value of a single event parameter extracted alongside each row.
Now, there are three important notes to keep in mind when you’re making SELECT FROM UNNEST calls like this:
First, you have to make sure you put the entire SELECT FROM UNNEST call in parentheses. If you don’t, BigQuery freaks out over the fact that you have two SELECT statements and it won’t run your query.
Second, this only works if each SELECT FROM UNNEST call returns no more than one value; if the subquery produces multiple rows, BigQuery will raise an error when you run the query.
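If you’re worried that a key might somehow show up more than once within a single event, one defensive option (my own addition here, not something this data set actually requires) is to tack on a LIMIT 1 so the subquery can never return more than one row:

```sql
SELECT event_name, event_timestamp, user_pseudo_id,
  -- LIMIT 1 guarantees the scalar subquery yields at most one row,
  -- even if the same key appeared twice in event_params.
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = "value"
   LIMIT 1) AS score
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
```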
Third, you can use this technique multiple times in the same SELECT statement! So now we can take our previous select statement and add on a second SELECT FROM UNNEST call to get our board parameter alongside our value parameter…
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score,
(SELECT value.string_value FROM UNNEST(event_params)
WHERE key = "board") AS board_size
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
…and here they are all nicely organized in the same row.
Then we can go ahead and do any kind of analysis we want, and it’s easy to break up our scores by the size of our game board. Here, let’s go ahead and figure out the average score and standard deviation for each game board type:
SELECT AVG(score) AS average, STDDEV(score) AS std_dev, board_size
FROM (
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score,
(SELECT value.string_value FROM UNNEST(event_params)
WHERE key = "board") AS board_size
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
)
GROUP BY board_size
Analyze event parameters and user properties together, too!
You can also use this SELECT FROM UNNEST technique to analyze event parameters and user properties together.
For example, the Flood-it folks use the spend_virtual_currency event to track when a user spends “extra steps” at the end of a round. The value parameter tracks how many steps they spend per event. We’re also tracking how many extra steps a user is initially seeded with via the initial_extra_steps user property.
So let’s say we want to find out whether there’s any correlation between how many steps a user is initially given, and how many steps they spend during a “use extra steps” event. This involves looking at the value parameter and the initial_extra_steps user property together. Once again, both of these values are bundled away in repeated records, so how can we analyze them?
In theory, you could do this by chaining together two UNNEST calls like so.
SELECT event_name, event_timestamp, user_pseudo_id,
param.value.int_value AS moves_used,
userprop.value.string_value AS initial_extra_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST(event_params) AS param,
UNNEST(user_properties) AS userprop
WHERE event_name = "spend_virtual_currency"
AND param.key = "value"
AND userprop.key = "initial_extra_steps"
This would work, but it would do the job by first UNNESTing every row against every parameter in your spend_virtual_currency event, and then UNNESTing each of those rows against every user property we’re recording. This means that we’re essentially multiplying every row in our BigQuery data set by (number of parameters * number of user properties). This can get pretty big pretty fast!
Now, maybe BigQuery does some magic under the hood to pre-optimize this, but given that I’ve seen BigQuery engineers visibly wince every time I mention using UNNEST multiple times, I’m not super convinced here.
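If you want to get a feel for the blow-up yourself, here’s a rough sanity check (my own quick experiment, not an official benchmark) that compares the row count of the double-UNNEST join against the number of underlying events:

```sql
-- Rows produced by the double UNNEST vs. the actual number of events.
-- Expect exploded_rows to be much larger than actual_events, roughly
-- (params per event) * (user properties per user) times larger.
SELECT
  (SELECT COUNT(*)
   FROM `firebase-public-project.analytics_153293282.events_20181003`,
        UNNEST(event_params), UNNEST(user_properties)
   WHERE event_name = "spend_virtual_currency") AS exploded_rows,
  (SELECT COUNT(*)
   FROM `firebase-public-project.analytics_153293282.events_20181003`
   WHERE event_name = "spend_virtual_currency") AS actual_events
```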
Luckily, we can once again call upon SELECT FROM UNNEST to help us out here. First, I’ll get the value of our value parameter, much like in our first example:
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
And then I can also get the value of our user property. (Note that this value is stored as a string, so I’m going to go ahead and cast it to an integer.)
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
…and now I have each of these values right next to each other in the same row. Hooray!
Not only is this code more efficient (I think), but it’s also easier to understand.
From here, it’s pretty easy to analyze these parameters. We can see what our average extra steps used value is for each user property type…
SELECT AVG(steps_used) AS average_steps_used, initial_steps
FROM (
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
)
WHERE initial_steps IS NOT NULL
GROUP BY initial_steps
ORDER BY initial_steps
…or we could see if there’s any kind of correlation between these two values.
SELECT CORR(steps_used, initial_steps) AS correlation
FROM (
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
)
WHERE initial_steps IS NOT NULL
So there ya go, folks! Between using UNNEST and SELECT FROM UNNEST, you can make quick work of all of those repeated records that Google Analytics for Firebase likes to use in its BigQuery schema. And, as it turns out, these are the same kinds of repeated records that also show up in Crashlytics and Cloud Messaging data. So I recommend taking the time to get used to these techniques, because they’ll definitely make your life a lot easier.
So, what else do you want to know about using BigQuery with your Firebase data? Let me know in the comments below, and I can work on another post!