How to use SELECT FROM UNNEST to analyze multiple parameters in BigQuery for Analytics

(Man, I need shorter titles for my blog posts)

Todd Kerpelman
Firebase Developers
8 min read · Feb 4, 2019


Hey there, BigQuery-for-Google-Analytics-for-Firebase developers!

In our previous blog post, I showed you how to use the UNNEST function in BigQuery to analyze event parameters in your Google Analytics for Firebase data.

You need to use this function because normally, event parameters are stored as a repeated record, which you can think of as something like an array. By UNNESTing them, you’re able to “break apart” this array, place each individual event parameter into a new row, and then duplicate the original row for each of these individual parameters. Much like in this animation:
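If you want to see that flattening in action on a tiny scale, here's a minimal sketch you can paste straight into BigQuery. The inline table and its two parameter keys are made up for illustration; this isn't the real Firebase schema:

-- Toy data, not the Firebase schema: one row with a two-element repeated field.
WITH sample_events AS (
  SELECT "level_complete_quickplay" AS event_name, ["value", "board"] AS event_params
)
SELECT event_name, param
FROM sample_events, UNNEST(event_params) AS param
-- Returns two rows, one per element, with event_name duplicated onto each.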

Now this is really useful when you want to analyze one event parameter in your analytics data. But what happens if you want to analyze two at once?

For example, let’s take a look at our sample game of Flood-it — here’s the public data set if you want to try it yourself.

Now if you were to take a look at our level_complete_quickplay event, there are two parameters of interest to us: the value parameter, which tells us the user’s final score (i.e., how many moves it took them to clear the board), and the board parameter, which represents the board size.

This player got a score of 18 while playing on a small board

Since your score could vary quite a bit based on the size of the board you’re playing on, it probably makes sense to group our events by the board parameter before we start calculating any stats, like our users’ average score.

If we were to simply UNNEST these event parameters like so:

SELECT event_name, event_timestamp, user_pseudo_id, param
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")

You can see that we have all the information we need, but it’s spread across separate rows, which makes analyzing it kinda tough.

So one solution is to simply re-group these events by looking at which rows have the same user_pseudo_id and event_timestamp. If two rows have the same values in these two columns, we can be pretty sure they belong to the exact same event. So we just need to do a little work to consolidate them together…

SELECT 
-- Rows that share a user_pseudo_id and event_timestamp came from the same event,
-- so MAX(IF(...)) folds the "value" and "board" rows back into one row per event.
MAX(IF(param.key = "value", param.value.int_value, NULL)) AS score,
MAX(IF(param.key = "board", param.value.string_value, NULL)) AS board_type

FROM (
SELECT event_name, event_timestamp, user_pseudo_id, param
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")
)
GROUP BY user_pseudo_id, event_timestamp

…and then we go ahead and analyze these two parameters however we’d like.

But this feels unnecessarily complicated to me. Are we really going to be breaking up all of our parameters into separate rows only to combine them together again a few moments later? It kinda bothers me on a philosophical level, and I’m guessing it’s also pretty inefficient.

SELECT FROM UNNEST to the rescue!

Luckily, there’s another way to fix this by using SELECT FROM UNNEST. When you use SELECT FROM UNNEST, you’re basically saying, “Hey, I want to UNNEST this repeated record into its own little temporary table. Once you’ve done that, then go ahead and select one value from it, and place that into our results as if it were any other column.”

I know that’s a little hard to grok at first, so let’s take a look at our example.

First thing I’m going to do is remove our original UNNEST statement and simplify our query to look at all our level complete events.

SELECT event_name, event_timestamp, user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"

Next, I’m going to ask BigQuery to SELECT the value.int_value column from our UNNESTed event_params array, where the key of the event parameter equals "value".

SELECT event_name, event_timestamp, user_pseudo_id, 
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score

FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"

What ends up happening is something a little like this animation, where we’re expanding the event_params array into its own little temporary table, filtering for the one entry where key = "value", then grabbing the value.int_value from there.

And you get some results that look a little like this.

Once again, we have our data from our original event, and then we have the value of a single event parameter extracted alongside each row.

Now, there are three important notes to keep in mind when you’re making SELECT FROM UNNEST calls like this:

First, you have to make sure you put the entire SELECT FROM UNNEST call in parentheses. If you don’t, BigQuery freaks out over the fact that you have two SELECT statements and it won’t run your query.

Second, as far as I can tell, this will only work if each SELECT FROM UNNEST call returns no more than one value (under the hood, it’s a scalar subquery); there’s a small sketch a little further down that shows one way to guard against a key appearing twice.

Third, you can use this technique multiple times in the same SELECT statement! So now we can take our previous select statement and add on a second SELECT FROM UNNEST call to get our board parameter alongside our value parameter…

SELECT event_name, event_timestamp, user_pseudo_id,  
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score,
(SELECT value.string_value FROM UNNEST(event_params)
WHERE key = "board") AS board_size

FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"

…and here they are all nicely organized in the same row.
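A quick aside on that second note: each parameter key should only show up once per event in this data set, but if you’re ever working with data where a key might repeat, the scalar subquery will fail. One way to guard against that (assuming any duplicate values would be interchangeable) is to tack on a LIMIT 1. Here’s a small sketch of that; it’s not something you need for the queries in this post:

-- Sketch only: LIMIT 1 keeps the subquery down to a single value, in case a key
-- ever appeared more than once in event_params (it shouldn't in this data set).
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value" LIMIT 1) AS score
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"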

Then we can go ahead and do any kind of analysis we want, and it’s easy to break up our scores by the size of our game board. Here, let’s figure out the average score and standard deviation for each game board type:

SELECT AVG(score) AS average, STDDEV(score) as std_dev, board_size
FROM (

SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS score,
(SELECT value.string_value FROM UNNEST(event_params)
WHERE key = "board") AS board_size
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "level_complete_quickplay"
)
GROUP BY board_size

Not surprisingly, it takes more moves to clear out a larger board.

Analyze event parameters and user properties together, too!

You can also use this SELECT FROM UNNEST technique to analyze event parameters and user properties together.

For example, the Flood-it folks use the spend_virtual_currency event to track when a user spends “extra steps” at the end of a round. The value parameter tracks how many steps they spend per event. We’re also tracking how many extra steps a user is initially seeded with via the initial_extra_steps user property.

So let’s say we want to find out whether there’s any correlation between how many steps a user is initially given, and how many steps they spend during a “use extra steps” event. This involves looking at the value parameter and the initial_extra_steps user property together. Once again, both of these values are bundled away in repeated records, so how can we analyze them?

In theory, you could do this by chaining together two UNNEST calls like so.

SELECT event_name, event_timestamp, user_pseudo_id, 
param.value.int_value AS moves_used,
userprop.value.string_value AS initial_extra_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`,
UNNEST (event_params) as param,
UNNEST (user_properties) as userprop
WHERE event_name = "spend_virtual_currency"
AND param.key = "value"
AND userprop.key = "initial_extra_steps"

This would work, but it would do the job by first UNNESTing every row against every parameter in your spend_virtual_currency event, and then UNNESTing each of those rows against every user property we’re recording. This means that we’re essentially multiplying every row in our BigQuery data set by (number of parameters * number of user properties). If an event has 5 parameters and you’re tracking 10 user properties, for example, that single event row becomes 50 rows before the WHERE clause filters them back down. This can get pretty big pretty fast!

I spent more time on this animation than I care to admit.

Now, maybe BigQuery does some magic under the hood to pre-optimize this, but given that I’ve seen BigQuery engineers visibly wince every time I mention using UNNEST multiple times, I’m not super convinced here.

Luckily, we can once again call upon SELECT FROM UNNEST to help us out here. First, I’ll get the value of our value parameter, much like in our first example:

SELECT event_name, event_timestamp, user_pseudo_id, 
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used

FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"

And then I can also get the value of our user property. (Note that this value is stored as a string, so I’m going to go ahead and cast it as an integer.)

SELECT event_name, event_timestamp, user_pseudo_id, 
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"

…and now I have each of these values right next to each other in the same row. Hooray!

Not only is this code more efficient (I think), but it’s also easier to understand.

From here, it’s pretty easy to analyze these parameters. We can see the average number of extra steps used for each value of the initial_extra_steps user property…

SELECT AVG(steps_used) AS average_steps_used, initial_steps 
FROM (

SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
)
WHERE initial_steps IS NOT NULL
GROUP BY initial_steps
ORDER BY initial_steps

…or we could see if there’s any kind of correlation between these two values.

SELECT CORR(steps_used, initial_steps) AS correlation 
FROM (
SELECT event_name, event_timestamp, user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "value") AS steps_used,
CAST(
(SELECT value.string_value FROM UNNEST(user_properties)
WHERE key = "initial_extra_steps")
AS int64) AS initial_steps
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "spend_virtual_currency"
)
WHERE initial_steps IS NOT NULL

So there ya go, folks! Between using UNNEST and SELECT FROM UNNEST, you can make quick work of all of those repeated records that Google Analytics for Firebase likes to use in their BigQuery schema. And, as it turns out, these are the same kinds of repeated records that also show up in Crashlytics and Cloud Messaging data. So I recommend taking the time to get used to these techniques, because they’ll definitely make your life a lot easier.

So, what else do you want to know about using BigQuery with your Firebase data? Let me know in the comments below, and I can work on another post!
