How to use the UNNEST function in BigQuery to analyze event parameters in Analytics

Todd Kerpelman
Dec 14, 2018 · 7 min read

Using BigQuery is a great way to generate some custom in-depth analysis of your Google Analytics data, but to really unlock that data, it helps to know a few tricks. Today, let’s talk about one of the most important ones: Using the UNNEST function to analyze event parameters and user properties that you receive along with your Analytics data.

One of the trickier parts of working with Firebase data in BigQuery — and this applies not just to Analytics data but to Crashlytics data, too — is that your data is not organized in nice little rows and columns like this:

How you’re probably imagining your BigQuery data

Instead, your data consists of rows that maybe look a little more like JSON objects, where columns could contain a single bit of data, but they could also contain arrays, structs, or even arrays of structs. Maybe something a little more like this:

What your BigQuery data is really like

Working with this data can be weird at first, but once you learn a couple of tricks, it’s not so bad.

Let’s start by looking at our public sample data for our game of Flood-it. Feel free to follow along; this data is available for anybody to play with.

In particular, I’m interested in the level_complete_quickplay event. This is the event that we record when the user successfully completes a quickplay level in the game. Here, let’s take a look at a few:

If you run this, you’ll see a number of level complete events, along with a couple of other event parameters that seem interesting. We’ve got a value parameter, which is recording the number of moves it took for a player to complete the level, and a board parameter, which records the board size (S, M, or L)

You might also notice that each of these parameters doesn’t have a single value column, but several: int_value, string_value, double_value and float_value.

Basically, you can think of this as kind of a struct that we use to store your parameter values, and in practice most of them end up being nil. But it’s a nice way of making sure we can take a number of different parameters with a bunch of different formats and store them in a consistent way.

So, let’s figure out the average user score just by looking at event parameters where key is equal to “value” . Seems pretty simple, right?

Oh, dear. That’s… a really weird error message.

The problem here is that event_params is essentially an array (actually in BigQuery parlance it’s a “repeated record”, but you can think of it as an array). So while it might contain several rows that themselves have a key field, it doesn’t have one itself.

This is where the UNNEST function comes in. It basically lets you take elements in an array and expand each one of these individual elements. You can then join your original row against each unnested element to add them to your table.

To use a simpler example, let’s imagine we have a table full of spaceships. And crew is an array, much like event_params was in our analytics tables.

Right now, if I wanted to find spaceships that contain a crew member named Zoe, I couldn’t do that with a line like this.

But if I were to write something like this…

What BigQuery will do is take every individual member in my crew array, and add it on as a new column in my row called crew_member It will repeat my original row as needed to accompany each new value. Kinda like in this nifty animation:

Once I’ve done that, it’s easy to select spaceships with crew members named Zoe.

You’ll find that in practice, though, most BigQuery developers will replace the CROSS JOIN with a comma, like so:

It still does the same thing, it just looks cooler. (BigQuery developers are all about looking cool.)

So going back to our example, we essentially have an array (or repeated record) of event_params. So if I were to say:

…what BigQuery will do is take each individual event parameter and expand them out into a new column called param, and repeat the original row as needed. Like this:

You’ll notice that param itself is still a nested object. But it’s a single record, not a repeating one. (Again, kinda like a struct.) So we can query for things like param.key or param.value.int_value. Specifically, we can now look just for rows where param.key is equal to “value”.

Since, in practice, it’s a little weird to have the full repeated event_params object alongside each unnested parameter, I usually drop it and just look at the param object instead.

So with all that in mind, let’s go back to the task of analyzing the average score for users who complete a quickplay level. I can now grab just those events along with the unnested value parameter.

And then, it’s really easy to grab the actual value of the value parameter by looking at param.value.int_value.

From there, I can perform interesting tasks like calculating the mean, getting some quantiles, and figuring out the standard deviation…

Hey, maybe while I’m at it, let’s make a quick histogram of all the values!

So… this is interesting. There’s a pretty big peak around 21 or 22 moves… but if you keep looking, it seems like there’s also a couple of smaller peaks around 29 moves and in the 34–36 range.

The reason for that is the other parameter I was talking about — the board parameter. While the vast majority of games are being played on a small board, there’s a number of games being played on medium and large boards too, and those probably account for those smaller peaks.

In fact, we can kinda see that by looking at both the board and the value parameters for our event.

Sure enough, it seems like the “value” parameter is higher when it follows a board of M or L.

But how can we actually prove this out in BigQuery? Right now, making this analysis is kinda difficult because we have different values in different rows.

Now, there is a way we can bunch these rows together. If we add back the pseudo_user_id (essentially, a unique ID assigned to each app instance) and event_timestamp to each of our columns, we can then group together parameters for the same event based on those two values.

And then we can analyze score by board type…

But this seems like a pretty awkward way of doing things. Wouldn’t it be better if we could find a way of more easily getting multiple values in the same row? Well it turns out you can, and we’ll find out how… in my next post! (Woo! Cliffhanger ending!!)

Firebase Developers

Tutorials, deep-dives, and random musings from Firebase…

Firebase Developers

Tutorials, deep-dives, and random musings from Firebase developers all around the world. Views expressed are those of the authors and don’t necessarily reflect those of Firebase or its parent companies.

Todd Kerpelman

Written by

Former Game Designer. Current Developer Advocate at Google. Dad. Often sleep deprived.

Firebase Developers

Tutorials, deep-dives, and random musings from Firebase developers all around the world. Views expressed are those of the authors and don’t necessarily reflect those of Firebase or its parent companies.