Analyzing Custom Retention Cohorts Using BigQuery and Google Analytics for Firebase

Retention is a stat a lot of app developers care deeply about; after all, if you’re making an app that your users keep coming back to week after week, it probably means you’ve designed something really useful that meets their needs. From a business standpoint, high retention is good because it means you don’t need to spend as much on acquiring new users, and the users you have are probably more likely to tell friends about your app or make an in-app purchase.

So that’s why a lot of developers care about the Retention report that you get out-of-the-box when you add Google Analytics for Firebase to your project. This report tells you, out of the users who first opened your app during one particular week, how many of them came back in subsequent weeks. Here, for instance, is the retention report for our demo game, Flood-It!

We can see that out of our users who started playing during the week of February 3rd, about 25.7% of them came back to our app a week later. About 16.8% came back two weeks after their first open, and so on. Note that this report uses “Classic N-day Retention”, meaning that if a user uses the app during week 0 and then comes back during week 3 (but none of the weeks in between), they’re still counted among the week 3 users. This means that it’s possible for retention to go up from one week to the next.

There’s already a bit of customization you can do with this retention report. You can filter it by platform or by campaign, so you can find out which user acquisition campaigns are working best from a retention standpoint. But sometimes you might want to do a little more custom analysis. And that’s where BigQuery comes in. Let’s see how to create our own version of this retention report so we can customize it further!

Note: For the purpose of this tutorial, I’ll be using the public demo data that we have available for all of you to use. This is a sanitized version of our data set, which allows you to play along at home with my sample code. It contains several months’ worth of data, which is enough for us to generate a pretty good retention report.

So first off, let’s start by gathering up a list of players who first started playing our game (i.e. executed a first_open event) during the week of, say, August 1st, 2018. To do this, I’m using user_pseudo_id, which is a string that is randomly generated per app instance, to represent a unique player. Now you might think that you could just query all of the entries in our events tables between August 1st and August 7th, like so:
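
-- A sketch of that naive query. I'm using the table name from the
-- public Flood-It! demo dataset here; swap in your own project's
-- analytics table.
SELECT DISTINCT user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_*`
WHERE event_name = "first_open"
  AND _TABLE_SUFFIX BETWEEN "20180801" AND "20180807"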

This kinda works, but there are two problems with this query.

First, events are organized into these tables based on their time in UTC (or GMT+0, depending on how you like to name your time zones). If you’re in a different time zone, you might want to define your days differently.

Second, it looks like there was some kind of bug or change in behavior where what I just said in the previous paragraph wasn’t always the case. If I look at this old sample data, I can see that the events in my August 1st table actually start at 7:01 AM UTC and end at 6:59 AM UTC on August 2nd. This doesn’t seem to happen with more recent data, so I’m guessing the team fixed the bug or changed the default behavior.

But still, rather than relying on analytics placing the right events into the right tables for me, I generally prefer searching through my tables using the specific timestamps for the dates I’m interested in, in the time zone I’m interested in.

Your handy cheat sheet for converting dates into Analytics timestamps

Google Analytics for Firebase records all of its event timestamps in microseconds. Yes, that’s one millionth of a second. So to turn any arbitrary date into an Analytics timestamp, we’ll first want to create a timestamp object in BigQuery, then use the UNIX_MICROS function to convert it to the format we need.

For example, either of these two expressions will give us August 1st, 2018 in microseconds in the UTC time zone:

UNIX_MICROS(TIMESTAMP "2018-08-01 00:00:00")
UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC"))

Since I am in California, I could use any of these expressions to specify August 1st, 2018 in the Los Angeles time zone:

UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "America/Los_Angeles"))
UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "-7:00"))
UNIX_MICROS(TIMESTAMP "2018-08-01 00:00:00-7")

If you were in Tokyo, you could write something like this:

UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "Asia/Tokyo"))
UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "+9:00"))

Not sure what text you can use to specify a time zone? BigQuery accepts time zone names from the IANA tz database; Wikipedia maintains a handy list of them.

With that in mind, let’s get a list of all users who first opened our app sometime after midnight on August 1st, and just before midnight on August 8th, by using timestamps.
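
A sketch of that query might look like this (same demo dataset as before):

SELECT DISTINCT user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_*`
WHERE event_name = "first_open"
  AND event_timestamp
      BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC"))
      AND UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
  AND _TABLE_SUFFIX BETWEEN "20180731" AND "20180808"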

You’ll notice that I’m expanding the range of tables in my _TABLE_SUFFIX clause to one day before and one day after the days I’m interested in. This helps make sure I capture events that might be sitting in the August 8th table but that I consider as happening on August 7th, and vice versa.

Now I have a list of all of my users who first used my app during that week. If I wrap this query in a SELECT COUNT(*), it looks like that’s 408 users.

Using the IN clause to create group subsets

Let’s calculate our week 1 retention! We want to find out how many users who used the app between August 8th and August 14th were among the group of players who started using the app the week before.

Luckily, this is easy to do with the IN clause. IN is an operator that returns true if a value exists in a list of values. So for example…

SELECT 1 IN (1, 3, 5)

…evaluates to true since 1 is in that list. In our case, we’re not going to use a hard-coded list like this; we’re going to use the results from that “week 0” query as our list. Then we’ll tell BigQuery that we’re only interested in users who appeared in that week 0 list.

So our query might look a little something like this:
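
-- A sketch: week 1 actives, restricted to the week 0 new-user list
SELECT DISTINCT user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_*`
WHERE event_timestamp
      BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
      AND UNIX_MICROS(TIMESTAMP("2018-08-15 00:00:00", "UTC"))
  AND _TABLE_SUFFIX BETWEEN "20180807" AND "20180815"
  AND user_pseudo_id IN (
    -- Our "week 0" query from before
    SELECT user_pseudo_id
    FROM `firebase-public-project.analytics_153293282.events_*`
    WHERE event_name = "first_open"
      AND event_timestamp
          BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC"))
          AND UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
      AND _TABLE_SUFFIX BETWEEN "20180731" AND "20180808"
  )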

Basically, I’m saying, “Let’s go ahead and create a list of active users, but only include them if they also appear in the list of new users from last week.” The only really tricky part here is making sure I have all of my dates right. It’s really easy to get an off-by-one error here.

Looking at these results, it looks as though I have 93 users who appeared in my week 1 list. If I divide 93 by 408, I get a week 1 retention rate of 22.8%.

Now I can go ahead and repeat this query, going back a week at a time and dividing those results by the original number of 408 to get my results for week 2 and week 3 and so on. But why spend minutes performing these manual steps when I could spend hours figuring out how to do it automatically?

Using WITH… AS to refer to results later

We’re going to use another bit of SQL to help us out here: the WITH… AS statement. Essentially, this lets us refer to the results of a query later by giving them a label. For example, I could say:

WITH executives AS (SELECT * FROM employees WHERE level > 9)
SELECT * FROM executives WHERE first_name = "Larry"

So let’s look at how we could convert our previous statement into something using a WITH clause:
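
-- A sketch of the same report, rewritten with WITH... AS:
WITH week_0_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE event_name = "first_open"
    AND event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
    AND _TABLE_SUFFIX BETWEEN "20180731" AND "20180808"
),
week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-15 00:00:00", "UTC"))
    AND _TABLE_SUFFIX BETWEEN "20180807" AND "20180815"
)
SELECT COUNT(*) AS week_1_retained
FROM week_0_users
JOIN week_1_users USING (user_pseudo_id)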

Note that in order to find the users in week 1 who were also in week 0, I couldn’t use the IN operator from above; I had to perform a JOIN instead. A JOIN is an inner join by default, which basically says, “Only select these user IDs if they appear in both of these tables.”

You’ll notice this gives me a result of 93 users, the same as above.

At this point, it’s fairly easy to copy-and-paste-and-modify my way to several weeks worth of data! Let’s try that:
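
-- Careful: there are 16 separate dates to keep straight in here
WITH week_0_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE event_name = "first_open"
    AND _TABLE_SUFFIX BETWEEN "20180731" AND "20180808"
    AND event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
),
week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180807" AND "20180815"
    AND event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-08 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-15 00:00:00", "UTC"))
),
week_2_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180814" AND "20180822"
    AND event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-15 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-22 00:00:00", "UTC"))
),
week_3_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180821" AND "20180829"
    AND event_timestamp
        BETWEEN UNIX_MICROS(TIMESTAMP("2018-08-22 00:00:00", "UTC"))
        AND UNIX_MICROS(TIMESTAMP("2018-08-29 00:00:00", "UTC"))
)
SELECT
  (SELECT COUNT(*) FROM week_0_users) AS week_0,
  (SELECT COUNT(*) FROM week_0_users JOIN week_1_users USING (user_pseudo_id)) AS week_1,
  (SELECT COUNT(*) FROM week_0_users JOIN week_2_users USING (user_pseudo_id)) AS week_2,
  (SELECT COUNT(*) FROM week_0_users JOIN week_3_users USING (user_pseudo_id)) AS week_3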

And then I can make this whole thing a little sub-query that I can then use to calculate percentages!
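
Sticking with the same week_0_users through week_3_users definitions as above, that might look a little something like this:

SELECT
  week_1 / week_0 AS week_1_pct,
  week_2 / week_0 AS week_2_pct,
  week_3 / week_0 AS week_3_pct
FROM (
  SELECT
    (SELECT COUNT(*) FROM week_0_users) AS week_0,
    (SELECT COUNT(*) FROM week_0_users JOIN week_1_users USING (user_pseudo_id)) AS week_1,
    (SELECT COUNT(*) FROM week_0_users JOIN week_2_users USING (user_pseudo_id)) AS week_2,
    (SELECT COUNT(*) FROM week_0_users JOIN week_3_users USING (user_pseudo_id)) AS week_3
)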

Let’s clean things up a little!

So this works perfectly fine, but it’s what we in the engineering world like to call “a bit of a mess.” What would happen if we suddenly wanted to change the start date of our report? I would need to change 16 different dates in that query, and I’m pretty sure I’d get at least one of them wrong.

Luckily, Felipe Hoffa was around to help me clean this up quite a bit. The trick is to once again use our friend WITH… AS to define our data set, as well as a few values that basically act like variables. For instance, we can go back and revise our “get all users in week 0” query to look a little something like this:
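
-- A sketch of the revised week 0 query. The values at the top of
-- analytics_data act like variables for the rest of the report.
WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
)
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = "first_open"
  AND event_timestamp BETWEEN start_day AND start_day + one_week_micros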

And now we can expand this out to cover all of our weeks again:
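
WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  -- Wide enough to cover every week in the report, plus a day of padding
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
),
week_0_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_name = "first_open"
    AND event_timestamp BETWEEN start_day AND start_day + one_week_micros
),
week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + one_week_micros AND start_day + 2 * one_week_micros
),
week_2_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + 2 * one_week_micros AND start_day + 3 * one_week_micros
),
week_3_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + 3 * one_week_micros AND start_day + 4 * one_week_micros
)
SELECT
  (SELECT COUNT(*) FROM week_0_users JOIN week_1_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_1_pct,
  (SELECT COUNT(*) FROM week_0_users JOIN week_2_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_2_pct,
  (SELECT COUNT(*) FROM week_0_users JOIN week_3_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_3_pct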

This gives us the same results as above, but it’s much easier to change. Now if our Business Analyst decides she wants a report starting on Sunday, July 29th, we only need to adjust the three values at the beginning:
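
-- Only the start_day and the two _TABLE_SUFFIX bounds change:
WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    UNIX_MICROS(TIMESTAMP("2018-07-29 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180728" AND "20180827"
)
-- ...and everything else stays exactly the same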

By the way, you might wonder if our query is made any less efficient now that we’re using the larger range of table suffixes for the entire dataset. The answer is that it might be a bit slower, since we now have to eliminate many more rows for each week, but because we’re analyzing the same amount of data as before in total, our BigQuery usage should be about the same from a pricing perspective.

Customize this, maybe?

Of course, all of this work is nice, but we’ve essentially just re-generated the same graphs we already get out of the box from the Firebase console. How about we run a few customized queries that we can only get by using BigQuery?

Retention for a specific version

Here’s a common example you might run into. It looks like our BigQuery data includes samples from many different versions of our app. Perhaps we’re interested in seeing what our retention looks like for just one specific version. That’s easy to do by adding the app version data into our analytics data set, then only selecting week 0 users who are using that specific version.
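
For instance, a sketch along these lines. (The version string here is a made-up example; use one that actually appears in your data. The export schema keeps this value in app_info.version.)

WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    app_info.version AS app_version,
    UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
),
week_0_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_name = "first_open"
    AND app_version = "2.62"  -- example version string
    AND event_timestamp BETWEEN start_day AND start_day + one_week_micros
)
-- week_1_users through week_3_users and the final SELECT are unchanged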

Note that I’m only specifying the app version in my week_0_users selection. This means that if a user upgrades their app (and therefore has a new version) along the way, they’ll still be included in my retention report. To be honest, I think this is the right approach. If we filtered every week by app version, we’d be excluding users as they upgrade, which would make my retention look unnaturally low.

Retention for a specific device type

We could similarly break out retention by device type or family. It turns out our public sample data doesn’t include any device type information because that was scrubbed to make our lawyers happy. But trust me when I say this query would work in a typical data set:
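
WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
    -- "tablet" is just an example; device.category also has values
    -- like "mobile" and "desktop"
    AND device.category = "tablet"
)
-- everything from week_0_users onward is unchanged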

Here it’s okay to filter for devices across our entire data set, because this value doesn’t change. If a user switches devices or plays the game on two different devices at once, they’ll be recorded with two different user_pseudo_ids.

Define different retention periods

If we wanted to define a different time period for our retention reports (like looking across every 3 days instead of a week), that would be really simple. We could just change the definition of one “week” in our report. For example, this change would be enough to look at retention cohorts across a 3-day period instead of a week.
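
-- Three days instead of seven:
3600 * 1000 * 1000 * 24 * 3 AS one_week_micros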

Of course, we now have this problem where our query is full of variables that are defined as a “week” which are no longer actually a week, which really bugs me. It kinda makes me wish I’d used a term like “interval” instead. Well, live and learn.

Also, keep in mind that if we extended this interval to more than 7 days, we’d also want to make sure our table suffix range is wide enough to cover the entire time period.

Break out retention by user properties

This one is a little tricky because user properties are subject to change throughout a user’s lifetime (and therefore depending on how you include this in your measurements, you could artificially increase or decrease your users’ retention). Also, most of them aren’t set when the first_open event is recorded, which means you’ll need to find them later on in the user’s event lifecycle.

In our example, we have a user property called initial_extra_steps which determines how many extra steps our user first receives when they start the game. We have this theory that players who were granted more initial steps played the game longer and kept coming back to it, which is something we can now validate using BigQuery.

While this user property doesn’t change once it’s set, it is set sometime after the first_open event. So to figure this out, I’m first going to create a list of users who ever had an initial_extra_steps value of 20.
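
Something like this should do the trick. (I’m assuming the value lands in value.string_value here; if your data set records it as a number, use value.int_value instead.)

SELECT DISTINCT user_pseudo_id
FROM `firebase-public-project.analytics_153293282.events_*`,
  UNNEST(user_properties) AS user_prop
WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
  AND user_prop.key = "initial_extra_steps"
  AND user_prop.value.string_value = "20"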

Handy tip: Not sure how that SELECT… FROM UNNEST() statement works? Make sure to check out my other tutorial on the topic.

Then I can filter my analytics_data set to include only the users who belong to this group.

Which gives me the full query:
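
-- A sketch of the full report, filtered to our extra-steps cohort
WITH users_with_extra_steps AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`,
    UNNEST(user_properties) AS user_prop
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
    AND user_prop.key = "initial_extra_steps"
    AND user_prop.value.string_value = "20"
),
analytics_data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    UNIX_MICROS(TIMESTAMP("2018-08-01 00:00:00", "UTC")) AS start_day,
    3600 * 1000 * 1000 * 24 * 7 AS one_week_micros
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
    AND user_pseudo_id IN (SELECT user_pseudo_id FROM users_with_extra_steps)
),
week_0_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_name = "first_open"
    AND event_timestamp BETWEEN start_day AND start_day + one_week_micros
),
week_1_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + one_week_micros AND start_day + 2 * one_week_micros
),
week_2_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + 2 * one_week_micros AND start_day + 3 * one_week_micros
),
week_3_users AS (
  SELECT DISTINCT user_pseudo_id
  FROM analytics_data
  WHERE event_timestamp
    BETWEEN start_day + 3 * one_week_micros AND start_day + 4 * one_week_micros
)
SELECT
  (SELECT COUNT(*) FROM week_0_users JOIN week_1_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_1_pct,
  (SELECT COUNT(*) FROM week_0_users JOIN week_2_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_2_pct,
  (SELECT COUNT(*) FROM week_0_users JOIN week_3_users USING (user_pseudo_id))
    / (SELECT COUNT(*) FROM week_0_users) AS week_3_pct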

Is this getting a little messy? Yes! Could I make it cleaner? I don’t know! Honestly, I’m just figuring things out as I go. (Bet you wish you knew that before you started this blog post, huh?)

Break out retention by events

The nice thing here is that it’s easy to take that query of “users who ever had an initial_extra_steps value of 20” and modify it to be any arbitrary group of users. For instance, what if we wanted to see what our retention would look like among users who ever retried a level? We could do that by changing our subquery to grab a list of users who ever encountered a level_retry_quickplay event.
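
As a sketch, only that first subquery changes; the rest of the query stays exactly as it was:

WITH users_who_retried AS (
  SELECT DISTINCT user_pseudo_id
  FROM `firebase-public-project.analytics_153293282.events_*`
  WHERE _TABLE_SUFFIX BETWEEN "20180731" AND "20180830"
    AND event_name = "level_retry_quickplay"
),
-- ...and analytics_data now checks against users_who_retried instead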

Have fun!

So those are a few ways that I was able to get some customized retention reports using BigQuery and my Google Analytics for Firebase data.

What about you? Do you have other fun reports you want to share? Want me to make other BigQuery tutorials? Let me know in the comments below!

If you’re interested in seeing some other audiences or fun tricks you can create with BigQuery, check out our official documentation, or some of my other blog posts.

Special thanks to Felipe Hoffa and Alin Jula for their BigQuery help!