How Do I Create a Closed Funnel in Google Analytics for Firebase Using BigQuery? (Part 2)

Todd Kerpelman
Firebase Developers
15 min readDec 3, 2018

--

Let’s build a multi-part funnel!

So if you’re just joining us here in Adventures in BigQuery land, we spent our previous post learning how to build a two-part closed funnel with our analytics data. Along the way, we learned…

  • Closed Funnels are probably what you traditionally think of when you think of a funnel, whereas Open Funnels are more like “a couple of event counts conveniently displayed next to each other”
  • We have publicly available sample data that you can experiment with, and we’re going to use this sample as our working example moving forward, although keep in mind that some bits have been “sanitized to make our lawyers happy”
  • By using user_pseudo_id and event_timestamp, you can get a sense of our user journeys throughout the app
  • With the LEAD OVER function, you can “peek ahead” in your database table. In our case, we used this function to see what event a user encountered after their current one.
  • Subqueries allow you to take the results from one query, and use them as the starting table for another query. It’s an easy way to take a complicated query and break it down into small, easy-to-digest steps, even if the nested query does look a little intimidating at first.
  • With a few quick changes, we can change our funnel to count users instead of events by counting distinct user_pseudo_id values
  • With another small set of changes, we can also change our funnel to be time-sensitive by looking at the timestamp of different events.

So let’s move on to our next example: Let’s create a 3-part funnel!

Once again, we’re going to use our public demo project, a puzzle game called Flood-it! as our example.

I’m 90% sure this game was made without an art budget.

Flood-it has three Analytics events we’re interested in for this example…

  • first_open, an event that Google Analytics for Firebase automatically records for when users open the app for the first time.
  • completed_5_levels, when a user has completed level 5.
  • use_extra_steps, when a player decides to use one of the limited “extra steps” available in the game to complete a level.

One day, our producer wants to know how many people had to use extra steps before completing level 5. We don’t expect this number to be very high, since the first five levels are pretty easy. But there are probably a few players out there who needed the help.

Again, if we tried to measure this with the open funnel from the Firebase dashboard, we’d probably end up with something like this:

Probably not the results we’re expecting, and again that’s because most of our players will be using their extra steps sometime after level 5. We want to just measure the players who used extra steps before completing level 5. Once again, this seems like a good opportunity for a closed funnel!

So like before, let’s start by looking at the events we’re interested in, and we’ll order them by our pseudo-id and timestamp

SELECT event_name, user_pseudo_id , event_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3

This once again gives us user journeys, which we can essentially read by scanning down through our table. Now, like before, we once again use the LEAD OVER function to create “triplets” of events, so we can view trios of events side by side.

SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id
ORDER BY event_timestamp) AS next_event,
LEAD (event_name, 2) OVER (PARTITION BY user_pseudo_id
ORDER BY event_timestamp) AS third_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3

Once we’ve done that, it seems like it would be fairly easy to start counting funnel events, right? All first_open events will be marked as the first part of our funnel. Any first_open events followed by use_extra_steps events will be marked as the second part of our funnel, and so on. Then we can simply count those up to get our three parts of the funnel.

SELECT countif(event_name = "first_open") AS funnel_1_count, 
countif(event_name = "first_open" AND next_event =
"use_extra_steps") AS funnel_2_count,
countif(event_name = "first_open" AND next_event =
"use_extra_steps" AND third_event = "completed_5_levels") AS
funnel_3_count
FROM (
SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id
ORDER BY event_timestamp) AS next_event,
LEAD (event_name, 2) OVER (PARTITION BY user_pseudo_id
ORDER BY event_timestamp) AS third_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
)

So… are we done? Well, you can probably tell by the fact that Medium isn’t asking you to clap for me yet that we’re not.

It turns out there’s one problem with our approach, which may or may not be applicable to your own funnel. In Flood-it, people can (and do) hit the use_extra_steps event multiple times.

So, for example, let’s take a look at user FCFF0D7C19A902C4DB4EAF8DC586105B in our events table.

You can see from the table above that our user hit a first_open event, used extra steps, used extra steps two more times, and then hit the completed_5_levels event. This is fine, but when we then turn this sequence of events into triplets, we end up with a few rows that don’t fit into our neat and clean funnel definitions, so they get excluded from our funnel count.

Notice that none of these rows fits our definition of “Part 3 of our funnel”

Now, in some situations, maybe this is okay. And it’s quite possible that this is an issue you’ll never encounter. But given the context of our producer’s question, we’ll still want to include this user’s journey in our funnel count.

So how do we fix this? In this particular case, I think the easiest way would simply be to ignore every case where our user triggers more than one of the same event in a row. Eliminating duplicates in this case would eliminate the “noise” from our event sequence, and ensure we get the right event triplets for our funnel definitions.

So let’s take a look at our inner query again. This time, I’ll use LEAD OVER to see what the next event is…

SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS next_event

FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
This is just a list of the events our users encountered, followed by whatever they encountered next

And then we can run a query on these results to eliminate any row where the current event and the next event are the same. Note that the != operator at the end there won’t actually work if next_event is null, which is why I need the OR operator there, too…

SELECT event_name, user_pseudo_id, event_timestamp, next_event FROM
(
SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
)
WHERE event_name != next_event OR next_event IS NULL
All those pesky duplicate events are gone!

Now we can go back and reapply our logic to gather the first, second, and third events from this new “no duplicate events” table…

SELECT event_name as event_1,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
user_pseudo_id, event_timestamp
FROM
(
SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
)
WHERE event_name != next_event OR next_event IS NULL

And now, we can finally go back and count up the cases we’re interested in to measure our funnel.

SELECT COUNTIF(event_1 = "first_open") AS funnel_1_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps") AS
funnel_2_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps" AND
event_3 = "completed_5_levels") AS funnel_3_count
FROM
(
SELECT event_name as event_1,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
user_pseudo_id, event_timestamp
FROM
(
SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
)
WHERE event_name != next_event OR next_event IS NULL
)

Hooray!

Oh, and what if our producer want to find out how many people completed 5 levels without using any extra steps? We can find that out, too, simply by looking for that particular series of events in our table.

SELECT COUNTIF(event_1 = "first_open") AS funnel_1_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps") AS
funnel_2_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps" AND
event_3 = "completed_5_levels") AS funnel_3_count,
COUNTIF(event_1 = "first_open" AND event_2 = "completed_5_levels")
AS completed_without_steps

FROM
(
SELECT event_name as event_1,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
user_pseudo_id, event_timestamp
FROM
(
SELECT event_name, user_pseudo_id , event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2,3
)
WHERE event_name != next_event OR next_event IS NULL
)
I was little surprised how few people reach level 5, but I’ve realized this doesn’t include players who play the more popular “Quickplay” mode

Variations: Counting users, and time restrictions

What if our producer wants to make sure we’re counting by users, instead of individual events? Well, I think we would follow the same technique as in the previous blog post. First, we’ll go ahead and record our user’s pseudo-id if they meet the criteria to hit the first, second, or third parts of our funnel.

SELECT
IF(event_1 = "first_open", user_pseudo_id, NULL) AS funnel_1_user,
IF(event_1 = "first_open" AND event_2 = "use_extra_steps",
user_pseudo_id, NULL) AS funnel_2_user,
IF(event_1 = "first_open" AND event_2 = "use_extra_steps"
AND event_3 = "completed_5_levels", user_pseudo_id, NULL) AS
funnel_3_user
FROM (
SELECT event_name AS event_1,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
user_pseudo_id, event_timestamp
FROM (
SELECT event_name, user_pseudo_id, event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2, 3
)
WHERE event_name != next_event OR next_event IS NULL
)

Then we can use COUNT(DISTINCT) to make sure each user only gets included once.

SELECT COUNT(DISTINCT funnel_1_user) as users_1,
COUNT(DISTINCT funnel_2_user) as users_2,
COUNT(DISTINCT funnel_3_user) as users_3
FROM (
SELECT
IF(event_1 = "first_open", user_pseudo_id, NULL) AS funnel_1_user,
IF(event_1 = "first_open" AND event_2 = "use_extra_steps",
user_pseudo_id, NULL) AS funnel_2_user,
IF(event_1 = "first_open" AND event_2 = "use_extra_steps"
AND event_3 = "completed_5_levels", user_pseudo_id, NULL) AS
funnel_3_user
FROM (
SELECT event_name AS event_1,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
user_pseudo_id, event_timestamp
FROM (
SELECT event_name, user_pseudo_id, event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2, 3
)
WHERE event_name != next_event OR next_event IS NULL
)
)

In our case, the results are exactly the same as our event counts, but that makes sense if you think about it. Our users are only going to hit the first_open and completed_5_levels events once per lifetime, anyway. And we’ve explicitly eliminated any situations where you could hit the use_extra_steps event multiple times.

And, yes, we could also add in time restrictions as well. Let’s say we want to find out how many people made it through each step of our funnel in 15 minutes. We would start by adding some timestamps to each of our events in the inner query…

SELECT event_name AS event_1, event_timestamp as event_1_timestamp,
LEAD(event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_2,
LEAD(event_timestamp, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS event_2_timestamp,
LEAD(event_name, 2) OVER (PARTITION BY user_pseudo_id ORDER BY
event_timestamp) AS event_3,
LEAD(event_timestamp, 2) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS event_3_timestamp,
user_pseudo_id, event_timestamp
FROM (
SELECT event_name, user_pseudo_id, event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2, 3
)
WHERE event_name != next_event OR next_event IS NULL

…although it’s at this point that my co-worker Felipe noted that I could simplify all these redundant PARTITION BY calls by using a WINDOW call instead.

SELECT event_name AS event_1, event_timestamp as event_1_timestamp,
LEAD(event_name, 1) OVER (user_event_window) AS event_2,
LEAD(event_timestamp, 1) OVER (user_event_window) AS
event_2_timestamp,
LEAD(event_name, 2) OVER (user_event_window) AS event_3,
LEAD(event_timestamp, 2) OVER (user_event_window) AS
event_3_timestamp,
user_pseudo_id, event_timestamp
FROM (
SELECT event_name, user_pseudo_id, event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2, 3
)
WHERE event_name != next_event OR next_event IS NULL
WINDOW user_event_window AS (PARTITION BY user_pseudo_id ORDER BY
event_timestamp)

From here, we can add some time requirements to our outer query. Remember that Google Analytics timestamps are measured in microseconds, so 1 second is 1 * 1000 * 1000 microseconds. You can see that our final query is pretty verbose, but it’s nothing too complicated if you take the time to break it down.

SELECT COUNTIF(event_1 = "first_open") as funnel_1_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps" AND
event_2_timestamp - event_1_timestamp < 15 * 60 * 1000 * 1000)
as funnel_2_count,
COUNTIF(event_1 = "first_open" AND event_2 = "use_extra_steps" AND
event_3 = "completed_5_levels" AND event_2_timestamp -
event_1_timestamp < 15 * 60 * 1000 * 1000
AND event_3_timestamp - event_2_timestamp < 15 * 60 * 1000 *
1000) as funnel_3_count
FROM (
SELECT event_name AS event_1, event_timestamp as event_1_timestamp,
LEAD(event_name, 1) OVER (user_event_window) AS event_2,
LEAD(event_timestamp, 1) OVER (user_event_window) AS
event_2_timestamp,
LEAD(event_name, 2) OVER (user_event_window) AS event_3,
LEAD(event_timestamp, 2) OVER (user_event_window) AS
event_3_timestamp,
user_pseudo_id, event_timestamp
FROM (
SELECT event_name, user_pseudo_id, event_timestamp,
LEAD (event_name, 1) OVER (PARTITION BY user_pseudo_id ORDER
BY event_timestamp) AS next_event
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name IN ("first_open", "use_extra_steps",
"completed_5_levels")
ORDER BY 2, 3
)
WHERE event_name != next_event OR next_event IS NULL
WINDOW user_event_window AS (PARTITION BY user_pseudo_id ORDER BY
event_timestamp)
)

And Now for Something Completely Different: Calculating Funnels by LEFT JOINs

There’s one more case that might be problematic. What happens if a user goes through one of these two flows?

first_open -> use_extra_steps -> first_open -> completed_5_levels

first_open -> completed_5_levels -> use_extra_steps -> completed_5_levels

Right now, that first sequence would hit part 1 (first_open) and part 2 of our funnels (first_open -> use_extra_steps), but not part 3. Similarly, that second sequence would only hit part 1 of our funnel.

Now, granted, both of these sequences are essentially impossible in our current setup, since the first_open and completed_5_levels events really only happen once. And, quite possibly, depending on your situation, excluding them from your funnel might be the right thing to do. This is a situation when you’re doing to need to look at the question you’re really trying to answer and figure out what the “correct” calculation is.

But what if we did want to include these cases in our funnel somehow? What if we decide that both of these events should qualify as hitting all three parts of our funnel?

I think, after trying a number of different combinations, the best way to accomplish this is to try a completely different technique, one that I found courtesy of Troy Shu’s blog post, which is to do joins on events where you match up rows by user pseudo-ids, but only when the timestamp of the second event is greater than that of the first.

Here, let’s take a look. Here’s a query where we’re looking for everybody who opened the app, and then used extra steps at a later point.

SELECT event_name, user_pseudo_id, event_timestamp, e2_name, 
e2_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
LEFT JOIN (
SELECT event_name as e2_name, user_pseudo_id as e2_id,
event_timestamp as e2_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "use_extra_steps"
)
ON user_pseudo_id = e2_id
AND event_timestamp < e2_timestamp
WHERE event_name = "first_open"

We can then do this a third time and look for people who opened the app, then used extra steps, then completed five levels.

SELECT event_name, user_pseudo_id, event_timestamp, e2_name,   
e2_timestamp, e3_name, e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
LEFT JOIN (
SELECT event_name as e2_name, user_pseudo_id as e2_id,
event_timestamp as e2_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "use_extra_steps"
)
ON user_pseudo_id = e2_id
AND event_timestamp < e2_timestamp
LEFT JOIN (
SELECT event_name as e3_name, user_pseudo_id as e3_id,
event_timestamp as e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "completed_5_levels"
)
ON e3_id = e2_id
AND e2_timestamp < e3_timestamp
WHERE event_name = "first_open"

Note that we’re once again running into the duplicate events issue we ran into earlier. Here’s that same user we encountered earlier who asked for extra steps twice before completing five levels.

The question is, how should we count this? Once again, this might be up to you and how you want to interpret this funnel. If you feel like this should count as one first_open event, three use_extra_steps events, and one completed_5_levels event, you can easily do that by only counting events with a distinct timestamp:

SELECT COUNT(DISTINCT event_timestamp) AS f1_events, 
COUNT(DISTINCT e2_timestamp) AS f2_events,
COUNT(DISTINCT e3_timestamp) AS f3_events FROM (
SELECT event_name, user_pseudo_id, event_timestamp, e2_name,
e2_timestamp, e3_name, e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
LEFT JOIN (
SELECT event_name as e2_name, user_pseudo_id as e2_id,
event_timestamp as e2_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "use_extra_steps"
)
ON user_pseudo_id = e2_id
AND event_timestamp < e2_timestamp
LEFT JOIN (
SELECT event_name as e3_name, user_pseudo_id as e3_id,
event_timestamp as e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "completed_5_levels"
)
ON e3_id = e2_id
AND e2_timestamp < e3_timestamp
WHERE event_name = "first_open"
)

But if you feel like each of these events should only be counted once (basically the way we did above) the easiest way might be to count distinct users.

SELECT 
COUNT(DISTINCT user_pseudo_id) AS f1_users,
COUNT(DISTINCT e2_id) AS f2_users,
COUNT(DISTINCT e3_id) AS f3_users FROM (
SELECT event_name, user_pseudo_id, event_timestamp, e2_id,
e2_name, e2_timestamp, e3_name, e3_id, e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
LEFT JOIN (
SELECT event_name AS e2_name, user_pseudo_id AS e2_id,
event_timestamp AS e2_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "use_extra_steps"
)
ON
user_pseudo_id = e2_id
AND event_timestamp < e2_timestamp
LEFT JOIN (
SELECT event_name AS e3_name, user_pseudo_id AS e3_id,
event_timestamp AS e3_timestamp
FROM `firebase-public-project.analytics_153293282.events_20181003`
WHERE event_name = "completed_5_levels"
)
ON
e3_id = e2_id
AND e2_timestamp < e3_timestamp
WHERE event_name = "first_open"
)

Why are there more f2_users this time around? It’s because there are two users who fit the pattern of first_open -> completed_5_levels -> use_extra_steps. If you think about it, they should qualify using the above logic of “People who used extra steps sometime after first_open”, but they don’t quality if we use our logic of “People who hit first_open and then use_extra_steps in that order” from the first half of the article. Again, the “right” answer here depends on what you’re trying to measure.

So which way is better, between the LEFT JOIN approach and the LEAD OVER method? I can’t really say for sure. I’m no expert in BigQuery efficiency, but it looks like the LEAD OVER approach ended up taking less time overall when I ran this query over 30 days of Analytics data.

But as always with BigQuery, I find there are multiple ways to do things, and I’m usually wrong. So if you have any other thoughts on how to approach this differently, let me know in the comments below!

Also, if you made it all the way to the end of this really gnarly topic, you win five Internet Bonus Points! I promise future topics will be a little less involved.

--

--

Todd Kerpelman
Firebase Developers

Former Game Designer. Current Developer Advocate at Google. Dad. Often sleep deprived.