Using Google BigQuery for Event Tracking

Until recently, we were using some of the popular analytics tools (Mixpanel and Google Analytics) to track the usage of Streak. The visual interfaces provided by these tools make it easy for non-technical users to see trends in the usage of their websites. However, this ease-of-use comes at a price: you can’t ask arbitrary queries. We found ourselves asking questions that were not answerable in the one-size-fits-all interfaces provided by these tools.

Thus, after Google BigQuery’s recent addition of realtime row ingestion, we decided to move our event log over to BigQuery. We have found BigQuery to have several benefits over our previous solutions:

  1. We can ask sophisticated queries that were not possible in the other tools, and get answers on historical data immediately.
  2. It is cheaper than the other solutions, so much so that we put all of our data into BigQuery without sampling. We sampled our data previously to save money.
  3. We can join our event data to the rest of the data we have in BigQuery: user data, server logs, and so on.

As mentioned earlier, we use the streaming ingestion feature of BigQuery, this allows us to send just a few events at time to our event tracking tables in BigQuery. This is awesome because there is no need to store these events in temporary storage and then batch them together into one file and upload to an analysis tool. Even better, these events are ready to be queried within minutes of streaming them.

However, we don’t send the events directly from the client to BigQuery, instead we bounce them off our server. We do this so that our server can add additional information to these events (ip address, email address, userKeys, etc). It’s also good because we don’t have to expose our auth tokens to the individual clients (our web app, iOS app and android app).

But what about advanced queries?

When we first explored BigQuery as an alternative, we were unsure whether we would be able to use it for funnel analysis, i.e. to be able to determine what percentage of users did a sequence of events, or a prefix of those events. It turns out that BigQuery can handle funnel queries just fine, and in fact it is easy to run funnel queries on subsets of the data as well.

As an example, let’s say we want to know what percentage of users did the following sequence of actions, or started to do those actions and then stopped:

  1. visit our homepage
  2. install the Streak extension
  3. sign in to Streak

For simplicity, let’s say that our events log, [events.log], has the following schema:

timestamp: INTEGER — a UNIX timestamp
event: STRING — the name of the event
sessionId: STRING — a unique identifier for the user’s session

And we have the following sessions and events:

Session 1: visit homepage, visit homepage
Session 2: visit homepage, install extension
Session 3: visit homepage, install extension, sign in

Represented by the following table:

╔═══════════╦═══════════╦══════════════╗
║ timestamp ║ sessionId ║ event ║
╠═══════════╬═══════════╬══════════════╣
║ 1 ║ S1 ║ visitHome ║
║ 2 ║ S2 ║ visitHome ║
║ 3 ║ S2 ║ install ║
║ 4 ║ S3 ║ visitHome ║
║ 5 ║ S3 ║ install ║
║ 6 ║ S3 ║ signIn ║
║ 7 ║ S1 ║ visitHome ║
╚═══════════╩═══════════╩══════════════╝

We want a query that will take that table and give us a single row where each field represents the number of unique sessions that reached that step:

╔══════════╦═══════════════════╦═════════════════╦════════════════╗
║ sessionId║ visitHomeTimestamp║ installTimestamp║ signInTimestamp║
╠══════════╬═══════════════════╬═════════════════╬════════════════╣
║ S1 ║ 1 ║ null ║ null ║
║ S1 ║ 7 ║ null ║ null ║
║ S2 ║ 2 ║ 3 ║ null ║
║ S3 ║ 4 ║ 5 ║ 6 ║
╚══════════╩═══════════════════╩═════════════════╩════════════════╝

Then, we could write a query as follows:

SELECT COUNT(firstVisitHomeTimestamp) AS numVisitHome,
COUNT(firstInstallTimestamp) AS numInstall,
COUNT(firstSignInTimestamp) AS numSignIn
FROM (SELECT sessionId,
visitHomeTimestamp,
installTimestamp,
signInTimestamp
FROM steps1_2_3
GROUP BY sessionId)

The inner SELECT takes the results of steps1_2_3 and gives us with one row per session, with the first time an event happened during the session. In our example, this gives:

╔══════════╦═══════════════════╦═════════════════╦════════════════╗
║ sessionId║ visitHomeTimestamp║ installTimestamp║ signInTimestamp║
╠══════════╬═══════════════════╬═════════════════╬════════════════╣
║ S1 ║ 1 ║ null ║ null ║
║ S2 ║ 2 ║ 3 ║ null ║
║ S3 ║ 4 ║ 5 ║ 6 ║
╚══════════╩═══════════════════╩═════════════════╩════════════════╝

Then, the outer subquery counts up the number of non-null rows for each event, giving us our final desired result.

Thus, we have reduced our problem to producing steps1_2_3.

We can create a subquery to determine the visitHome events:

(SELECT sessionId as sessionId1,
timestamp as timestamp1
FROM [events.log]
WHERE name = "visitHome") AS step1

And similarly the other install events:

(SELECT sessionId as sessionId2,
timestamp as timestamp2
FROM [events.log]
WHERE name = "install") AS step2

And finally the other signIn events:

(SELECT sessionId as sessionId3,
timestamp as timestamp3
FROM [events.log]
WHERE name = "siginIn") AS step3

Then we want to left-join that with the install event:

(SELECT sessionId1,
timestamp1,
timestamp2
FROM step1
LEFT JOIN step2
ON sessionId1 = sessionId2 AND timestamp1 < timestamp2

However, BigQuery does not currently allow for inequalities in LEFT JOIN ON clauses. We can use IF expressions instead:

(SELECT sessionId1,
timestamp1,
IF(timestamp1 < timestamp2, timestamp2, NULL) as timestamp2
FROM
(SELECT sessionId1,
timestamp1,
timestamp2
FROM step1
LEFT JOIN step2
ON sessionId1 = sessionId2)
) AS steps1_2

Now do the same for the signedIn event:

(SELECT sessionId1 as sessionId,
timestamp1 as visitHomeTimestamp,
timestamp2 as installTimestamp,
IF(timestamp2 < timestamp3, timestamp3, NULL) as signInTimestamp
FROM
(SELECT sessionId2,
timestamp2,
timestamp3
FROM steps1_2
LEFT JOIN step3
ON sessionId1 = sessionId3)
) AS steps1_2_3

Obtaining the subquery we desire!

Note: for large tables, you’ll want to use LEFT JOIN EACH and GROUP BY EACH to use BigQuery’s more scalable (but more resource-intensive) algorithms. Also note that it is possible that other events are interleaved between the events we are querying for, which is not an issue for us but it something to be aware of when using this technique.

Funnel queries are complex (imagine a 10-step funnel), and writing them by hand is cumbersome. To make this easy, we added functionality to the Streak BigQuery Developer Tools Chrome extension (which is open source). Using the extension, you go to the BigQuery Web UI and write something like:

funnel
table [events.log]
timestampColumn timestamp
joinColumn sessionId
nameColumn event
steps visitHome,install,signIn

Then press the Expand Templates button to generate your funnel. You can graph the results using the extension as well. We hope you find this useful.