Computing Session for Event Data Using Google BigQuery

Massy Bourennani
The Startup
Published in
4 min readJul 27, 2020

Introduction

In this article, we will show how to compute the sessions IDs of event data using Google BigQuery (it can be done in any SQL:2011) and apply that on a use case: programming sessions in some open source projects.

First, let’s define the notion of a session window.

Session window

We say that two successive events are in the same session if the delay between the two doesn’t exceed a parameter p, the inactivity period.

For example, we define a session with a parameter pof inactivity of 2 minutes. And we have three events:

Listing-1: Example of event data

event 1and event 2are in the same session because 2020–07–23T12:01:00 UTC — 2020-07-23T12:00:00 UTC <= 2 minutes . However event 3and event 2 are not in the same session because 2020–07–23T12:05:00 UTC — 2020-07-23T12:01:00 UTC > 2 minutes

Why do we need to know that ?

Sometimes the events that are generated don’t have a notion of session so the computation of the session ID needs to be done afterwards. In other use cases the parameter p may vary: the definition of p is different for a chess game (where users are supposed to take more time before being active) and Fruit Ninja (where users need to be fast).

The knowledge of sessions is very important because it gives us insights on user engagement: a long average session means that users are staying longer on the app.

Problem statement

Given a table of event data and a parameter p of inactivity we want to compute the session ID of events. An example of input is given in Listing-2.

Listing-2: Event data

The session ID of Listing-2 is given in Listing-3.

Listing-3: Event data with session ID

How do we do that ?

First let’s consider the table in Listing-2 where each line represents an event with:

  • creation_ts which defines the timestamp when it occurred.
  • user_id which defines the user that generated the event.
  • event_id identifies the event among others.

Let’s put the inactivity period p of the session to 2 minutes. Figure-1 is an illustration of Listing-2.

Figure-1: Event data

We can say, from the definition, that a session ends at event e_i when the next event e_j happens after p, 2 minutes in our example. So, We need first to compute the amount of time that passed between two successive events. A representation of the algorithm is given in Figure-2.

Figure-2: Events with session ID

We can see from Figure-2 that the period of time separating two successive sessions is greater than 2 minutes. The session ID of event e_i can be thought of as the number of red dashed lines seen so far for a particular key, the user_id in our example.

Let’s consider the code that computes the result of Listing-3.

Listing-4: Computing session ID from event data

The first common table expression (CTE) only gives us input data. events_with_inactivity CTE computes the period separating two successive events thanks to the LAG function which returns for every event e_i the creation_ts of the preceeding event e_j . The order and partitioning LAG is relying on is defined by the WINDOW clause: lines are partitioned using the user_id (that way the session_id is local to a user and not global to the whole table) and are sorted in non-decreasing order according to creation_ts .

In events_with_new_session CTE session “breaks” are computed (like the red dashed line in Figure-2). is_new_session is 1 when inactivity_period is greater than 2 , it’s 0 otherwise.

The local session_id is computed by doing a cumulative sum on is_new_session . The window definition is the same as in events_with_inactivity CTE.

The global session_id, in the sense that there are no two lines that have the same session_id and a different user_id , is computed using a hash on a string composed of user_id and the local session_id

Analysis of Github commit sessions

In this section we are going to study duration of commit sessions, which can be seen as a programming session, in some open source project, namely: Apache Airflow, Apache Spark, Apache BEAM, Kubernetes, Numpy and Kubeflow.

As shown in Figure-3, the events are the commits. They are ordered by the timestamp when the commit was made. The session is computed with regards to the repository (not the user_id).

Figure-3: An example of a programming session on some open source projects

We consider bigquery-public-data.github_repos.commits table which is in BigQuery’s public dataset. We get only the commits we need, more details in Listing-5.

Listing-5: GitHub commits of open source projects
Listing-6: Extract of some commits

What inactivity period to chose ?

The first and one of the most important decisions to make in our analysis is to chose the inactivity period, the one that is used to decide if two successive commits are in the same programming session. We chose the median of inactivity period which is 32 minutes.

Listing-7: Median of inactivity period (excluding lines with 0 duration)

What is the average duration of a programming session per project ?

Listing-8: Average duration of a programming session per project (excluding one commit sessions)
Listing-9: Results of average duration of a programming session per project

What is the longest programming session per project ?

Listing-10: Longest programming session per project
Listing-11: Result of longest programming session per project

Conclusion

In this article we defined the notion of session and how to compute session IDs for event data using Google BigQuery. We applied that to the analysis of github commits for some open source projects.

One of the most important things to define before studying sessions is the inactivity period because it’s related to the product: the inactivity period of an FPS game is not the same as the one for a puzzle game.

Resources

--

--