Computing Session for Event Data Using Google BigQuery
Introduction
In this article, we will show how to compute the sessions IDs of event data using Google BigQuery (it can be done in any SQL:2011) and apply that on a use case: programming sessions in some open source projects.
First, let’s define the notion of a session window.
Session window
We say that two successive events are in the same session if the delay between the two doesn’t exceed a parameter p
, the inactivity period.
For example, we define a session with a parameter p
of inactivity of 2 minutes. And we have three events:
event 1
and event 2
are in the same session because 2020–07–23T12:01:00 UTC — 2020-07-23T12:00:00 UTC <= 2 minutes
. However event 3
and event 2
are not in the same session because 2020–07–23T12:05:00 UTC — 2020-07-23T12:01:00 UTC > 2 minutes
Why do we need to know that ?
Sometimes the events that are generated don’t have a notion of session so the computation of the session ID needs to be done afterwards. In other use cases the parameter p
may vary: the definition of p
is different for a chess game (where users are supposed to take more time before being active) and Fruit Ninja (where users need to be fast).
The knowledge of sessions is very important because it gives us insights on user engagement: a long average session means that users are staying longer on the app.
Problem statement
Given a table of event data and a parameter p
of inactivity we want to compute the session ID of events. An example of input is given in Listing-2.
The session ID of Listing-2 is given in Listing-3.
How do we do that ?
First let’s consider the table in Listing-2 where each line represents an event with:
creation_ts
which defines the timestamp when it occurred.user_id
which defines the user that generated the event.event_id
identifies the event among others.
Let’s put the inactivity period p
of the session to 2 minutes. Figure-1 is an illustration of Listing-2.
We can say, from the definition, that a session ends at event e_i
when the next event e_j
happens after p
, 2 minutes in our example. So, We need first to compute the amount of time that passed between two successive events. A representation of the algorithm is given in Figure-2.
We can see from Figure-2 that the period of time separating two successive sessions is greater than 2 minutes. The session ID of event e_i
can be thought of as the number of red dashed lines seen so far for a particular key, the user_id
in our example.
Let’s consider the code that computes the result of Listing-3.
The first common table expression (CTE) only gives us input data. events_with_inactivity
CTE computes the period separating two successive events thanks to the LAG function which returns for every event e_i
the creation_ts
of the preceeding event e_j
. The order and partitioning LAG is relying on is defined by the WINDOW clause: lines are partitioned using the user_id
(that way the session_id
is local to a user and not global to the whole table) and are sorted in non-decreasing order according to creation_ts
.
In events_with_new_session
CTE session “breaks” are computed (like the red dashed line in Figure-2). is_new_session
is 1
when inactivity_period
is greater than 2
, it’s 0
otherwise.
The local session_id
is computed by doing a cumulative sum on is_new_session
. The window definition is the same as in events_with_inactivity
CTE.
The global session_id
, in the sense that there are no two lines that have the same session_id
and a different user_id
, is computed using a hash on a string composed of user_id
and the local session_id
Analysis of Github commit sessions
In this section we are going to study duration of commit sessions, which can be seen as a programming session, in some open source project, namely: Apache Airflow, Apache Spark, Apache BEAM, Kubernetes, Numpy and Kubeflow.
As shown in Figure-3, the events are the commits. They are ordered by the timestamp when the commit was made. The session is computed with regards to the repository (not the user_id).
We consider bigquery-public-data.github_repos.commits
table which is in BigQuery’s public dataset. We get only the commits we need, more details in Listing-5.
What inactivity period to chose ?
The first and one of the most important decisions to make in our analysis is to chose the inactivity period, the one that is used to decide if two successive commits are in the same programming session. We chose the median of inactivity period which is 32 minutes.
What is the average duration of a programming session per project ?
What is the longest programming session per project ?
Conclusion
In this article we defined the notion of session and how to compute session IDs for event data using Google BigQuery. We applied that to the analysis of github commits for some open source projects.
One of the most important things to define before studying sessions is the inactivity period because it’s related to the product: the inactivity period of an FPS game is not the same as the one for a puzzle game.
Resources
- github archive https://www.gharchive.org/#bigquery