Measuring Engaged Time from Discrete Events

Published in

Udemy Tech Blog

8 min readNov 20, 2021

Summary

At Udemy we are interested in measuring user engagement with course components like quizzes. In order to measure engaged time by user, we cluster closely-spaced click events into sessions. This technique can be applied to any situation where you want to measure engaged time, but only have access to discrete events.

Background and Motivation

“Engaged time” is an important metric for many websites like Udemy; it can show you where users spend their time, and where and when they are losing interest. A news site, for example, would be interested in knowing which articles capture the most attention. We use engaged time as a metric for two reasons:

Internal analytics: we want to measure how much time users spend on courses and product features as a gauge of user interest, to see whether it correlates with positive learner outcomes, to provide feedback to instructors, and to get a feel for how long typical course units take to complete.
Instructor payments: payment to an instructor with courses in one of our subscription products is proportional to the amount of time users spend on that instructor’s courses.

Many web frameworks record data in logs of discrete events that fire either periodically (a ping from the web host to the user), or upon user action (like clicking a button). This shows you a collection of snapshots of a user at various points in time, but how do you get from there to estimating how long that user was engaged?

One activity that we’re interested in measuring at Udemy is user engagement with quizzes within our courses. As an illuminating example, this blog post will walk through how we measured engaged time with quizzes. Our quizzes typically have several questions and can be completed in 10–20 minutes. But we wanted to know what fraction of users were struggling, or giving up early, or even just randomly guessing at the questions and finishing in 30 seconds. One solution would be to implement a timer on the client side and then have that timer report the time upon quiz completion. But this wasn’t available to us for technical reasons. The data we did have was a log of discrete events, with timestamps, that fired each time a user submitted a quiz question.

Goal: Engagement Sessions

When a user is taking a multi-question quiz, we receive an event every time the user submits a quiz answer. But we have to guess at what the user is doing between events. Did the user walk away from her computer, or was she reading a question or thinking about an answer? We’ll assume that events that are close together are from the same session, and that a long gap means the user was not active. The process of converting discrete events into sessions representing user activity is called sessionization. When measuring minutes consumed for a given user, we simply sum up the session durations as illustrated in the below diagram.

This approach raises two questions — the first definitional and the second operational:

How close together do events have to be to be part of the same session?
How to efficiently measure time between events?

These questions are addressed in the sections that follow.

Setting Parameters: How close together do events have to be to be part of the same session?

How long after the last event was the user active? Let’s establish two parameters:

Patience: how long to wait until the next event before declaring the session over.
Buffer: how long to extend this session after the last event.

How should we set the parameters? There is some flexibility here. We decided to set patience to be long enough to capture ~90% of quiz event pairs. We set the buffer to be roughly the median time between events, on the assumption that engagement after the last event is similar in duration to typical engagement between events. Looking at a sample of quiz submission events, we drew up a cumulative distribution function for time between event pairs (see figure below — exact times obscured).

How to efficiently measure time between events?

In this section we walk through how to take a stream of discrete events and turn them into session intervals with a start and end time. We begin with an event log SQL table called ‘events,’ sorted by timestamp, as in the table below. In this example, each timestamp marks the time that the user submitted a quiz answer. Then the goal is to end up with a SQL table of session intervals by user called ‘sessions.’ See tables below:

╔════════════╗     ╔═══════════════╗
║ events     ║     ║ sessions      ║
╠════════════╣     ╠═══════════════╣
║ user_id    ║ --> ║ user_id       ║
║ event_time ║     ║ session_start ║
╚════════════╝     ║ session_end   ║
                   ╚═══════════════╝

Naive Approach

To extract sessions from events, a naive approach would be to join the events table to itself and calculate the pairwise time between each pair of events. Then assign events that are less than {patience} minutes apart to the same session.

Time between events for a single user:
╔═════════╦═════════╦═════════╦═════════╗
║         ║ Event 2 ║ Event 3 ║ Event 4 ║
╠═════════╬═════════╬═════════╬═════════╣
║ Event 1 ║ 15s     ║ 45s     ║ 90s     ║
╠═════════╬═════════╬═════════╬═════════╣
║ Event 2 ║         ║ 30s     ║ 75s     ║
╠═════════╬═════════╬═════════╬═════════╣
║ Event 3 ║         ║         ║ 45s     ║
╚═════════╩═════════╩═════════╩═════════╝

To calculate time between events in SQL would look something like this:

SELECT
  e1.user_id,
  e2.event_time — e1.event_time
FROM events e1
JOIN events e2
  ON e1.user_id = e2.user_id
  AND e2.event_time > e1.event_time

With 10 events we would have 100 pairs. Udemy tends to generate more than 1,000,000 events per day — that makes for a lot of pairs! This also happens to be what computer scientists mean when they say O(n²). O(n²) compute time is very slow. We can do better.

Rolling Approach to Sessionization

This more efficient approach only cycles through the event table data for a fixed number of times, so it takes O(n) time to compute. It’s adapted from Rolling Growth Metrics in SQL by Chuong Do.

For this example, assume patience is set to 3 minutes and buffer is set to 15 seconds.

1. Active Windows. Each event becomes a 3-minute interval, during which the user is assumed to be active.

2. Rolling count. At each event timestamp, sum up a rolling count of concurrent active windows.

3. Rolling States. Binarize to true/false whether the user is currently active.

4. Deduplicate States. Only keep a row if it is different from the row before it.

5. Convert to intervals. One row per interval, with start and end time.

6. Subtract patience, add buffer. Session ends 15s after the last event.

In this example, we end up with two intervals approximating the ground truth activity of the user. Notice that the second interval is shorter than the ground truth interval. Since we only had one event to work with, we had to approximate.

Improving Accuracy

A final note: you can improve the accuracy of your session estimates with more frequent event fires. The more often you can check in with users, the less you have to guess what they are doing. The tradeoff is that more frequent event fires may slow down site loading speed.

Conclusion

We hope this post is a useful guide and saves you trouble! At Udemy we have found this sessionization method useful not only for getting from discrete events to sessions; we also applied it at a larger scale to define a set of active users. If we define an “active” user as one who visited the site at least once in the last 30 days, we can apply the same procedure above where the discrete event is a site visit and the patience and buffer are each set to 30 days.

At Udemy, we’re constantly trying to create new possibilities for individuals and organizations everywhere, and we are always looking for curious individuals to help us achieve our mission. Checkout our jobs page if you’re interested in joining!

Appendix: SQL Code

The six-step diagram above is translated into some simplified SQL code to get you started.

WITH/*
Step 1: Each event becomes a 3-minute interval, when the user is assumed to be active. 
*/
active_windows AS
(
   (
       SELECT
         user_id
         , event_time
         , 1 AS activity_delta
       FROM events
   )

   UNION ALL

   (
     SELECT
       user_id
       , event_time + {patience} AS event_time
       , -1 AS activity_delta
     FROM events
   )
)/*
Get sum so there's one row per session per time.  This is just in case there is a +1 and a -1 at the same time.
*/
, consolidate_simultaneous_events AS
(
   SELECT
       user_id
       , event_time
       , SUM(activity_delta) AS activity_delta
   FROM active_windows
   GROUP BY
       user_id
       , event_time
)/*
Steps 2 and 3: Binarize activity counts so that each event time is logged as active or not active.  The result will be for each unit, a string of T's until an F marking that the session is no longer active. 
*/
, rolling_state AS
(
   SELECT
       user_id
       , event_time
       , SUM(activity_delta) OVER (
           PARTITION BY
               user_id
           ORDER BY event_time ASC
           ROWS UNBOUNDED PRECEDING
       ) > 0 AS active
   FROM consolidate_simultaneous_events
)/*
Step 4: For each row, look for earlier states that are the same as this one. If found, mark this row as redundant.  
*/
, deduplicate_states AS
(
   SELECT
       user_id
       , event_time
       , active
       , COALESCE(
           active = LAG(active, 1) OVER (
               PARTITION BY
                   user_id
               ORDER BY event_time ASC
           )
           , FALSE
       ) AS redundant
   FROM rolling_state
)/*
Step 5: At this point there should be two event times for each session.  The first is the session start time and the second is the end time.  If there is no second time, assume user reached end of day and use 11:59:59pm. 
*/
, raw_interval AS
(
   SELECT
       user_id
       , event_time AS first_event_time
       , COALESCE(
           LEAD(event_time, 1) OVER (
               PARTITION BY
                   user_id
               ORDER BY event_time ASC
           ), {end_of_day}
       ) AS last_event_time
       , active
   FROM deduplicate_states
   WHERE NOT redundant
)

/*
Step 6: Subtract the {patience} value that was added on to the last event time, and add the {buffer} value.
*/
SELECT
   user_id
   , first_event_time AS start_time
   , last_event_time - {patience} + {buffer} AS end_time
FROM raw_interval
WHERE active