Rolling growth metrics in SQL

Chuong Do
Sep 25, 2017


At Coursera, growth metrics are one of the many tools we use to understand business health. Measures like the number of active users in the past 28 days, or activation, retention, and reactivation rates, provide simple indicators of business performance. When combined with driver analysis, growth metrics provide a data-driven way to prioritize internal company investments based on their potential impact on business outcomes.

Practically, however, SQL queries for computing growth metrics on a daily rolling basis are often very slow. Although simply waiting longer is generally the lowest-effort option, in my own experience I’ve seen time and again how slow computations hamper the creative process for data analysts and reduce the likelihood of identifying meaningful data insights.

In this technical post, I’ll share a few tricks we use at Coursera to make calculating rolling growth metrics fast and easy. Hopefully, you’ll find these techniques applicable and useful in your own work!

Image by Luis Llerena via unsplash.com

Naive approach

To start with, suppose we have the following three tables:
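
For concreteness, here’s one plausible shape for these tables; the column names (registration_dt, activity_dt, dt) are assumptions, chosen to match how the tables are used in the sketches below:

```sql
CREATE TABLE users (
  user_id         INT,    -- unique user identifier
  registration_dt DATE    -- date the account was created
);

CREATE TABLE daily_activity (
  -- one row per (user_id, activity_dt) pair
  user_id     INT,        -- user who was active
  activity_dt DATE        -- calendar date with at least one activity
);

CREATE TABLE calendar (
  dt DATE                 -- one row per calendar date
);
```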

Furthermore, we’ll define the following four 28-day trailing growth metrics of interest:

  • active users: The number of unique users active (at least once) in the last 28 days.
  • first-month activation rate: The proportion of new registrants from the last 28 days who have been active at least once.
  • month-to-month retention rate: The proportion of active users in the last 28 days among those who were active in the preceding 28-day period (i.e., between 55 and 28 days ago).
  • month-to-month reactivation rate: The proportion of active users in the last 28 days among those who were not active in the preceding 28-day period.

The following monolithic query conveniently computes all of these metrics in one fell swoop:
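
A minimal sketch of such a query, assuming the schemas above (the denominator conventions for the rates are one reasonable choice among several):

```sql
WITH trailing_activity AS (
  -- all (user_id, dt) pairs such that the user was active at least
  -- once in the 28-day window ending on dt
  SELECT DISTINCT a.user_id, c.dt
  FROM daily_activity a
  JOIN calendar c
    ON c.dt BETWEEN a.activity_dt AND a.activity_dt + 27
),
user_states AS (
  SELECT
    c.dt,
    u.user_id,
    (u.registration_dt BETWEEN c.dt - 27 AND c.dt) AS is_new_registrant,
    (cur.user_id  IS NOT NULL)                     AS active_now,
    (prev.user_id IS NOT NULL)                     AS active_prev
  FROM calendar c
  JOIN users u
    ON u.registration_dt <= c.dt        -- one row per (date, existing user)
  LEFT JOIN trailing_activity cur
    ON cur.user_id = u.user_id AND cur.dt = c.dt
  LEFT JOIN trailing_activity prev
    ON prev.user_id = u.user_id AND prev.dt = c.dt - 28
)
SELECT
  dt,
  SUM(CASE WHEN active_now THEN 1 ELSE 0 END) AS active_users,
  -- among registrants from the last 28 days, fraction active
  AVG(CASE WHEN is_new_registrant
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS activation_rate,
  -- among users active in the preceding window, fraction still active
  AVG(CASE WHEN active_prev
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS retention_rate,
  -- among users not active in the preceding window, fraction now active
  AVG(CASE WHEN NOT active_prev
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS reactivation_rate
FROM user_states
GROUP BY dt
ORDER BY dt;
```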

This query works by first precomputing all of the dates on which each user might be considered trailing 28-day active. Given these dates, the main query joins the calendar table with the trailing_activity table twice to determine which users were present in the current and previous 28-day periods. The use of the calendar table ensures that the query doesn’t skip days with no active users (a common occurrence when restricting the above query to specific small user cohorts).

An alternative approach

The key idea we’ll use to speed up the basic query is as follows. Instead of precomputing each date on which each user is 28-day active, we’ll precompute all “date intervals” over which each user is 28-day active:
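
One plausible shape for this table (column names are assumptions; end_dt is inclusive, as the example below illustrates):

```sql
CREATE TABLE trailing_activity_intervals (
  user_id  INT,    -- the user
  start_dt DATE,   -- first date on which the user counts as 28-day active
  end_dt   DATE    -- last date (inclusive) on which the user counts as 28-day active
);
```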

For example, suppose that a user is active on days 32, 33 and 37. This activity would be represented in the table above as a single row with start_dt = 32 (since day 32 is the first day with activity) and end_dt = 64 (since the user’s activity on day 37 would still be within the 28-day trailing window as of day 64). However, the user would be 28-day inactive as of day 65.

To get an intuitive sense for what this table looks like, consider the following diagram, in which each unique user is represented as a single row. Black segments represent date intervals over which the user is 28-day active, and blue segments represent date intervals over which the user is 28-day inactive. The left endpoint of each row represents the date on which the user’s account was created, and the right endpoint represents the last full day with activity data for all users (which we’ll assume to be CURRENT_DATE - 1):

The rows of trailing_activity_intervals correspond to the black segments. Computing 28-day actives on any given date involves counting the number of active intervals (black segments) that overlap that date (as represented by the dashed vertical line). Other growth metrics can be computed similarly.

In code, the rolling calculations are similar to what we had previously, but using a BETWEEN join on trailing_activity_intervals instead of an equality join:
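
A sketch under the same assumptions as before; note that only the two joins change, while the final aggregation is identical:

```sql
WITH user_states AS (
  SELECT
    c.dt,
    u.user_id,
    (u.registration_dt BETWEEN c.dt - 27 AND c.dt) AS is_new_registrant,
    (cur.user_id  IS NOT NULL) AS active_now,
    (prev.user_id IS NOT NULL) AS active_prev
  FROM calendar c
  JOIN users u
    ON u.registration_dt <= c.dt
  -- a user matches at most one interval per date, so these joins
  -- still produce at most one row per (date, user)
  LEFT JOIN trailing_activity_intervals cur
    ON  cur.user_id = u.user_id
    AND c.dt BETWEEN cur.start_dt AND cur.end_dt
  LEFT JOIN trailing_activity_intervals prev
    ON  prev.user_id = u.user_id
    AND c.dt - 28 BETWEEN prev.start_dt AND prev.end_dt
)
SELECT
  dt,
  SUM(CASE WHEN active_now THEN 1 ELSE 0 END) AS active_users,
  AVG(CASE WHEN is_new_registrant
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS activation_rate,
  AVG(CASE WHEN active_prev
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS retention_rate,
  AVG(CASE WHEN NOT active_prev
           THEN CASE WHEN active_now THEN 1.0 ELSE 0.0 END END) AS reactivation_rate
FROM user_states
GROUP BY dt
ORDER BY dt;
```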

So, how do we compute trailing_activity_intervals?

We’ll use the following 4-step process:

  1. Define events. For each (user_id, activity_dt) in the daily_activity table, we’ll create a pair of “events” corresponding to the date when the activity occurs (and hence should first be counted as part of a 28-day rolling window), and 28 days later when the activity expires (after which the activity no longer counts toward any subsequent 28-day rolling window). We’ll mark these two events with scores +1 and -1 (to be used in the next step). We’ll also create an event with score 0 on the user’s registration date.
  2. Compute rolling states. Taking the cumulative sum of the scores through each event date gives the number of days the user was active over the trailing 28-day period. Checking whether this sum is greater than zero tells us whether the user was 28-day active as of that event date.
  3. Deduplicate states. For parsimony, we’ll remove redundant events where the user’s 28-day activity status has not changed.
  4. Convert to intervals. Finally, we’ll convert the remaining events into activity intervals.

These four steps are illustrated below:

Here’s what this looks like in code:
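
A minimal sketch, assuming the schemas above and PostgreSQL/Redshift-flavored SQL (a production version would likely differ in details):

```sql
CREATE TABLE trailing_activity_intervals AS
WITH
-- Step 1: define events. +1 when an activity enters the trailing
-- window, -1 when it expires 28 days later, 0 on registration.
events AS (
  SELECT user_id, activity_dt      AS dt,  1 AS score FROM daily_activity
  UNION ALL
  SELECT user_id, activity_dt + 28 AS dt, -1 AS score FROM daily_activity
  UNION ALL
  SELECT user_id, registration_dt  AS dt,  0 AS score FROM users
),
-- Step 2: compute rolling states. The running sum of scores is the
-- number of active days in the trailing 28-day window as of each
-- event date (the inner SUM first collapses same-day events).
states AS (
  SELECT
    user_id,
    dt,
    (SUM(SUM(score)) OVER (PARTITION BY user_id ORDER BY dt
                           ROWS UNBOUNDED PRECEDING) > 0) AS is_active
  FROM events
  GROUP BY user_id, dt
),
-- Step 3: deduplicate states. Keep only event dates where the
-- 28-day activity status actually changes.
deduped AS (
  SELECT user_id, dt, is_active
  FROM (
    SELECT
      user_id, dt, is_active,
      LAG(is_active) OVER (PARTITION BY user_id ORDER BY dt) AS prev_active
    FROM states
  ) s
  WHERE prev_active IS NULL OR is_active <> prev_active
),
-- Step 4: convert to intervals. Each state change opens an interval
-- that closes the day before the next state change, capped at the
-- last full day of data; keep only the active intervals.
intervals AS (
  SELECT
    user_id,
    dt AS start_dt,
    LEAD(dt) OVER (PARTITION BY user_id ORDER BY dt) AS next_change_dt,
    is_active
  FROM deduped
)
SELECT
  user_id,
  start_dt,
  LEAST(COALESCE(next_change_dt - 1, CURRENT_DATE - 1),
        CURRENT_DATE - 1) AS end_dt
FROM intervals
WHERE is_active;
```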

The benefits

Precomputing an interval-based table does complicate the code a bit, but here’s why we have found it useful at Coursera:

  • It’s fast. For 28-day trailing activity, our approach is a bit faster than the naive approach even when including the precomputation costs. On our data, for example, typical timings are 73s (precomputation) + 21s (improved query) < 161s (naive approach); when computing only active users, the improved query can be simplified to run in 5s. But the real benefit here comes from materializing the precomputed table once and reusing it in multiple (interactive) analyses.
  • It’s space efficient. Technically, you could also try to materialize trailing_activity directly, but this wastes a lot of space. Because of the compression provided by the interval representation, trailing_activity_intervals has roughly 46x fewer rows in our data than trailing_activity (and roughly 6x fewer rows than daily_activity).
  • It scales well. Because the pre-computation of trailing_activity_intervals only touches event dates when activity status might change, it doesn’t slow down for longer rolling windows (e.g., 365 days) — in fact, it actually gets faster since the resulting table tends to have fewer rows. In contrast, the BETWEEN join used in the naive approach gets increasingly expensive (both timewise and spacewise) as the rolling window length increases.
  • It’s extensible. The interval approach can be extended to compute other types of metrics efficiently as well. For example, suppose you were interested in the number of active users in the past 28 days who had been active on at least 7 different days. This can also be done with interval tables (see Appendix). For Coursera, interval tables are an easy way to keep track of a variety of per-user rolling metrics such as the number of recent enrollments, the number of lecture videos recently watched, the number of assignments recently submitted, or the amount of money recently spent.
  • It’s convenient. Beyond the efficiency benefits, perhaps the best thing about precomputed interval tables is their convenience. Interval tables simplify queries that would otherwise require carefully thinking through tricky BETWEEN joins or window functions.

So that’s it! I hope you find this technique useful. Happy querying!

Appendix: Computing Xd28+ engagement metrics

Let’s precompute a table that makes Xd28+ (i.e., the number of users who have been active on at least X days in the past 28) easy to compute for any value of X:
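
One way to sketch this, reusing the event trick from the four-step process: instead of thresholding the running sum at zero, keep the running count itself, yielding one row per interval over which a user’s trailing-28-day active-day count stays constant. The table and column names (trailing_activity_count_intervals, days_active) are assumptions:

```sql
CREATE TABLE trailing_activity_count_intervals AS
WITH events AS (
  SELECT user_id, activity_dt      AS dt,  1 AS score FROM daily_activity
  UNION ALL
  SELECT user_id, activity_dt + 28 AS dt, -1 AS score FROM daily_activity
),
counts AS (
  -- days_active = number of active days in the trailing 28-day
  -- window as of each event date
  SELECT
    user_id,
    dt,
    SUM(SUM(score)) OVER (PARTITION BY user_id ORDER BY dt
                          ROWS UNBOUNDED PRECEDING) AS days_active
  FROM events
  GROUP BY user_id, dt
),
intervals AS (
  SELECT
    user_id,
    dt AS start_dt,
    LEAD(dt) OVER (PARTITION BY user_id ORDER BY dt) AS next_change_dt,
    days_active
  FROM counts
)
SELECT
  user_id,
  start_dt,
  LEAST(COALESCE(next_change_dt - 1, CURRENT_DATE - 1),
        CURRENT_DATE - 1) AS end_dt,
  days_active
FROM intervals
WHERE days_active > 0
  -- expiry events can fall in the future; drop intervals that
  -- haven't started as of the last full day of data
  AND start_dt <= CURRENT_DATE - 1;
```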

Then, 7d28+ can be computed using the following query:
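
Assuming the interval table sketched above:

```sql
-- Number of users on each date with at least 7 active days in the past 28.
SELECT
  c.dt,
  COUNT(DISTINCT i.user_id) AS users_7d28
FROM calendar c
LEFT JOIN trailing_activity_count_intervals i
  ON  c.dt BETWEEN i.start_dt AND i.end_dt
  AND i.days_active >= 7
GROUP BY c.dt
ORDER BY c.dt;
```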
