Range-user-retention. What is it? Why use it? How to calculate it?

How organizations know who retains and when.

Raphaël Vannson
Slalom Technology
7 min readFeb 23, 2019

--

Credits: Jehyun Sung

Organizations spend a lot time wondering if they “will see you again” or if they “have been seeing you again”. This post discusses range retention — a popular way to evaluate user retention — why it makes sense and how to calculate it. A link to a complete implementation in a Jupyter notebook is provided in the last section.

1. What is range retention?

1.1 Simple question… simple answer?

Retention is a metric used by organizations to evaluate what fraction of their members come back after first contact. For example: brick and mortar shop owners want to know if customers come back after their first purchase, web platforms want to know if users visit again after they have created an account, etc… So what’s the big deal I hear you ask? Sounds like retention can easily be defined as a single number given by this formula:

1.2 Nope, not so simple.

The problem with this definition of retention is that it is a summary over all users since the organization’s first day of business. It does not say much about the organization’s current ability to get its members to come back, nor does it report the effect of past actions taken by the organization to improve retention.

1.3 Range retention to the rescue!

One way to address this issue is to get more granular: instead of a single number, retention is defined as a table of multiple retention rates. This table tells the story of how multiple user groups (cohorts) retain over time — this is called range retention. In simple terms, it tells us who retains and when.

How does it work? First, a period of interest is selected (ex: last 12 months), that is the period for which we are going to get a retention story. Second, this period is broken down into smaller windows of time (ex: weekly or monthly), (that is how we achieve the when). Finally users are grouped into cohorts, (that is how we achieve the who). Then all we have to do is calculate the retention rate of each user cohort during each window.

Here is a simple example: say our period of interest is last year, broken down by month (the windows) and that a user’s cohort is the month when they made first contact. Then range retention is going to be a table containing these retention rates:

  • Users who made first contact in January: what proportion came back 1, 2 … 11 months later?
  • Users who made first contact in February: what proportion came back 1, 2 … 10 months later?
  • (…)
  • Users who made first contact in November: what proportion came back 1 month later?

You can see we are about to calculate a triangular table containing relevant retention rates in the cells shaded in green and NAs in elsewhere.

2. Calculating the range retention table

This section, I will detail an approach to calculate the range retention table. To make things less theoretical, we will be using an example: we want to get to the retention story of a ficticious store by analyzing its orders data.
Our period of interest is the year 2004, our windows are going to be monthly and a user cohort is the month when they made their first purchse at the store.

Note: data has been filtered to retain only customers who made first contact in 2004, we want to limit our analysis to new customers.

2.1 Raw data

Let’s start with a table containing the orders data. While all we need for our pupose is a customer ID and an order date, our raw orders data has more information and looks like this:

2.2 Add column capturing the window ID

It all starts simply by creating one column capturing the window ID. For our use case, this is going to be the month extracted from theorderdate.

2.3 Define the cohorts

Let’s assign each user to a cohort. For us the cohort ID is going to be the month of the first purchase so we will name our cohort ID column join_month (the month the user joined the store). We create a table mapping each unique customerid with its join_month.

2.4 Get the cohort sizes

Next, we want to get the number of unique customerid in each cohort. The result is a simple table containing 12 rows, one per join_month (the cohort ID).

Note: while there is a cohort for month 12, we will not be reporting any retention values for it since we are looking at 12 months of data (we would need to have 13 months to report how cohort #12 is doing on month 12 + 1).

2.5 Assign a cohort ID and activity index to each interaction

The next step is to go back to the raw orders data and add a couple of useful columns to it. For each order row, we will want to add:
join_month (based on the customerid and the first table we created) and
activity_index (based on order date month and joined_month).

We will filter out rows where activity_index = 0. The activity_index captures how many windows after join_monththe order was placed at. Example: for a join_month of Janurary (1), an order placed in March will have an activity_index of 2 (activity is 2 months after first contact).

We now have a table containing one row per order with columns identifying which cohort placed the order (join_month) and the order date as a number of months after first contact (activity_index).

2.6 Get the activity size

With a bit of aggregation, it is easy to count the number of customers of each cohort who were active n months after joining.

Note: if some customers of a particular cohort did not return at all during an entire month you would want to fill in the blanks to ensure you have a 0 for the missingjoin_month and activity_index combinations (although you may no longer have a business at this point…).

2.7 Form the range retention table

This is a last part, we join the counts (cohort size and activity size) and make a simple division to get the retention percentages.

For the final touches: pivot the table, add some color and you’re done!

Note: the 0s in the lower triangle are in fact NAs which have been replaced to allow for the nice cell-by-cell shading, see this issue for details.

3. Turning the table into a retention story

Let’s look at the first row, it deals with a specific cohort of users: the customers who made their first purchase in January. This cohort has 979 distinct customers. The green values can be read from left to right to see which percentage of these customers made (at least) one purchase 1, 2, … 11 months after first contact (4.39% , 4.7%, …5.01%) — that is the evolution of retention for this particular user cohort over the period of interest (2004). Realistically, for our store use case, we want this to stay leveled over time after an inital decrease. But we can imagine a company in the business of releasing products with a limited lifetime (ex: free mobile games). In this case, it is normal and acceptable to lose users steadily during the lifetime of the product (as long as the next product which will attract users again is released soon enough).

Now, let’s look at the first green column from the left, instead of being for a fixed cohort, it is for a fixed activity_index. That is quite useful to see if customers come back one month after their first purchase and how that is evolving throughout the period of interest (2004). Ideally, this is something we slowly but constantly get better at.

4. The best stuff is always free

Wanna get hands-on and see some code? We’re glad you asked! This Notebook provides a complete code example in Python (use Github’s “try again” if the notebook fails to load — don’t ask me why this happens sometimes). The easiest way to get all the artifacts locally is to clone this Github repo.

$ git clone https://github.com/slalom/medium-rangeretention.git

Thanks for reading. Please come again!

--

--

Raphaël Vannson
Slalom Technology

Analytics engineer and wannabe scientist. I design and implement statistical analyses and dashboards (ideally on big data).