User retention with cohort analysis

Split your users into weekly/monthly cohorts and calculate the retention for each period

Andreas Tsangarides
Multiply
6 min readSep 18, 2019

--

This is the second part of the series Mobile Analytics using Python, describing how to calculate your key app usage metrics using nothing but python.

Code snippets provided in each section, and the Github repo can be found here.

Objective

What we are trying to calculate in this section are the:

  1. Size of each user cohort (new users per period)
  2. Number of returning users per subsequent period for each cohort
  3. Percentage of returning users per subsequent period for each cohort

Why bother?

Retention analysis is key to understanding if users come back to your app and at what frequency. Inevitably, the percentage of users coming back to the app will decrease with time after their acquisition as some users fail to see the value of the app or maybe they just don’t need it or did not like it.

User Retention

It‘s in your best interest to avoid a steep initial drop-off, and try to have the asymptotic part of the curve as far away from 0 as possible, in order to have as many engaged users as possible.

A single curve, in most cases, would not be insightful enough, as users join at different time periods, before or after certain app features were released, so their first interaction with the app will be different. So retention analysis is more useful when combined with cohort analysis.

In this section, the cohorts will be defined using the user acquisition period. Different insights can be generated by using different cohorts, e.g. segment users depending on some user features, or if they did a specific action as opposed to using the period during which they signed up/got acquired.

Starting Point

Again, I will assume that you have your data in a pandas DataFrame, named events, with the following columns:

distinct_id: distinct user id
name: event name
properties: a dict object with the properties associated with the event
time: timestamp at which the event took place

Dummy data for illustration purposes (Note: “properties“ not needed for this section)

Requirements

  • Python (I am using v3.7)
  • Pandas: for data manipulation
  • Matplotlib, seaborn: for visualisations*
  • A note on matplotlib: Make sure you installed this before you installed pandas, as the other way round might cause rendering issues.

References/ Logic

I will be using Greg Reda ’s methodology as the foundations for the main function that will produce the retention stats table. He has a very clear tutorial with output at each step, so I will not repeat every single step here.

I have adjusted some of the code, however, to my liking and to address an assumption he has made regarding the data (that we have at least one user acquisition per period and at least one action per period per user cohort; this might not necessarily be true at the early stages of a startup). I have also added the extra functionality of being able to measure retention based on a specific event or list of events, as opposed to any user action.

Code

I will highlight key steps and provide output to help to visualise what we are doing, and then everything will be wrapped up in functions. Code snippets in image format are only included here to breakdown the process into steps. All these snippets are included in the functions that you can copy.

1) Calculate the size of each user cohort

Using the function we created in the previous section, we can get the acquisition time for each user and the user_active column. This allows us to calculate the number of acquired users per period.

2) Calculate the number of active users per event_period for each cohort

We need to filter the DataFrame for only acquired users and for events that took place after the acquisition time, and count the number of users per event_period per cohort.

Note that both event_period and cohort are still dates.

3) Fill in missing combinations of cohort and event_period columns

This step needs to be done to ensure that in the event of a cohort for which there were no users active in any of the subsequent periods, then that period will be filled with ‘0’ as opposed to reporting the next period with actions and causing a misalignment in our final retention table.

For example, if during the week starting on ‘2018–10–15’, there were 10 users acquired, and out of these 10 users 10, 0 and 3 of them were active during the weeks of ‘2018–10–15’, ‘2018–10–22’, ‘2018–10–29’ respectively, then for this
cohort there would be 10 active users for period 0 and 3 for period 1 if this step is not carried out. What we need to ensure is that we are correctly reporting 10 for period 0, 0 for period 1 and 3 for period 2.

The output of the above would look like:

This reads (quite phenomenally) as 4 active users from each cohort for each event period shown (note that distinct_id column here shows number of distinct users since it’s the result of a groupby-aggregate method on the distinct_id column; we could rename that, but will do that later)

4) Convert ‘event_period’ date to an integer denoting period after acquisition

So far we have calculated the number of active users per cohort per event_period, with both being represented as (weekly) dates. We now need to transform each event_period to a period number after the acquisition period.

For example, for the acquisition cohort ‘2018–10–15’, the event_periods ‘2018–10–15’, ‘2018–10–22’, ‘2018–10–29’ would correspond to periods 0, 1 and 2 respectively. The event period which is the same as the acquisition period will be denoted as “0” here.

First, we need a function that will do the calculation of how many event periods exist for each user cohort. This will be passed to apply after a groupby in the snippet that follows.

We can then make use of the function above, join the cohort_sizes DataFrame from step(1) and pivot out the cohort periods to have the cohorts as the rows index and the cohort_periods as separate columns.

Running the above would give:

Which shows that all cohorts have 4 users, and each subsequent week we are losing a user.

Converting the above to percentage of active users as a total of the size of each cohort:

4) Dealing with NaN values due to periods not reached yet

Only the 1st cohort period will have values for all the subsequent periods. We will write a function to force all the values that physically cannot exist to be NaNs which will help with the visualisation in the next section.

This mask can now be applied to both the values and percentage retention table.

Visualisation

We will use seaborn to plot the retention heatmap. First, for the rows index, we can combine the users cohort and cohort size into a single (cohort, cohort size) tuple for visualisation purposes. This will help put the numbers into perspective if for period 0 you don’t want to count same day activity or when you are looking at the percentage retention table on its own and want a quick reference to the cohort size.

This is achieved using:

The plotting function is:

And to render the plots we can do:

--

--