User acquisition stats

Calculating the number of new, active and returning users in python

Published in

Multiply

6 min readSep 17, 2019

This is the first part of the series Mobile Analytics using Python, describing how to calculate your key app usage metrics using nothing but Python.

Code snippets provided in each section and the Github repo can be found here.

Objective

What we are trying to calculate in this section is the number of:

New acquired users per period
Number of acquired users who are active per period
Number of returning users per period (i.e. active users who were acquired during a previous period)

Why bother?

The objective above might sound quite straightforward, but you would be surprised to find out that it could be extremely difficult, if not impossible, to calculate these metrics using out of the box functionalities of analytics tools.

If the user acquisition is more complicated than a person downloading and signing up, or if the event that defines the acquisition point is an event the user can repeat multiple times, then things start getting messy.

Throw duplicated users and other potential data issues into the equation and that’s how you start writing a python module that serves nothing but to correct (preprocess) the raw data before you even start calculating metrics.

I know that analytics tools are not responsible for your data quality, BUT, startups will inevitably go through a journey of figuring out what to capture, how to capture it and the pressure of launching betas/products will lead to bugs in the releases. It is therefore a necessity to be able to do advanced filtering and remapping of your dataset before calculating metrics.

Starting Point

I will assume that you have your data in a pandas DataFrame, named events, with the following columns:

distinct_id: distinct user id
name: event name
properties: a dict object with the properties associated with the event
time: timestamp at which the event took place

Dummy data for illustration purposes (Note: “properties“ not needed for this section)

Requirements

Python (I am using v3.7)
Pandas: for data manipulation
Plotly: for visualisations*

*A note on Plotly: One could use matplotlib instead, but way more lines of code would be needed to make the plots look pretty, be annotated and offer interactivity. We will be creating plots that are easy to dive into and share with other people.

Code

1) Calculate the acquisition time for each user

Set the event that will denote the user acquisition point; this will be declared as acquisition_event_name. Since this can be done multiple times by a user, the first time it was performed for each user needs to be found and this will correspond to the acquisition time for that user.

Note: we could use pandas groupby instead, but found the function bellow to be faster.

The output of the above function would look like:

{5444: Timestamp('2018-10-18 17:26:05'),
 996: Timestamp('2018-10-19 09:44:23'),
 ...,
 5657: Timestamp('2018-10-19 11:12:17')}

A new column could now be added to your events DataFrame by mapping the dict above to the distinct_id column, and this will be done in the function below.

2) Calculate the cohort for each user and period for each event

In mobile analytics, a common practice is to group users into cohorts, defined by the period during which the users got acquired; this could be weekly/monthly.

If a user got acquired on 3rd Sept 2019, the weekly cohort for this user would be ‘2019–09–02" (with weeks starting on a Monday) and the monthly cohort would be “2019–09”.

Example of how the monthly cohort period could be calculated:

Similarly, we can also define the period that any event took place (simply a transformation of the time column to a weekly/monthly period column).

This will allow a comparison between the event_period and the acquisition cohort for each user, which will, in turn, allow for our objective metrics to be calculated.

If any event took place at a time equal to or greater than the acquisition time, the user is tagged as “active” (i.e. user_active would be set to True for that event for that user).

If any event took place during an event_period later than the acquisition period (cohort) for that user, the user is tagged as “returning” (i.e. user_returns would be set to True for that event for that user).

Let’s wrap everything in one function:

Running the above for my dummy data, using a weekly period, here are five random rows of the output DataFrame:

3) Calculate the number of new, active and returning users per period

All that is left now is a series of groups by operations for each metric we are after, and then join the resulting pandas DataFrames/Series together into a single DataFrame.

As a bonus, we can also calculate the number of organic and non-organic acquired users per period.

Organic acquisition is any acquisition that did not come from a paid/marketing channel like Facebook/Google ads. The split between the two is quite important for a startup, with a usual objective being an increase in organic growth over time.

In my dummy data, I have a user_source field within the properties column which I can extract into a separate column:

In the function below, you can define the user_source_column.

The last part of the function above calculates two new metrics:

W/W Growth: week-on-week percentage growth. Ideally, we want this to be always in the positives and increasing with time.
N/R Ratio: new-to-returning users ratio. Indicative of how the period activity is shared between new and returning users.

Running the above for my dummy data, using a weekly period, here is the output DataFrame:

Visualisation

The plot will consist of four sections (separate y-axes):

Breakdown of acquired users by source (organic/non-organic).
Breakdown of user activity (active/returning)
Periodic growth (week-on-week)
N/R Ratio

All four y-axes will share the same date type plotly x-axis to allow interactivity between them.

Before writing the plotting function, a few points on plotly:

We need to create a trace for each series we want to plot and add all of the traces of a single list named data.
Most of the formatting is done in the layout dict.
The plotly figure is a dictionary composed of data and layout.

The function above will transform the data for you (calling the previous functions) and return the plotly figure. To render this in a jupyter notebook all that is left is:

Feel free to interact with the chart below to appreciate plotly ’s functionalities.

This was generated using: