Retention Series

How to build user cohort properly

Many analysts and even analytics vendors need to correct this mistake

Paul Levchuk
4 min readMar 15, 2024

In the previous post, I briefly mentioned two ways to visualize user cohorts: Calendar-based and Lifetime-based.

The Calendar-based cohorts are very easy to build:

  1. take Matrix Visual in Power BI (or Pivot Table in Excel),
  2. add two columns to axes (Y-axis: signup_date, X-axis: event_date), and
  3. add a measure [# users]:
Calendar-based cohorts (right-aligned).

As it’s not convenient to compare the retention of different cohorts, building left-aligned cohorts is recommended.

To build a left-aligned cohort we need to switch from calendar dates to user lifetime that is we need to build Lifetime-based cohorts.

In the Lifetime-based approach, in the horizontal axis instead of using dates, we use periods when users did some actions: [period_d] = [event_date] — [signup_date].

Let’s do it, let’s build Lifetime-based cohorts:

Lifetime-based cohorts, absolute figures.

As you can see both charts above have the same figures and that’s expected.

As acquisition cohorts can be different by size, it’s difficult to compare them. To get it simpler let’s calculate the percentage of users to users in the initial period t = 0.

Let’s build Lifetime-based cohorts with percentages:

Lifetime-based cohorts. Periods are calculated by day intervals.

Looks good? I don’t think so.

The mistake we made above is that we used [signup_date] and [event_date] without taking into account at what time these two events happened.

User Acquisition could happen for the whole 24 hours. It means that:

  • some users could sign up at 00:30
  • other users could sign up at 23:30

Users who signed up at 00:30 had 23 hours and 30 minutes to interact with the product, get the first piece of value, and return the next day.

At the same time, users who signed up at 23:30 had just 30 minutes to interact with the product, get the first piece of value, and return the next day.

Obviously, these two groups of users have unequal conditions.

So, let’s build Lifetime-based cohorts taking into account the time part: [period_24h] = INT( DATEDIFF( [signup_dt], [event_dt], HOUR ) / 24 ).

Lifetime-based cohorts, percentages. Periods are calculated as 24-hour intervals.

Have you noticed that something has changed?

To make it clear, let me put both calculations into one chart:

Cohorts with different period approaches: day intervals vs 24-hour intervals.

There are a few general observations that we can make from the chart above:

  • Lifetime-based cohorts calculated by 24-hour intervals in general look worse compared to Lifetime-based cohorts calculated by day intervals
  • the largest difference is in the first period (t = 1)
  • it seems that over time the difference between approaches becomes less noticeable

To have a better sense of the difference let’s compare the Cohorts period_d / Cohorts period_24h ratio over time:

Cohorts period_d / Cohorts period_24h ratio over time.

From the chart above we can learn the following:

  • at period t = 1, the difference between approaches is whooping 1.81x times!
  • at period t = 2, the difference is smaller, but still quite big: 1.29x times
  • the median difference of approaches is 1.11x times

SUMMARY

While it looks like a simple task to build a Lifetime-based cohort, it’s important to do it properly:

  • never use the day interval approach as it’s overly optimistic (from my experience the difference could be between 10 and 30% on average)
  • in the first period, the retention of cohorts calculated by the day intervals will be completely wrong!
  • 24-hour interval is the only reliable way to build and assess user cohorts

In the next post, I will show how to color and read user cohorts properly.

--

--

Paul Levchuk

Leverage data to optimize customer lifecycle (acquisition, engagement, retention). Follow for insights!