How to measure statistical significance in retention cohorts?

Credit: Paul Levinson

When we measure the retention rate (or churn rate) of customer cohorts at TouchNote, it’s important for us to ensure that any detected differences are statistically significant, and that these differences have sufficient statistical power. (For a quick explanation of why you have to validate for statistical power as well as significance, read here).

Luckily there’s a web tool built by Evan Miller that lets you do just that very easily. Here’s how to test for power and significance in a jiffy.

Test for statistical power

First, you want to test that the sample size (i.e. the size of each cohort) is large enough. There’s no point testing for statistical significance if your results do not have sufficient power.

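Assume your data for the two cohorts are these: of the 7,875 users who joined in week 1, 4,617 were retained, and of the 8,181 users who joined in week 2, 5,049 were retained.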

  1. Go to Evan Miller’s Sample Size Calculator
  2. Enter the baseline retention rate. In this case that’s 58%, roughly the retention of the first week’s cohort. Enter this in the top text box.
  3. Enter the improvement expected between the two cohorts. Let’s say we expect a 5% improvement, so enter 5% in the bottom text box. Then select ‘Relative’, because we expect a 5% improvement relative to the baseline (i.e. 5% of 58%, which is roughly 3 percentage points in absolute terms).
  4. Then, enter the level of statistical power (1-β) required. This is the chance that an effect of the desired size (in our case, 5%) will be detected if it really exists. The default is 80% but I suggest increasing this to 90%.
  5. Lastly, enter the significance level (α). This refers to the maximum chance the desired effect will be detected incorrectly — i.e. that it will be detected even though it does not actually exist. The default is 5%, which is typically fine for most experiments. In this example, I lowered it to 2%. In other words, we’re asking what’s the sample size required in each cohort to have a maximum 2% chance of a ‘false positive’.

Your screen should look like this when you’re done:

This tells you that the minimum sample size (i.e. cohort size) of each cohort needs to be 7,562 users. Luckily, we have more than that in each cohort (7,875 in week 1 and 8,181 in week 2), so we’re OK to proceed.

If for example you increase the statistical power to 95%, you’ll see the minimum sample size increases to 9,165 users.
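
For readers who would rather check these figures in code, here is a minimal Python sketch of the standard normal-approximation sample size formula for comparing two proportions. It is not taken from the calculator itself. The function name is made up for illustration, and the choice to evaluate the alternative rate on the side of the baseline closer to 50% (the more conservative option) is an assumption. Under that assumption the results should land on, or within a user of, the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_per_cohort(baseline, relative_lift, power=0.80, alpha=0.05):
    """Approximate users needed per cohort for a two-sided test of two proportions."""
    # Convert the relative improvement into an absolute one (0.58 * 0.05 = 0.029).
    delta = baseline * relative_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    # Assumption: use whichever alternative rate (baseline - delta or
    # baseline + delta) is closer to 50%, since it has the larger variance.
    alternative = min((baseline - delta, baseline + delta),
                      key=lambda p: abs(p - 0.5))
    sd_null = sqrt(2 * baseline * (1 - baseline))
    sd_alt = sqrt(baseline * (1 - baseline) + alternative * (1 - alternative))
    return round(((z_alpha * sd_null + z_power * sd_alt) / delta) ** 2)


# Settings from the walkthrough: 58% baseline, 5% relative lift, 2% significance.
needed_90 = sample_size_per_cohort(0.58, 0.05, power=0.90, alpha=0.02)
needed_95 = sample_size_per_cohort(0.58, 0.05, power=0.95, alpha=0.02)
print(needed_90, needed_95)

# Both cohorts in the example (7,875 and 8,181 users) should clear the 90% bar.
print(all(size >= needed_90 for size in (7875, 8181)))
```

Raising the power or lowering the significance level pushes the required cohort size up, which is why it is worth running this check before looking at significance.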

We can now test if the difference detected in the retention rate between our two cohorts is statistically significant or not.

Test for statistical significance

We’ll now use another statistical test, Chi-squared (χ²), to check whether the difference in retention rate between the cohorts is statistically significant.

  1. Go to Evan Miller’s Chi-squared Test
  2. Now, enter the actual retention figures from the data. In the top row enter 4,617 and 7,875, which correspond to the number of users retained from the first week’s cohort and the number of users who joined that week.
  3. In the second row, enter 5,049 and 8,181.
  4. Lastly, choose the confidence level. It’s easiest to think of it as 1-α, where α is the chance that we will detect a statistically significant difference where one does not actually exist (i.e., a false positive). The default is 95% but let’s set it to 99%.
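
As an aside, if you want to sanity-check the tool, the same comparison can be sketched in Python using only the standard library. This assumes a plain Pearson chi-squared test on a 2x2 table (retained vs. not retained, per cohort) without a continuity correction, which may differ slightly from what the web tool computes. The function name is illustrative.

```python
from math import erfc, sqrt


def chi_squared_2x2(retained_a, total_a, retained_b, total_b):
    """Pearson chi-squared test (no continuity correction) on two cohorts."""
    # Rows are cohorts; columns are retained / not retained.
    observed = [
        [retained_a, total_a - retained_a],
        [retained_b, total_b - retained_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = total_a + total_b

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if both cohorts shared one retention rate.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_squared_2x2(4617, 7875, 5049, 8181)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.6f}")

# At a 99% confidence level, the difference counts as significant if p < 0.01.
print("significant at 99%:", p_value < 0.01)
```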

Your screen should look like this when you’re done:


The difference we detected between the retention rates of the two cohorts is indeed statistically significant. At a 99% confidence level, the actual retention rate of the first cohort lies between 57.2% and 60.1%, and that of the second cohort between 60.3% and 63.1%. As these two ranges do not overlap, we can say with 99% confidence that the improvement we detected is real.
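
These ranges can be approximated with a simple normal-approximation (Wald) interval around each cohort’s retention rate. The sketch below assumes that interval type; the tool may use a different method, so the last decimal place can differ. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def retention_interval(retained, total, confidence=0.99):
    """Normal-approximation (Wald) confidence interval for a retention rate."""
    rate = retained / total
    # Two-sided critical value, e.g. about 2.58 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / total)
    return rate - margin, rate + margin


week_1 = retention_interval(4617, 7875)
week_2 = retention_interval(5049, 8181)
print(f"week 1: {week_1[0]:.1%} to {week_1[1]:.1%}")
print(f"week 2: {week_2[0]:.1%} to {week_2[1]:.1%}")

# The intervals do not overlap if week 1's upper bound is below week 2's lower bound.
print("no overlap:", week_1[1] < week_2[0])
```

Non-overlapping intervals are a stricter check than the Chi-squared test itself, so two intervals can overlap slightly while the test still reports a significant difference.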

If you found it useful, feel free to share this post.