Churn Rate: What it is and how to calculate it

Nikolay Gusev

Published in

JoomTech

11 min readMar 7, 2023

In this article, we’re going to discuss Churn Rate, as well as look into the following questions:

What is Churn Rate?
Why do you need this metric?
How can you calculate it using SQL?
How can you calculate it using Python?

Our goal is to provide you with answers as quickly and clearly as possible as to why this metric is needed and how to calculate it correctly using various software. This article also gives examples of code that you’re free to use to generate data and calculate Churn Rate on your own.

What is Churn Rate?

Churn Rate is a metric that allows you to answer the question: What proportion of users have stopped using a service, or in other words, have fallen off? A churned (or dropped) user is one who, since the day of their previous activity, has not logged into the service for some fixed period of time (10 days, 2 weeks, 1 month, etc.). Why is a time frame necessary? The answer is simple — we cannot wait forever to say for sure whether a particular user will return or not. For this reason, we calculate within a certain period, and:

The larger the time frame, the more accurate the metric.
The smaller the time frame, the sooner we will know the result.

Example: On December 1, 100 users logged into our app, and we are interested in what proportion of them will not log into the service again within 28 days. Let’s say there were 17 such users → Churn Rate = 17%. (in this case, we can calculate the metric only after 28 days from the starting date).

Calculation formula

Churn Rate = [number of dropped users] / [number of users who were active on the selected day]

Returning to the previous example, we have:

Day X: December 1
Time frame: 28 days
Date X + Time frame: Current date + 28 days (until December 29)
Users who were active on the selected day: 100 people who logged into the app on December 1
Churned users: 17 people who logged into the app on December 1 but didn’t log in again for 28 days, i.e. until December 29

It’s important to understand that Churn Rate in most cases will not be equal to 1 — Retention Rate (the proportion of users who were active on date X and returned to the service after a certain period of time).

Why is it that Churn Rate ≠ 1 — Retention Rate? It’s because within a given period, users don’t fall off in one big group at the same time. Here is a Retention Rate chart for different periods on date X:

We see that some of users were active during the Churn time frame → if a user was inactive on day 28, this does not mean that they were also absent throughout the previous period, so Churn Rate ≠ 1 — Retention Rate (but if Retention Rate were cumulative, then the equality would hold).

Metric calculation in practice

Software:

SQL — Google BigQuery
Python — Jupyter Notebook
BI — Data Studio

To calculate the metrics, we will use synthetic data. We generate it like this:

Import libraries

import names
import random
import datetime as dt
import pandas   as pd

2. Generate an array of values for the future dataframe (may take several minutes)

# this will store all the usernames we will use
# we will have no more than 10000 such names (they will not necessarily be unique)
names_list = []
for i in range(10000):
    names_list.append(names.get_full_name())
    
# values to be substituted randomly
platforms = ['ios', 'android']
columns   = ['event_date', 'platform', 'name', 'event_type']
events    = ['purchase', 'productPreview', 'addToCart', 'productClick', 'search', 'feedbackSent']

# form the correspondence of the name and platform of the user
# (for simplicity, we assume
# that each user only uses one of the two possible platforms)

names_platform_dict = {}
for name in names_list:
    names_platform_dict[name] = random.choice(platforms)
    
# form arrays of users according to the degree of uniqueness
regular_users     = names_list[:2000]
not_regular_users = names_list[2000:]

# rows for the future dataframe will be stored here
df_values = []

# main loop
# go through all dates from December 1, 2021 to May 29, 2023

for date_index in range(180):
    date = dt.date(2022, 12, 1) + dt.timedelta(days=date_index)
    
    # how many lines will correspond to one day
    rows_number  = random.randrange(2000, 3000)
    
    for name_index in range(rows_number):
        # here we set the weekly seasonality
        # on weekdays the proportion of regular users will be higher

        if (date_index + 2) % 7 == 0 or (date_index + 3) % 7 == 0:
            list_of_names = random.choices([regular_users, not_regular_users], weights = [4, 1])[0]
        else:
            list_of_names = random.choices([regular_users, not_regular_users], weights = [7, 1])[0]
        
        name       = random.choice(list_of_names)
        event_type = random.choice(events)
        platform   = names_platform_dict[name]
        
        # add a row to the array of all rows for the future dataframe
        df_values.append([date, platform, name, event_type])

3. Received dataframe

# create a dataframe
events            = pd.DataFrame(data=df_values, columns=columns)
events.event_date = pd.to_datetime(events.event_date)
events

In our case, the original dataframe looks like this:

You can also find basic information about the dataframe

4. Data upload

# In the source directory where the file with the code is located, you can find the csv file
events.to_csv('./example_events.csv')

Context

We work for a company that sells goods via a smartphone app. The app is available on two platforms: Android and iOS. The company considers a churn metric with a time frame of 28 days.

How to calculate Churn Rate using SQL

We upload the data to Google BigQuery (any environment can be used).

Let’s move on to calculating the metrics.

Since it’s customary to calculate Churn Rate with a time frame of 28 days (in the described case), it would be incorrect to calculate it for those dates from the moment of which the 28th day has not yet come. For this reason, we will cut these lines at the last stage of the query.
Since a user can have multiple events on the same day, it’s a good idea to leave only the unique matches for platform, user and date beforehand.
For each user on the date in question, we find the date of their next event, then calculate the number of days before this date. The resulting number of days will be compared with the time frame of the metric. If a user has been absent for ≥ 28 days from the date in question, they are considered “churned”.

This results in the following code:

-- find unique matches for the name, platform and date of activity

-- we only care if a user was active on a particular day
-- it doesn't matter what actions they performed

WITH names_days AS (
    SELECT DISTINCT
        platform,
        event_date,
        name,
        -- with the window function, find the date of the next user activity
        LEAD(
            event_date
        ) OVER(PARTITION BY name ORDER BY event_date) AS next_event_date
    FROM example_events
),

main AS (
    SELECT
        *,
        -- find how many days later the next user event happened
        DATE_DIFF(next_event_date, event_date, DAY) AS days_till_next_event
    FROM names_days
)

SELECT
    event_date,
    -- find the share of users:
       -- who have been absent for at least 28 days from this day out of all users
       -- and who were active that day

     -- you can also add a slice by platform here

     -- `distinct` below is not needed in this case,
     -- but with a different format of the source data, it may be needed

    COUNT(DISTINCT
        IF(
            days_till_next_event < 28,
            NULL,
            name
        )
    ) / COUNT(DISTINCT name) AS churn_rate
FROM main
WHERE 1 = 1
    -- cut rows where 28 days have not passed since the current date 
      AND event_date <= (SELECT DATE_SUB(MAX(event_date), INTERVAL 28 DAY) FROM example_events)
GROUP BY 1
ORDER BY 1

If there are no window functions in the version of SQL you are using, you can:

Access the source table sorted by name, event_date through a subquery and apply the LEAD / LAG function without a window (providing in the SELECT clause a condition for matching the name. Example: IF(LAG(name) = name, LAG(event_date), NULL)
Number the rows in the source table sorted by name, event_date and join it to itself through table.index = table.index + 1 AND table.name = table.name

At the output, we get a table in which the Churn Rate indicator is calculated for each date with a time frame of 28 days:

The result of the received query is not very convenient to analyse if we’re looking at the result of the SQL query in a table format. Therefore, we visualise the result using a BI tool:

We may also be interested in how the metric behaves depending on the chosen platform. In order to display such a graph, you need to add the additional grouping field platform to the original SQL query (also, do not forget to specify it in SELECT)

In both cases, seasonality is observed: on weekends, the outflow of users is higher.

How to calculate Churn Rate using Python

Import libraries

import numpy             as np
import datetime          as dt
import pandas            as pd
import seaborn           as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

Load data from the csv file

events = pd.read_csv('./example_events.csv', parse_dates = ['event_date'])

Let’s perform similar transformations using Python. To begin with, let’s leave only unique combinations of name, platform, event_date

events_grouped = events.groupby(['event_date', 'platform', 'name'])\
                       .agg({'event_type' : 'count'})\
                       .reset_index()

events_grouped.columns = ['event_date', 'platform', 'name', 'events_number']

Now for each user, we’ll find the date of their next event. We’ll also find how many days later their next event took place

events_grouped['next_event_date']      = events_grouped.groupby(['name', 'platform'])['event_date'].shift(-1)
events_grouped['days_till_next_event'] = (events_grouped['next_event_date'] - events_grouped['event_date']) / np.timedelta64(1, 'D')

For simplicity, let’s add a marker for whether we consider the user to be churned on the date in question

events_grouped['is_churned_int'] = events_grouped.days_till_next_event.apply(lambda x: 0 if x < 28 else 1)

After all the transformations, the data looks like this:

We now have everything we need to calculate the Churn Rate. Let’s find the maximum date from which 28 days have passed, then cut off unnecessary lines. After that, for each date, we calculate the number of unique users (DAU) and the number of those who are in the outflow. Then we calculate the share of those who are in the outflow of the total.

# we look for the maximum date for which the metric calculation can be considered correct
max_relevant_date = events_grouped.event_date.max() - dt.timedelta(days=28)

churn_rate = events_grouped.query('event_date <= @max_relevant_date')\
                           .groupby(['event_date'])\
                           .agg({'is_churned_int' : 'sum', 'name' : 'nunique'})

churn_rate.columns = ['churned', 'dau']

# calculate the metric
churn_rate['churn_rate']      = churn_rate['churned'] / churn_rate['dau']
# convert it to a percentage
churn_rate['churn_rate_perc'] = churn_rate['churn_rate'] * 100

We get the following table with our Churn Rate:

We visualise the results obtained using Python

palette = sns.color_palette(['#ff6063', '#32b6aa', '#ffcc53'])

sns.set_palette(palette)

plt.figure(figsize=(20,10))
plt.grid()

ax = sns.lineplot(data=churn_rate, y='churn_rate_perc', x='event_date')

ax.set_title ('Churn Rate', fontsize=15)
ax.set_xlabel('Date', fontsize=15)
ax.set_ylabel('Outflow, %', fontsize=15)

ax.set(ylim=(0, 50))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

plt.show()

If we want to calculate the metric separately for each platform, then that platform field must be added to the grouping fields, and also placed in the legend when plotting charts

churn_rate_platform = events_grouped.query('event_date <= @max_relevant_date')\
                                    .groupby(['event_date', 'platform'])\
                                    .agg({'is_churned_int' : 'sum', 'name' : 'nunique'})

churn_rate_platform.columns = ['churned_names', 'dau']

churn_rate_platform['churn_rate']      = churn_rate_platform['churned_names'] / churn_rate_platform['dau']
churn_rate_platform['churn_rate_perc'] = churn_rate_platform['churn_rate'] * 100

plt.figure(figsize=(20,10))
plt.grid()

ax = sns.lineplot(data=churn_rate_platform, y='churn_rate_perc', x='event_date', hue='platform')
ax.set(ylim=(0, 50))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

ax.set_title ('Churn Rate', fontsize=15)
ax.set_xlabel('Date', fontsize=15)
ax.set_ylabel('Outflow, %', fontsize=15)

plt.show()

Conclusion

We’ve analysed a traditional method of calculating Churn Rate. In some cases, it’s more efficient to calculate this metric in a different way — it all depends on the business context and the task being solved.

Usually we are not interested in specific events, like product clicks or search queries. We just need to know whether a user was active during the time frame. Hence, Churn Rate may be calculated using a wide range of events that may be attributed to user activity. However, there are situations where certain events should be disregarded. For example, if we’re talking about search engines, which are often set in the browser as the “home page”, taking into account something from the “user opened the page” category when calculating Churn would not be very useful, because this can hardly be considered as genuine user activity.

Possible questions

There are dates for which we still do not know whether the user logged in in the subsequent period or not (we’re talking about the metric calculation time frame). What happens if you don’t cut such lines?

Starting from your current (or last available) date, minus the time frame, the metric will skyrocket. For the most recent day, its value will hit 100% since we won’t know about the subsequent activities of these users — in the calculation they will be considered as part of the outflow. The graph in this case would look something like this:

2. But the metric doesn’t tell us who the service has truly lost forever (Actual Churn)!

That’s true, but what is “Actual Churn” anyway? We can never know for sure whether a user will log in to our service again (even if they’ve been absent from it for several years). The metric is needed in order to track a trend in dynamics, thus it will not tell us how many users we have lost forever.

3. We noticed an anomalous change in Churn Rate, what are the causes?

There can be infinitely many reasons for changes in Churn Rate, and for each area they will be different. Let’s try to cover the main ones:

Advertising

budget reallocation;
advertising campaigns have been adjusted;
new ad channels were added.

2. Product changes

prices have been changed;
switched to a different type of product (application, website, widget, etc.);
Certain features have been added or removed (the app now has the ability to make calls or now it is impossible to pay for goods with certain kinds of electronic wallets).

3. Other

Competitors’ activity;
Influence of substitute goods;
Current events and news.

It’s more important to be able to answer the following questions:

On the basis of what audience is the metric calculated?
What could have caused the negative experience?
What can disturb the audience now?
How can I choose a time frame for calculating the metric?

The size of the time frame used depends on your specific business and expectations. We need to understand that using a larger time frame, we will wait longer, but we will get more accurate results.

Thank you for your attention!

We appreciate you reading till the end! We hope the article was useful.

Should you have any questions, feel free to ask them in the comments below — we’ll be happy to answer them. Also, if you think it’s worth considering other topics related to metric construction or business analytics, let us know. We’ll try to touch on them in a future post.