Cohort Analysis with Python

Or how to visualize your customer retention — a code-along guide

Fabian Bosler
Aug 18 · 5 min read
Sample retention matrix: follow along and learn how to create your own matrix

A question that I often face is, “What do our cohorts look like?” Investors use the answer to this question to better understand their customers’ lifetime value. Management uses it to identify well-performing cohorts and their common traits so they can focus on those customers. Finance can also use it, to a certain extent, for forecasting.

What you will learn in this article:

  • What are cohorts?
  • What do cohorts look like?
  • How do you generate dummy data?
  • How do you build the function that will generate the cohorts?
  • Some noteworthy Python techniques


This is a code-along article, so you should have:

  • Basic Python understanding
  • Development environment (I recommend Jupyter Notebook/Lab)

What Are Cohorts?

In statistics, marketing, and demography, a cohort is a group of subjects who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation).

In the context of business, a cohort is generally the group of customers who had their first purchase in a given time interval (typically months, but this depends on your business model).
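To make the definition concrete, here is a minimal sketch that assigns each customer to the month of their first purchase. The `orders` table and its column names are made up for illustration and are not part of the article's dummy data.

```python
import pandas as pd

# Hypothetical order log: one row per purchase.
orders = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cid", "Bob"],
    "order_date": pd.to_datetime(
        ["2019-01-15", "2019-03-02", "2019-01-20", "2019-02-11", "2019-02-25"]
    ),
})

# A customer's cohort is the month of their first purchase.
first_purchase = orders.groupby("customer")["order_date"].min()
cohort = first_purchase.dt.to_period("M").astype(str).rename("cohort")

# Count how many customers belong to each cohort.
cohort_sizes = cohort.value_counts().sort_index()
print(cohort_sizes)
```

Ann and Bob both bought for the first time in January, so they share a cohort, while Cid starts the February cohort.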

Cohorts help you understand how a particular group of customers develops over its lifecycle. Questions that cohorts may answer include:

  • Is there a high churn? (This will be indicated by a low number of follow-up bookings/purchases.)
  • Do customers become more engaged over time? (This may be demonstrated by customers booking/purchasing more frequently the longer they stay with you.)

What Do Cohorts Look Like?

Sample retention matrix (generated with a Python function, sample data)
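In table form, a retention matrix has one row per cohort and one column per period since the first purchase. The sketch below builds such a table from a tiny, made-up order log; the column names and the use of `pivot_table` are my assumptions, not necessarily how the function in this article does it.

```python
import pandas as pd

# Hypothetical order log: one row per purchase.
orders = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Bob", "Cid", "Cid", "Ann"],
    "order_date": pd.to_datetime([
        "2019-01-15", "2019-02-02", "2019-01-20", "2019-03-05",
        "2019-02-11", "2019-03-01", "2019-03-09",
    ]),
})

# Earliest order per customer, repeated on every row of that customer.
first = orders.groupby("customer")["order_date"].transform("min")
orders["cohort"] = first.dt.to_period("M").astype(str)

# Whole months elapsed between the first order and this order.
orders["period"] = (
    (orders["order_date"].dt.year - first.dt.year) * 12
    + (orders["order_date"].dt.month - first.dt.month)
)

# Rows: cohorts. Columns: months since first order. Values: active customers.
matrix = orders.pivot_table(
    index="cohort", columns="period", values="customer",
    aggfunc="nunique", fill_value=0,
)
print(matrix)
```

Each cell counts the distinct customers of a cohort who ordered in that period, so column 0 is always the full cohort size.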

How Do You Generate Dummy Data?

1. Install the required packages.

You will most likely have to install names and seaborn first, by running the following commands in your notebook.

!pip install seaborn
!pip install names

2. Set up some seed data.

ADJECTIVES, PEOPLE, and PRODUCTS are written in all caps. By convention, Python uses this notation for constants and for module-level settings.

Note how for PEOPLE we used a so-called list comprehension, a very powerful concept in Python. In our case, we call the function names.get_first_name() 10,000 times and put the unique results into the PEOPLE list.
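The seed data could look roughly like the sketch below. The contents of ADJECTIVES and PRODUCTS are placeholders I made up, and the `first_name` wrapper (with its fallback for when the names package is missing) is a hypothetical helper so the snippet runs either way.

```python
import random

try:
    import names  # third-party package from the install step
except ImportError:
    names = None  # fall back to a tiny stand-in so the sketch still runs

# Placeholder seed values (assumptions, chosen only for illustration).
ADJECTIVES = ["cool", "smart", "shiny"]
PRODUCTS = ["chair", "desk", "lamp"]


def first_name():
    """Return a random first name, using `names` when it is available."""
    if names is not None:
        return names.get_first_name()
    return random.choice(["Ada", "Grace", "Linus", "Guido"])


# The list comprehension calls the function 10,000 times;
# set() drops the duplicates and sorted() turns the result back into a list.
PEOPLE = sorted(set([first_name() for _ in range(10_000)]))
print(len(PEOPLE), PEOPLE[:5])
```

The comprehension itself produces 10,000 names including duplicates, which is why the result is passed through `set()` before it is stored.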

3. Build some helper functions.

4. Build the dummy data function.

5. Generate the dummy DataFrame.

This function generates a sample pandas DataFrame based on dummy_products and dummy_customers. It also accepts optional parameters, such as data_points, which specifies the number of rows in the resulting DataFrame. Again, we make use of list comprehensions.
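As a rough idea of what such a function might look like, here is a simplified sketch. Only the names generate_dummy_data, dummy_products, dummy_customers, and data_points come from this article; the column names, the date range, the value ranges, and the `seed` parameter are assumptions of mine.

```python
import random
from datetime import date, timedelta

import pandas as pd


def generate_dummy_data(dummy_products, dummy_customers, data_points=1000, seed=None):
    """Build a DataFrame of random orders (columns and ranges are assumptions)."""
    rng = random.Random(seed)  # `seed` is an added convenience for reproducibility
    start = date(2018, 1, 1)   # assumed start of the two-year order window
    # One dict per order, built with a list comprehension.
    rows = [
        {
            "customer": rng.choice(dummy_customers),
            "product": rng.choice(dummy_products),
            "order_date": start + timedelta(days=rng.randrange(730)),
            "number_of_items_bought": rng.randint(1, 5),
            "total_order_value": round(rng.uniform(5, 500), 2),
        }
        for _ in range(data_points)
    ]
    df = pd.DataFrame(rows)
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df


df = generate_dummy_data(
    ["cool chair", "shiny lamp"], ["Ada", "Grace", "Linus"],
    data_points=50, seed=42,
)
print(df.head())
```

Passing a fixed seed makes the random data repeatable, which helps when you want to compare cohort plots between runs.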

Let's create our dummy DataFrame:

Our dummy data has the following form:

Screenshot of the first rows of the DF generated by “generate_dummy_data”

6. Enrich the dummy data with order types and times of first orders.
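One way to do this kind of enrichment is with a grouped transform, as in the sketch below. The column names and the two order type labels ("acquisition" and "repeat") are assumptions for illustration; the labels used in the original code may differ.

```python
import pandas as pd

# Small stand-in for the dummy DataFrame (hypothetical columns).
df = pd.DataFrame({
    "customer": ["Ada", "Ada", "Linus", "Linus", "Grace"],
    "order_date": pd.to_datetime(
        ["2019-01-05", "2019-04-10", "2019-02-01", "2019-02-20", "2019-03-03"]
    ),
})

# Broadcast each customer's earliest order date onto all of their rows.
df["first_order_date"] = df.groupby("customer")["order_date"].transform("min")

# Assumed labels: the first order acquires the customer, later ones are repeats.
is_first = df["order_date"] == df["first_order_date"]
df["order_type"] = is_first.map({True: "acquisition", False: "repeat"})
print(df)
```

Unlike a plain aggregation, `transform` returns a result with the same length as the input, so it can be assigned straight back as a new column.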

We are now done with generating the sample data. Feel free to play around with the functions by changing parameters and running the data again.

How to Build the Function That Will Generate the Cohorts

For now, copy-paste the following code block into your Jupyter Notebook and run the cell. The functions become available, but ignore them for the time being.

The following function generates the cohort visualizations:

You can just run it (see the examples below) with different parameters. I’ll explain the parameters after the examples. To create a cohort analysis you would just run:

generate_cohort_analysis(df=df, metric='number_of_orders')
generate_cohort_analysis(df=df, metric='number_of_orders', period_agg='monthly')
generate_cohort_analysis(df=df, metric='number_of_items_bought')
generate_cohort_analysis(df=df, metric='number_of_items_bought',

To understand the details, let us break this one down a little bit.

How does generate_cohort_analysis work?

The function takes the following parameters:

  • a df parameter (which is just the dummy data we created earlier),
  • a metric parameter (which indicates the metric you are curious about, in our example: number_of_orders, number_of_items_bought, or total_order_value)
  • an (optional) record_type parameter, default=all (which lets you subsegment our sample data and only look at a specific group, in our example: all, private, company, or government)
  • an (optional) period_agg parameter, default=quarterly (which lets you choose either monthly or quarterly for your cohorts)
  • an (optional) fig parameter, default=True (which defines whether a figure or the actual data should be produced)
  • an (optional) save_fig parameter, default=True (which defines whether the resulting figure should be saved on disk)
  • an (optional) size parameter, default=10 (which defines the size of the annotations)
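To illustrate what the period_agg option changes, here is a small sketch that buckets order dates into monthly or quarterly periods. The helper `to_cohort_period` and the `PERIOD_FREQ` mapping are hypothetical names of mine; the function in this article may handle the option differently.

```python
import pandas as pd

# Assumed mapping from the period_agg option to a pandas period frequency.
PERIOD_FREQ = {"monthly": "M", "quarterly": "Q"}


def to_cohort_period(dates, period_agg="quarterly"):
    """Convert a Series of timestamps to monthly or quarterly period labels."""
    if period_agg not in PERIOD_FREQ:
        raise ValueError(f"period_agg must be one of {sorted(PERIOD_FREQ)}")
    return dates.dt.to_period(PERIOD_FREQ[period_agg]).astype(str)


dates = pd.Series(pd.to_datetime(["2019-01-15", "2019-05-02", "2019-12-31"]))
print(to_cohort_period(dates, "monthly").tolist())  # ['2019-01', '2019-05', '2019-12']
print(to_cohort_period(dates).tolist())             # ['2019Q1', '2019Q2', '2019Q4']
```

Quarterly cohorts give fewer, larger groups, which tends to produce a less noisy matrix when you have few customers per month.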

You can run the function with your enriched sample data and choose one of the available metrics, and you’ll be presented with a visualization of the new accounts, the retention matrix for that metric, and the return rate of each cohort.
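The relationship between those three outputs can be sketched with a small, made-up retention matrix. The numbers are invented, and reading the return rate as "share of the cohort still active in each period" is my interpretation rather than something confirmed by the original code.

```python
import pandas as pd

# Hypothetical retention matrix: rows are cohorts, columns are periods since
# the first purchase, values are the number of active customers.
matrix = pd.DataFrame(
    {0: [100, 80, 120], 1: [40, 36, 30], 2: [25, 20, 18]},
    index=["2019Q1", "2019Q2", "2019Q3"],
)

# Period 0 holds every customer of the cohort, i.e. the new accounts.
new_accounts = matrix[0]

# Divide each row by its cohort size to express retention as a share.
return_rate = matrix.div(new_accounts, axis=0).round(2)
print(return_rate)
```

Normalizing by cohort size makes cohorts of different sizes comparable: 36 of 80 returning customers (45%) is a better result than 40 of 100 (40%).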

If you want to test the code in an interactive notebook, head here.

Better Programming

Advice for programmers.

