Cohort Analysis: Using Python!
What is a cohort?
A cohort is a group of people who share something in common within a specific time frame. For example, a group of people born in New York in the year 1991. This is an example of a “birth cohort”.
Similarly, in the business world cohorts are made up of customers or users. For example:
- number of users that downloaded the app in a given time period, or
- number of users who canceled a subscription during the same month
Cohorts basically make it easier to analyze user behavior and look at trends without having to look at each user individually.
Cohort Analysis
It is a type of behavior analytics that uses data from sources such as web applications and different platforms and breaks that data into multiple cohorts for further analysis. It's an effective way to gather information regarding customer behavior and how they interact with the product. One of the most important uses of cohort analysis is to improve customer retention.
Cohort analysis is valuable due to the specificity of information it provides. It assists companies to answer some of the targeted questions by looking at the relevant data. Some advantages of cohort analysis are:
- Help understand how user behaviors can affect one’s business, such as acquisition and retention
- Understand customer churn
- Calculate the customer lifetime value
- Create more effective customer engagement
Types of cohorts:
- Acquisition cohorts - these cohorts are based on when a user signs up or purchases the product. Acquisition cohorts help businesses figure out customer’s LTV, average churn time, and other retention related insights
- Behavioral cohorts - the users are grouped on the basis of their behavior and actions they take in a given period of time. For example, a time when users install/uninstall the app, which features do users use the most, etc.
How to perform cohort analysis (in Python)?
For the purpose of this project, we’ll leverage the Online Retail II dataset from the UCI Machine Learning repository. This data set contains 525,461 transactions for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011.
Features:
InvoiceNo: Uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation
2. StockCode: Product (item) code
3. Description: Product (item) name
4. Quantity: The quantities of each product (item) per transaction
5. InvoiceDate: The day and time when a transaction was generated
6. UnitPrice: Product price per unit in sterling
7. CustomerID: A 5-digit integral number uniquely assigned to each customer
8. Country: Name of the country where a customer resides
In the dataset, one of the columns we have is “InvoiceDate”, the day when a transaction was done. So we will create cohorts and look at customers who remain active during particular cohorts over a period of time.
Data Preprocessing:
Before performing the analysis we need to ensure that our data is clean.
A. Removing all the canceled transactions:
Because the order is canceled, its quantity is negative. In our dataset, we have 10206 or 1.94% canceled transactions that we need to remove. The dataset also has around 2121 entries that aren’t marked as canceled in the dataset but have negative quantities. We remove those entries from the dataset as well.
B. Removing duplicate data:
The dataset contains around 6841 or 1.33% duplicate entries that need to be removed
C. Removing the NaN values:
From the picture on our left, we can see that both the Customer ID and Description column have some missing values.
Approximately 21% of values under the Customer ID column and around 0.22% of values under the Description column are NaN. We remove all the missing values from the dataset.
Our dataset finally has 400,947 transactions and no null columns. Looking at the total number of unique products, transactions, and customers in the data, we observe that there are 4313 unique customers and for 4017 unique products in the data, there are 19215 transactions. This means that each product is likely to have multiple transactions in the dataset.
Performing cohort analysis:
We start by creating cohorts for our analysis. In order to do so, we first create two new columns, “Invoice_day” and “Cohort_month”.
- Invoice_day - indicates the date of the transaction by taking the first day of the month and year of InvoiceDate for each transaction
- Cohort_month - indicates the date of the customer’s first transaction. To create this column we first group our dataset by “Customer ID” and then select the first purchase date by leveraging the min() function
We then find the difference between “Invoice_day” and “Cohort_month” in terms of the number of months and call the column “Index”.
Leveraging the pivot function we create a cohort analysis matrix by grouping the data by Cohort_month and Index, which we created above. The values in the matrix represent unique customers who made a purchase during that time period. The final matrix looks like this:
Interpreting the Cohort Analysis Matrix:
Let's start by looking at one cohort. From left to right, we see that:
- 2009–12–01: This is the “cohort” - customers who made a purchase in Dec of 2009
- 955: This is the number of customers in this cohort. So 955 customers purchased products in Dec of 2009
- 337: This is the number of customers who purchased again in the next month. So in Jan 2010, 337 customers made a purchase, and 618 did not
Each column after the “0” column shows the number of customers from that cohort after each month. So under the “1” column, we can see that 337 out of 955 customers made a purchase in the first month, in the “2” column 319 out of 955 customers made a purchase, and so on.
Calculating Retention Rate:
Retention rate refers to the percentage of customers who continue using/paying for a product over a given timeframe. Using the above cohort analysis matrix we can create a retention rate heatmap.
From the heatmap, we can see that for the Cohort_month: 2010–12–01, the average retention rate is approximately 35%, with the highest retention rate occurring after 11 months (50%).
Thus an effective cohort analysis can help a business uncover useful insights and understand their customer’s purchasing behaviors or their potential customer’s lifetime value.