Cohort Analysis Using Python and Pandas

Sanchika Gupta · Published in The Startup · Sep 3, 2020

Backstory

I stumbled upon an interesting task while doing a data exercise for a company. It was about cohort analysis based on user activity data. I got really interested, so I thought of writing this post.

This article provides insight into what cohort analysis is and how to analyze data for plotting cohorts. There are various ways to do this; I discuss a specific approach that uses Python and pandas to track user retention, and then provide some analysis to figure out the best traffic sources (organic/inorganic) for an organization.

Cohort Analysis

Let’s start by introducing the concept of cohorts. The dictionary definition of a cohort is a group of people with some common characteristic. Examples of cohorts include birth cohorts (a group of people born during the same period, like ’90s kids) and academic cohorts (a group of people who start working through the same curriculum to finish a degree together).

Cohort analysis is specifically useful for analyzing user growth patterns for products. In terms of a product, a cohort can be a group of people with the same sign-up date, the same usage start month/date, or the same traffic source.

Cohort analysis is an analytics method by which these groups can be tracked over time for finding key insights. This analysis can further be used to do customer segmentation and track metrics like retention, churn, and lifetime value. There are two types of cohorts — acquisitional and behavioral.

  • Acquisitional cohorts — groups of users based on their signup date, first use date, etc.
  • Behavioral cohorts — groups of users based on their activities in a given period of time, for example when they install, uninstall, or delete the app.

In this article, I will be demonstrating the acquisition cohort creation and analysis using a dataset. Let’s dive into it:

Setup

I am using pandas, NumPy, seaborn, and matplotlib for this analysis. So let’s start by importing the required libraries.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The Data

This dataset consists of usage data from customers for the month of February. Some of the users in the dataset started using the app in February (if ‘isFirst’ is true) and some are pre-February users.

df = pd.read_json("data.json")
df.head()

The data has 5 columns:

  • date: date of the use (for the month of February)
  • timestamp: usage timestamp
  • uid: unique id assigned to users
  • isFirst: true if this is the user’s first use ever
  • utmSource: traffic source from which the user came

We can compute the shape and info of the dataframe as follows

df.shape
(4823567, 5)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4823567 entries, 0 to 4823566
Data columns (total 5 columns):
date datetime64[ns]
isFirst bool
timestamp datetime64[ns]
uid object
utmSource object
dtypes: bool(1), datetime64[ns](2), object(2)
memory usage: 151.8+ MB

Below is a table of contents for the data analysis. I will first show the data cleaning I did for this dataset, followed by the questions that this exercise will answer. The most important part of any analysis-based project is deciding what questions will be answered by the end of it. I will answer three questions (listed below), followed by some more analysis, a summary, and conclusions.

Table of Contents

  • Data Cleaning
  • Question 1: Daily active users (DAU) over the month
  • Question 2: Daily retention curves for selected cohorts
  • Question 3: Usage differences by traffic source
  • Analysis
  • Summary
  • Conclusions

Data Cleaning

Here are some of the tasks I performed for cleaning my data.

Null values:

  • Counted the null values in the dataframe: utmSource had 1,674,386 null values
  • Created a new column ‘trafficSource’ where null values are marked as ‘undefined’

Merge Traffic Sources:

  • Merged traffic Sources using regular expression: facebook.* to facebook, gplus.* to google, twitter.* to twitter
  • Merged the traffic sources with fewer than 500 unique users into ‘others’. This was done because 11 sources had only 1 unique user, another 11 sources had fewer than 10 unique users, and another 11 sources had fewer than 500 unique users.
  • Finally reduced the number of traffic sources from 52 to 11.
df.describe()
#Let’s check for null values
df.isnull().sum()
date 0
isFirst 0
timestamp 0
uid 0
utmSource 1674386
dtype: int64

Looks like utmSource has a lot of null values: roughly 35% of the values are null. I created a new column ‘trafficSource’ where null values are marked as ‘undefined’.
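A minimal sketch of this step, assuming the column names shown by df.info() above (the exact code is on my GitHub):

# Replace missing traffic sources with the label 'undefined'
df['trafficSource'] = df['utmSource'].fillna('undefined')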

Next, I took care of similar traffic sources like facebook, facebookapp, src_facebook, etc by merging them.

I then counted the unique users from each traffic source. This was done to check whether some traffic sources have very few unique users compared to others; if so, they can all be merged. This reduces the number of traffic sources to analyze without any significant loss in the accuracy of the analysis. So I merged the traffic sources with fewer than 500 (0.2% of the total) unique users into ‘others’, finally reducing the number of traffic sources from 52 to 11.
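A rough sketch of the merging, again assuming the column names above (the regex patterns and the 500-user threshold are the ones described in the cleaning steps):

# Collapse variants of the same source using regular expressions
df['trafficSource'] = (df['trafficSource']
                       .str.replace(r'facebook.*', 'facebook', regex=True)
                       .str.replace(r'gplus.*', 'google', regex=True)
                       .str.replace(r'twitter.*', 'twitter', regex=True))

# Merge sources with fewer than 500 unique users into 'others'
users_per_source = df.groupby('trafficSource')['uid'].nunique()
small_sources = users_per_source[users_per_source < 500].index
df.loc[df['trafficSource'].isin(small_sources), 'trafficSource'] = 'others'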

Now let’s answer the questions.

Question 1:

Show the daily active users (DAU) over the month.

A user is considered active for the day if they used the app at least once on a given day. Tasks performed to answer this question:

  • Segregated the users who started using the app in February from all the users.
  • Calculated the DAU for
    - users who started in the month of February
    - total number of active users
  • Plotted this on a graph (a rough code sketch follows).
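A minimal sketch of how these DAU counts could be computed and plotted (the actual code is in the linked GitHub repo):

# Users whose first-ever use falls in February
feb_starters = df.loc[df['isFirst'], 'uid'].unique()

# Daily active users: unique uids seen on each date
dau_all = df.groupby('date')['uid'].nunique()
dau_feb = df[df['uid'].isin(feb_starters)].groupby('date')['uid'].nunique()

plt.figure(figsize=(12, 6))
dau_all.plot(label='All users')
dau_feb.plot(label='Users who started in February')
plt.xlabel('Date')
plt.ylabel('Daily active users')
plt.legend()
plt.show()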
Figure 1: Daily active users for the month of February

Figure 1 shows the daily active users (DAU) for the month of February. I plotted two curves: one DAU curve for all users and one for users who started using the app in February. As we can see from the graph, the daily active count for February starters keeps increasing, while the DAU curve for all users has significant periodic dips (which could be attributed to less usage during the weekends) and slower net growth compared to the February-only users.

Question 2

Calculate the daily retention curve for users who used the app for the first time on specific dates. Also, show the number of users from each cohort.

The dates which were considered for creating cohorts are Feb 4th, Feb 10th, and Feb 14th. Tasks done to answer this question are:

  • Created cohorts for all the users who started on the above dates.
  • Calculated daily retention for each day of February, taking each of the above dates as a starting date. The daily retention curve is defined as the percentage of users from the cohort who used the product that day.

The function dailyRetention takes a dataframe and a date (of cohort creation) as input and creates a cohort of all the users who started using the app on that date. It outputs the total number of unique users in that cohort and the retention percentage, relative to the starting date, for each day of February.
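A sketch of what such a function could look like (this is my reconstruction of the described behavior, not necessarily the exact implementation; the example date uses a placeholder year, since the dataset only specifies February):

def dailyRetention(df, start_date):
    # Cohort: users whose first-ever use was on start_date
    cohort_uids = df.loc[df['isFirst'] & (df['date'] == start_date), 'uid'].unique()
    cohort_size = len(cohort_uids)

    # Percentage of the cohort active on each day of February
    active_per_day = df[df['uid'].isin(cohort_uids)].groupby('date')['uid'].nunique()
    retention = active_per_day / cohort_size * 100
    return cohort_size, retention

size_feb4, retention_feb4 = dailyRetention(df, '2020-02-04')  # placeholder year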

Figure 2 shows the total number of unique users from each cohort.

Figure 2. Number of unique users from each cohort

The code below makes the data ready for creating a heatmap by adding a cohort index and then pivoting it, with cohort start dates as the index, days of February as the columns, and the percentage of unique users who used the app on that day as the values. It then plots a heatmap.
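Under the same assumptions as above, and reusing the dailyRetention sketch, the reshaping and heatmap could look roughly like:

cohort_starts = ['2020-02-04', '2020-02-10', '2020-02-14']  # placeholder year

rows = []
for start in cohort_starts:
    size, retention = dailyRetention(df, start)
    for day, pct in retention.items():
        rows.append({'CohortStart': start, 'DayOfFeb': day.day, 'Retention': pct})

cohort_df = pd.DataFrame(rows)
retention_pivot = cohort_df.pivot(index='CohortStart',
                                  columns='DayOfFeb',
                                  values='Retention')

plt.figure(figsize=(16, 4))
sns.heatmap(retention_pivot, annot=True, fmt='.0f', cmap='Blues')
plt.title('Retention rate (%)')
plt.show()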

Figure 3. Retention rate (%)

Figure 3 shows a heatmap for the daily retention for users who used the app for the first time on Feb 4th, Feb 10th, and Feb 14th. From the heatmap, we can see 100% retention on the first day of usage. And retention decreases to as low as ~31% for some of the days.

Figure 4 shows the daily retention curve for the month of February. Initially the retention is 100%, but it keeps decreasing and stabilizes after about a week.

Figure 4. Daily retention curve

This retention curve immediately reflects an important insight: about 40–50% of the users stop using the app after the first day. After that initial large drop, a second brisk drop occurs after the 10th day, to roughly 35–45%, before the curve starts to level off after the 13th day, leaving about 37–43% of the original users still active in the app on the last day of February.

The above retention curve indicates that a lot of users are not experiencing the value of the app, resulting in drop-offs. Hence, one way to fix that is to improve the onboarding experience which can help the users in experiencing the core value as quickly as possible, thereby boosting retention.

Question 3

Determine if there are any differences in usage based on where the users came from. From which traffic source does the app get its best users? Its worst users?

The tasks performed to answer this question are:

  • Data Cleaning: To clean user ids with duplicate sources.
  • Feature Engineering: Engineered new features to find the best and worst sources.

Data Cleaning:
Identifying the best or the worst sources required some data cleaning. Some users had more than one traffic source, so I did some cleaning to remove/merge these multiple sources.

  • 1.64% of user ids (4,058 unique uids) had more than one source.
  • Since the duplicate traffic source uid count is not significant and there was no reliable way to attribute a single source to these uids, I simply removed these uids from my analysis.


The code below groups all users with multiple sources, drops those users from the dataframe, and creates a new dataframe ‘dfa’.
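A rough sketch of this filtering step, under the same column-name assumptions as before:

# Users attributed to more than one traffic source
sources_per_uid = df.groupby('uid')['trafficSource'].nunique()
multi_source_uids = sources_per_uid[sources_per_uid > 1].index

# Drop those users; keep the rest in a new dataframe 'dfa'
dfa = df[~df['uid'].isin(multi_source_uids)].copy()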

Feature Engineering:
In this section, I have feature engineered 2 different metrics to find out the differences in usage based on the traffic source. Here are the metrics:

  1. Total number of unique active users per source per day
    The first metric is a purely quantitative metric calculated on the basis of how many users we are getting from each source and their activity per day throughout the month. The below code calculates this metric and plots a graph to visualize the results.
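A hedged sketch of how this metric could be computed and plotted on the cleaned dataframe dfa:

# Unique active users per source per day
dau_per_source = (dfa.groupby(['date', 'trafficSource'])['uid']
                     .nunique()
                     .reset_index())

plt.figure(figsize=(14, 6))
sns.lineplot(data=dau_per_source, x='date', y='uid', hue='trafficSource')
plt.ylabel('Unique active users')
plt.show()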
Figure 6. Total number of unique active users per source per day

Figure 6 plots this information as a line plot. From the plot, one can see that the biznesowe+rewolucje and undefined sources bring in the most users, but there is a dip in usage on weekends. Sources like program, answers, shmoop, twitter, grub+street, and handbook have constant usage throughout the month, but they contribute few unique users.

2. Number of days active per source
The second metric I calculated is the number of days active per source. For this, I grouped the data by traffic source and uid and counted the number of unique dates, which gives the number of days each uid from each traffic source was active. I plotted this information as a KDE graph (sketched below); on analyzing the graph, it is evident that the distribution for all sources is bimodal, with peaks near 0 and 29 days. The best traffic sources can be defined as the ones with a peak at 29 and the worst ones as those with a peak at 0.
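A rough sketch of this computation and the KDE plot, again assuming the dfa dataframe and column names described above:

# Number of distinct active days for each user, per traffic source
days_active = (dfa.groupby(['trafficSource', 'uid'])['date']
                  .nunique()
                  .reset_index(name='daysActive'))

plt.figure(figsize=(14, 6))
for source, grp in days_active.groupby('trafficSource'):
    sns.kdeplot(grp['daysActive'], label=source)
plt.xlabel('Days active in February')
plt.legend()
plt.show()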

Figure 7. Number of days active/source

Figure 7 shows a KDE graph of the number of days active per source for the month of February. From the graph, it can be seen that the best sources, with a mode at 29 (most of the users from these sources used the app for 29 days), are shmoop and twitter, closely followed by program, salesmanago, and grub+street with peaks at 29, 28, and 27 respectively. The worst source is undefined, with a mode of 0 despite bringing in the most users, followed by answers and biznesowe+rewolucje. If I were to rank the traffic sources from best to worst based on this graph, the sequence would be: shmoop, twitter, program, salesmanago, grub+street, other, handbook, mosalingua+fr, biznesowe+rewolucje, answers, and undefined.

Analysis

User behavior depends on the kind of metric that is important for a business. For some businesses, daily activity (pings) is the important metric; for others, more activity on certain days of the month carries more weight than daily activity. One would define the best and worst users based on what is important for the product/organization.

Summary

If the total number of unique active users is an important metric for the product, then the first graph can be used to see which sources are best or worst: a higher number of unique users indicates a better traffic source.

But if we want to look at activity over the month and analyze how many days the users from a particular source were active, then the second metric becomes important. In this case, we found that even if a source (e.g., shmoop, twitter) brings in fewer unique active users per day, the users it does bring use the app over a longer period of time.

Conclusions

In this article, I showed how to carry out cohort analysis using Python’s pandas, matplotlib, and seaborn. During the analysis, I made some simplifying assumptions, mostly due to the nature of the dataset. When working on real data, we would have a better understanding of the business and could draw more meaningful conclusions from the analysis.

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me in the comments.
