Cohort Analysis for Customer Transaction Data

4 min readApr 23, 2022

Digital Marketing is the most sought after role in today’s online ecosystem, and I believe in each of you who work in the field have User Retentions as one of the main KPI.

But with abundant data, Excel files will be less effective in analyzing campaign results.

So, in this article, I’ll show how your Data/BI Analyst or maybe people from the marketing team can utilize Python for measuring user retention rate.

Guidelines :

Data Cleaning and Data Manipulation
Create Cohort Month and Index
Generate User Retention Summary Table
Visualize the Result
Data Cleaning and Manipulation

!pip install openpyxldf = pd.read_excel('/kaggle/input/customer-transaction-dataset/KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = 'Transactions')df.head()

Well, as you guys can see, from the data above, the first row at index 0 is the column name and we want to replace the current column with the first row.

df.columns = df.iloc[0]
df.drop(df.index[0],inplace=True, axis = 0)df.head()

Now, the table looks like what we are familiar with, then moving on, we’ll select a few columns from the data above for our Cohort Analysis purposes.

#Get the necessary columns for Cohort Analysisdf_final = df[['customer_id','transaction_date','online_order','order_status']]
df_final.head()

Since we want to measure user retention from users who have successfully completed a transaction, we’ll filter the data with ‘Approved’ status and remove duplicate values If any.

df_final = df_final[df_final['order_status'] == 'Approved']
df_final = df_final[~df_final.duplicated()]
df_final.info()

All set, and now we’ll create our Cohort Month and Index before calculating our user retention.

2. Create Cohort Month and Index

#Get Transaction Month for the dataframe
import datetime as dtdef get_month(x) :
    return dt.datetime(x.year, x.month,1)df_final['transaction_date'] = pd.to_datetime(df['transaction_date'])
df_final['transaction_month'] = df['transaction_date'].apply(get_month)#Create Cohort Month per Rowsgroup = df_final.groupby('customer_id')['transaction_month']
df_final['cohort_month'] = group.transform('min')#Calculate Cohort Index for Each Rowsdef get_date_int(df, column) :
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    
    return year, month, daytransaction_year, transaction_month, transaction_day = get_date_int(df_final, 'transaction_month')
cohort_year, cohort_month, cohort_day = get_date_int(df_final,'cohort_month')#Calculate Year Differences
years_diff = transaction_year - cohort_year#Calculate Month Differences
months_diff = transaction_month - cohort_monthdf_final['cohort_index'] = years_diff*12 + months_diff + 1df_final.head()

So, here are the step by step approach from the above code :

Getting the ‘Month’ Values from each transaction_date data and Create a new column named ‘transaction_month’
Generate ‘cohort_month’ column based on each customer id first transaction month
Calculating the Differences between ‘transaction_month’ and ‘cohort_month’ and store the values into ‘cohort_index’ ( and the +1 is used for indicating the first month of the transaction instead of using 0 )

All of the necessary data points for cohort analysis have been made and the next step is to calculate the retention rate per cohort index.

3. Generate User Retention Summary Table

#Final Grouping to Calculate Total Unique Users in Each Cohort
cohort_group = df_final.groupby(['cohort_month','cohort_index'])cohort_data = cohort_group['customer_id'].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()cohort_counts = cohort_data.pivot_table(index = 'cohort_month',
                                        columns = 'cohort_index',
                                        values = 'customer_id'   
                                       )cohort_data.head()

4. Visualize the Result

#Calculate Retention rate per Month Indexcohort_size = cohort_counts.iloc[:,0]retention = cohort_counts.divide(cohort_size, axis = 0)retention = retention.round(3)*100retention.index = retention.index.strftime('%Y-%m')#Plotting Heatmap for Retention Tableimport matplotlib.pyplot as plt
import seaborn as snsplt.figure(figsize = (16,10))plt.title('MoM Retention Rate for Customer Transaction Data')sns.heatmap(retention, annot = True, cmap="YlGnBu", fmt='g')plt.xlabel('Cohort Index')
plt.ylabel('Cohort Month')
plt.yticks(rotation = '360')plt.show()

The Followings are the processes based on the above code :

Calculate Total Unique User based on Its First Transaction Date and Index per Month
Generating Pivot Table from the calculated table and divided all the values based on total users in the first cohort month
Finally, Visualize the final table using heat map to better understand the data.

That’s the end of this article, I hope that you guys can truly utilize the above code for your marketing campaign purposes.

Cohort Analysis has been used by many tech companies to determine whether the campaign is successful or not.

Again, thank you for giving me your support by clapping and saving my stories to your personal read.

Feel free to reach me out or giving comments down below so I can better understand what you guys want to read moving forward.

Thank you !

Customer Transaction - Cohort Analysis

Explore and run machine learning code with Kaggle Notebooks | Using data from Customer_transaction_dataset

www.kaggle.com

Mlearning.ai Submission Suggestions

How to become a writer on Mlearning.ai

medium.com

Cohort Analysis for Customer Transaction Data

Customer Transaction - Cohort Analysis

Explore and run machine learning code with Kaggle Notebooks | Using data from Customer_transaction_dataset

Mlearning.ai Submission Suggestions

How to become a writer on Mlearning.ai

Written by Algorithm 4 Fun