Cohort Analysis for Customer Transaction Data

Algorithm 4 Fun
4 min readApr 23, 2022

--

Image by pixabay

Digital Marketing is the most sought after role in today’s online ecosystem, and I believe in each of you who work in the field have User Retentions as one of the main KPI.

But with abundant data, Excel files will be less effective in analyzing campaign results.

So, in this article, I’ll show how your Data/BI Analyst or maybe people from the marketing team can utilize Python for measuring user retention rate.

Guidelines :

  1. Data Cleaning and Data Manipulation
  2. Create Cohort Month and Index
  3. Generate User Retention Summary Table
  4. Visualize the Result
  5. Data Cleaning and Manipulation
!pip install openpyxldf = pd.read_excel('/kaggle/input/customer-transaction-dataset/KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = 'Transactions')df.head()
Snapshot of Dataframe

Well, as you guys can see, from the data above, the first row at index 0 is the column name and we want to replace the current column with the first row.

df.columns = df.iloc[0]
df.drop(df.index[0],inplace=True, axis = 0)
df.head()
Snapshot After Replacing Columns

Now, the table looks like what we are familiar with, then moving on, we’ll select a few columns from the data above for our Cohort Analysis purposes.

#Get the necessary columns for Cohort Analysisdf_final = df[['customer_id','transaction_date','online_order','order_status']]
df_final.head()

Since we want to measure user retention from users who have successfully completed a transaction, we’ll filter the data with ‘Approved’ status and remove duplicate values If any.

df_final = df_final[df_final['order_status'] == 'Approved']
df_final = df_final[~df_final.duplicated()]
df_final.info()

All set, and now we’ll create our Cohort Month and Index before calculating our user retention.

2. Create Cohort Month and Index

#Get Transaction Month for the dataframe
import datetime as dt
def get_month(x) :
return dt.datetime(x.year, x.month,1)
df_final['transaction_date'] = pd.to_datetime(df['transaction_date'])
df_final['transaction_month'] = df['transaction_date'].apply(get_month)
#Create Cohort Month per Rowsgroup = df_final.groupby('customer_id')['transaction_month']
df_final['cohort_month'] = group.transform('min')
#Calculate Cohort Index for Each Rowsdef get_date_int(df, column) :
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day

return year, month, day
transaction_year, transaction_month, transaction_day = get_date_int(df_final, 'transaction_month')
cohort_year, cohort_month, cohort_day = get_date_int(df_final,'cohort_month')
#Calculate Year Differences
years_diff = transaction_year - cohort_year
#Calculate Month Differences
months_diff = transaction_month - cohort_month
df_final['cohort_index'] = years_diff*12 + months_diff + 1df_final.head()

So, here are the step by step approach from the above code :

  1. Getting the ‘Month’ Values from each transaction_date data and Create a new column named ‘transaction_month’
  2. Generate ‘cohort_month’ column based on each customer id first transaction month
  3. Calculating the Differences between ‘transaction_month’ and ‘cohort_month’ and store the values into ‘cohort_index’ ( and the +1 is used for indicating the first month of the transaction instead of using 0 )

All of the necessary data points for cohort analysis have been made and the next step is to calculate the retention rate per cohort index.

3. Generate User Retention Summary Table

#Final Grouping to Calculate Total Unique Users in Each Cohort
cohort_group = df_final.groupby(['cohort_month','cohort_index'])
cohort_data = cohort_group['customer_id'].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot_table(index = 'cohort_month',
columns = 'cohort_index',
values = 'customer_id'
)
cohort_data.head()

4. Visualize the Result

#Calculate Retention rate per Month Indexcohort_size = cohort_counts.iloc[:,0]retention = cohort_counts.divide(cohort_size, axis = 0)retention = retention.round(3)*100retention.index = retention.index.strftime('%Y-%m')#Plotting Heatmap for Retention Tableimport matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize = (16,10))plt.title('MoM Retention Rate for Customer Transaction Data')sns.heatmap(retention, annot = True, cmap="YlGnBu", fmt='g')plt.xlabel('Cohort Index')
plt.ylabel('Cohort Month')
plt.yticks(rotation = '360')
plt.show()

The Followings are the processes based on the above code :

  1. Calculate Total Unique User based on Its First Transaction Date and Index per Month
  2. Generating Pivot Table from the calculated table and divided all the values based on total users in the first cohort month
  3. Finally, Visualize the final table using heat map to better understand the data.

That’s the end of this article, I hope that you guys can truly utilize the above code for your marketing campaign purposes.

Cohort Analysis has been used by many tech companies to determine whether the campaign is successful or not.

Again, thank you for giving me your support by clapping and saving my stories to your personal read.

Feel free to reach me out or giving comments down below so I can better understand what you guys want to read moving forward.

Thank you !

--

--

Algorithm 4 Fun

Data Science & Analytics | Bringing Data into Actions | Reach me out for Data Discussion !