PropensityBoost: E-commerce Purchase Optimization

Propensity Modelling:

Pradyumna
11 min read · May 3, 2024

Introduction:

Ever visited an e-commerce platform, added items to your cart, but then closed the tab without buying anything? You’re not alone. E-commerce companies face a big puzzle: how to turn those curious visitors into paying customers. To increase purchases, e-commerce companies generally send discounts or coupons to motivate users to buy. But small e-commerce startups have only limited funds for such discount campaigns, so they need to spend those funds carefully.

In this blog, we are going to help them solve this problem by building a propensity model through RFM modelling to predict the probability of a customer buying the products added to their cart within the next 2 hours. This enables us to target the customers with lower propensity and offer them deals to encourage them to complete the purchase.
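To make the end goal concrete, here is an illustrative sketch of how such a model could drive the coupon campaign. The 0.4 threshold and the send_coupon helper are hypothetical, not part of the project:

# illustrative targeting logic; the threshold and send_coupon are hypothetical
def send_coupon(user_id):
    # hypothetical stub: in production this would call the campaign service
    print(f"coupon sent to user {user_id}")

def target_user(model, x_row, user_id, threshold=0.4):
    # probability that the user buys the carted item within the next 2 hours
    p_buy = model.predict_proba(x_row)[0, 1]
    # spend the limited discount budget only on low-propensity users
    if p_buy < threshold:
        send_coupon(user_id)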

Data:

To begin with, let’s take a look at what data we have.

# importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
pd.set_option("display.max_rows",500)
pd.set_option("display.max_columns",500)

# loading the data
df=pd.read_excel("data/customer_data.xlsx")

# printing the first 3 rows
print(df.head(3))

# checking the data shape
print(df.shape)

# basic check
print(df.info())

We have 2090 rows. Now let’s look at each column.

The dataset contains the following features:

  1. User_id: Unique ID for each user
  2. Session_id: Unique ID for the user’s session
  3. DateTime: Timestamp
  4. Category: The product category
  5. SubCategory: The product subcategory
  6. Action: The action performed by the user on the e-commerce platform
  7. Quantity: The quantity of products bought
  8. Rate: Price of a single unit of the product
  9. Total Price: Total price spent by the user in purchasing the product

Exploratory Data Analysis

Please click here to view the detailed EDA of this dataset

RFM Modelling

RFM stands for Recency, Frequency, and Monetary value, each corresponding to some key customer trait. These RFM metrics are important indicators of a customer’s behavior because frequency and monetary value affect a customer’s lifetime value, and recency affects retention, a measure of engagement.

The RFM features of customers illustrate these characteristics:

1. Recency: The more recently they purchase, the more responsive they are to promotions.

2. Frequency: The more frequently they purchase, the more engaged and satisfied they are.

3. Monetary: Monetary value helps differentiate high spenders from low spenders.

# defining a function to generate RFM features

def get_rfm_features(data, cust_id, invoice_date, total_sales):
    # creating a copy of the dataframe
    temp = data.copy()
    # converting the customer id column to object type
    temp[cust_id] = temp[cust_id].astype("object")

    """ RECENCY """
    # computing the last purchase date of each user
    temp['LastPurchase'] = temp.groupby(cust_id)[invoice_date].transform(lambda x: x.max(skipna=True))
    # taking the day after the last invoice in the data as the reference date
    last_invoice = temp[invoice_date].max() + pd.Timedelta(days=1)
    # computing the recency feature
    temp['Recency'] = (last_invoice - temp['LastPurchase']).dt.days
    # creating a separate dataframe which contains the recency of each user
    users_recency = temp[[cust_id, "Recency"]].drop_duplicates().sort_values(by=cust_id).set_index(cust_id)

    """ FREQUENCY """
    # creating a dataframe which contains the frequency feature of each user
    users_frequency = pd.DataFrame(temp[cust_id].value_counts().sort_index()).rename(columns={"count": "Frequency"})

    """ MONETARY """
    users_monetary = pd.DataFrame(temp.groupby(cust_id)[total_sales].sum().sort_index()).rename(columns={total_sales: "Monetary"})

    """ RFM DataFrame """
    # joining the Recency and Frequency dataframes
    RFM = pd.merge(users_recency, users_frequency, left_index=True, right_index=True)
    # joining with the Monetary dataframe
    RFM = pd.merge(RFM, users_monetary, left_index=True, right_index=True).reset_index()

    return RFM

RFM = get_rfm_features(data=df[df["Action"] == "purchase"], cust_id="User_id", invoice_date="DateTime", total_sales="Total Price")

Let’s see the data frame generated by the above function

In this data frame we have the RFM features for each user. Let’s analyze them.

# Analyzing the RFM attributes
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot Recency distribution (histplot with kde replaces the deprecated distplot)
sns.histplot(RFM['Recency'], kde=True, ax=axes[0])
axes[0].set_title('Recency')

# Plot Frequency distribution
sns.histplot(RFM['Frequency'], kde=True, ax=axes[1])
axes[1].set_title('Frequency')

# Plot Monetary distribution
sns.histplot(RFM['Monetary'], kde=True, ax=axes[2])
axes[2].set_title('Monetary')

plt.show()

# Analyzing the RFM attributes
RFM.drop(columns=['User_id']).describe()

Observations
1. The minimum value of Recency is 1 day and the maximum is 345 days
2. The minimum value of Frequency is 1 and the maximum is 3
3. The 75th percentile of Frequency is 1, which means most users purchase only once
4. The minimum value of Monetary is 150 and the maximum is 160000

RFM Ranking

To conduct the RFM analysis, we need to rank our customers/users on each RFM attribute separately. In this project we will score each RFM attribute of a user from 1 to 4.

Scores from 1 to 4 are assigned based on the attribute’s value:

  1. Recency (R score)
  • if the attribute value is less than or equal to the 25th percentile, assign the highest R score of 4 (more recent corresponds to a higher R score)
  • if the attribute value falls between the 25th and 50th percentiles, assign an R score of 3
  • if the attribute value falls between the 50th and 75th percentiles, assign an R score of 2
  • if the attribute value falls between the 75th and 100th percentiles, assign an R score of 1 (less recent corresponds to a lower R score)

2. Frequency (F score)

  • if the attribute value is less than or equal to the 25th percentile, assign the lowest F score of 1 (lower frequency corresponds to a lower F score)
  • if the attribute value falls between the 25th and 50th percentiles, assign an F score of 2
  • if the attribute value falls between the 50th and 75th percentiles, assign an F score of 3
  • if the attribute value falls between the 75th and 100th percentiles, assign an F score of 4 (higher frequency corresponds to a higher F score)

3. Monetary (M score)

  • if the attribute value is less than or equal to the 25th percentile, assign the lowest M score of 1 (lower monetary value corresponds to a lower M score)
  • if the attribute value falls between the 25th and 50th percentiles, assign an M score of 2
  • if the attribute value falls between the 50th and 75th percentiles, assign an M score of 3
  • if the attribute value falls between the 75th and 100th percentiles, assign an M score of 4 (higher monetary value corresponds to a higher M score)
# defining a function which assigns the score based on the quantile thresholds
def assign_score(x, quants, attribute):
    percentile25 = quants[0.25]
    percentile50 = quants[0.5]
    percentile75 = quants[0.75]

    # low recency value ==> high R score
    if attribute == "Recency":
        if x <= percentile25:
            return 4
        elif (x > percentile25) & (x <= percentile50):
            return 3
        elif (x > percentile50) & (x <= percentile75):
            return 2
        else:
            return 1

    # low frequency/monetary ==> low F/M score
    elif attribute in ("Frequency", "Monetary"):
        if x <= percentile25:
            return 1
        elif (x > percentile25) & (x <= percentile50):
            return 2
        elif (x > percentile50) & (x <= percentile75):
            return 3
        else:
            return 4


# we need to assign a score from 1-4, so we need 3 quantiles for each RFM attribute
recency_quantiles = RFM['Recency'].quantile([.25, .50, .75]).to_dict()
frequency_quantiles = RFM['Frequency'].quantile([.25, .50, .75]).to_dict()
monetary_quantiles = RFM['Monetary'].quantile([.25, .50, .75]).to_dict()

# assigning the R score for Recency
RFM['R'] = RFM['Recency'].apply(lambda x: assign_score(x, recency_quantiles, "Recency"))

# assigning the F score for Frequency
RFM['F'] = RFM['Frequency'].apply(lambda x: assign_score(x, frequency_quantiles, "Frequency"))

# assigning the M score for Monetary
RFM['M'] = RFM['Monetary'].apply(lambda x: assign_score(x, monetary_quantiles, "Monetary"))

# printing the first 5 rows of RFM
print(RFM.head())

We have 4 possible scores for each attribute and we have 3 attributes, so there are 4 × 4 × 4 = 64 possible combinations.
For example, if a customer has R, F, M scores of 1, 1, 1 then they fall into the 111 group; in this way each user falls into one of the 64 possible combinations (groups).

# finding which group each user falls into
RFM['Group'] = RFM['R'].apply(str) + RFM['F'].apply(str) + RFM['M'].apply(str)
RFM.head()

We have 64 possible groups; now we need to create 4 tiers, Bronze, Silver, Gold, and Platinum, by combining these 64 groups based on the overall RFM score.

# creating RFM scores
RFM['Score'] = RFM[['R','F','M']].sum(axis=1)
RFM.head(5)

Now, based on the Score, we need to categorize our customers into 4 categories.

# defining our 4 groups in terms of loyalty
loyalty = ["Bronze", "Silver", "Gold", "Platinum"]
# creating a column called Loyalty
RFM['Loyalty'] = pd.qcut(RFM['Score'], q=4, labels=loyalty)
RFM.head(5)

# Get value counts of Loyalty column
loyalty_counts = RFM['Loyalty'].value_counts(normalize=True)

# Plotting the pie chart
plt.figure(figsize=(5, 5))
plt.pie(loyalty_counts, labels=loyalty_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Loyalty')
plt.axis('equal')
plt.show()

# analyzing the behaviour of these groups
behaviour = RFM.groupby("Loyalty")[["Recency", "Frequency", "Monetary"]].mean()

# adding these RFM features to our dataset
merged_df = pd.merge(df, RFM, on="User_id", how="left")
merged_df.to_csv("data/customer_data_RFM_features.csv", index=False)
merged_df.head(5)

Training Data

We cannot train a model on the raw data, so let’s create a dataset for model training where each row is an “add_to_cart” action performed by a user. We set the label to 1 if the user bought the item that was added to the cart within the next 2 hours; otherwise we set the label to 0.

# function to get the label for a row
def get_label(lst):
    # a single timestamp means the user only added to cart and never purchased
    if len(lst) == 1:
        return 0
    # Convert strings to datetime objects
    cart_time = pd.to_datetime(lst[0])
    purchase_time = pd.to_datetime(lst[1])
    # Calculate the time difference
    time_diff = purchase_time - cart_time
    # Define a timedelta representing 2 hours
    two_hours = pd.Timedelta(hours=2)
    # Check if the time difference is less than or equal to 2 hours
    if time_diff <= two_hours:
        return 1
    else:
        return 0



# function to build the training dataset
def get_dataset(data):
    dff = data.copy()
    # filtering the data for only the purchase and add_to_cart actions
    data = dff[dff['Action'].isin(["purchase", "add_to_cart"])].copy()
    data.sort_values(by=["User_id", "DateTime"], inplace=True)

    # grouping rows by "User_id", "Category", "SubCategory" and collecting the action list and the respective datetime list
    data_grouped = data.groupby(["User_id", "Category", "SubCategory"]).agg({
        "Action": list,
        "DateTime": lambda x: x.tolist()
    })
    data_grouped.reset_index(inplace=True)

    # keeping only the rows where an add_to_cart action stands alone or is followed by a purchase action
    filtered_data = data_grouped[data_grouped['Action'].apply(lambda x: (x == ['add_to_cart', 'purchase']) or (x == ['add_to_cart']))].copy()
    # labelling the row 1 if the gap between the two timestamps is at most 2 hrs
    filtered_data["Label"] = filtered_data["DateTime"].apply(get_label)

    # dropping the unwanted columns
    filtered_data.drop(columns=["Action", "DateTime"], inplace=True)

    return filtered_data

# getting the dataset
df_base = get_dataset(df)
print(df_base)

Feature Engineering

We have completed the dataset creation, so let’s engineer some additional features now.

RFM features

# adding RFM features for the users (merged_df already holds R, F, M and Loyalty for each event row)
users_rfm_features = merged_df.groupby("User_id").agg({"R": "max", "F": "max", "M": "max", "Loyalty": "max"})
df_base = pd.merge(df_base, users_rfm_features, on="User_id", how="left")
df_base.head()

User activity features

  1. No of days active
  2. Avg time between purchases
  3. Cart to purchase ratio
  4. Wishlist to purchase ratio
  5. Paths (path taken by the user in our platform)
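The blog does not show the code for these features, so here is a minimal sketch, under assumed definitions, of how two of them could be computed from the raw event log (the column and action names follow the dataset above; the project’s exact definitions may differ):

# sketch (assumed definitions) of two user-activity features
# No of days active: distinct calendar days on which the user produced any event
days_active = df.assign(day=df["DateTime"].dt.date).groupby("User_id")["day"].nunique().rename("days_active")

# Cart to purchase ratio: purchases per add_to_cart event for each user
action_counts = df.pivot_table(index="User_id", columns="Action", aggfunc="size", fill_value=0)
cart_to_purchase = action_counts["purchase"] / action_counts["add_to_cart"].replace(0, np.nan)

# these Series can then be merged into df_base on User_id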

Category and SubCategory level features

  1. Cart to Purchase ratio (Category)
  2. Cart to Purchase ratio (SubCategory)
  3. Wishlist to Purchase ratio (Category)
  4. Wishlist to Purchase ratio (SubCategory)
  5. Click Wishlist to Purchase ratio (Category)
  6. Click Wishlist to Purchase ratio (SubCategory)
  7. View to Purchase ratio (Category)
  8. View to Purchase ratio (SubCategory)
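The code for these ratios is likewise not shown in the blog; a minimal sketch of the Category-level cart-to-purchase ratio, under the same assumed definitions as above, might look like this:

# sketch (assumed definition) of the cart-to-purchase ratio at the Category level
cat_actions = df.pivot_table(index="Category", columns="Action", aggfunc="size", fill_value=0)
cart_to_purchase_category = cat_actions["purchase"] / cat_actions["add_to_cart"].replace(0, np.nan)

# the same pattern applies at the SubCategory level and to the wishlist,
# click-wishlist, and view actions by swapping the index or the numerator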

Let’s look at the dataset after feature engineering

Model Building

We have our dataset ready for training so let’s now train the propensity model

import sklearn
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import f1_score,confusion_matrix

# splitting the features and target (Label lives in df_base, the dataset we built above)
y = df_base['Label']
x = df_base.drop(columns=['Label'])

# train test splitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Categorical Encoding strategy

  • The Category and SubCategory features have 89 and 232 categories respectively, so one-hot encoding them would blow up the dimensionality; let's encode them using target encoding instead
  • Loyalty and user_path have only a few categories, so let's encode them using OHE
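As a quick illustration of why target encoding suits high-cardinality columns: it replaces each category with a (smoothed) mean of the target within that category, yielding a single numeric column instead of hundreds of one-hot columns. A toy sketch with category_encoders (the data here is made up, not the project dataset):

from category_encoders import TargetEncoder

# toy example: one high-cardinality column and a binary label
toy_x = pd.DataFrame({"Category": ["shoes", "shoes", "books", "toys", "books"]})
toy_y = pd.Series([1, 0, 1, 0, 1])

# each category is replaced by a smoothed mean of the label within that category
te = TargetEncoder()
print(te.fit_transform(toy_x, toy_y))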

Imputation

Let's build a ColumnTransformer which imputes missing values according to the column

# defining the list of cols to impute with -1
cols_to_impute_with_minus_1 = ["Avg Purchase Gap"]
# defining the list of cols to impute with 0
cols_to_impute_with_0 = ['days_active', 'R', 'F', 'M', 'add_to_cart_to_purchase_ratios', 'add_to_wishlist_to_purchase_ratios', 'click_wishlist_page_to_purchase_ratios', 'cart_to_purchase_ratios_category', 'cart_to_purchase_ratios_subcategory', 'wishlist_to_purchase_ratios_category', 'wishlist_to_purchase_ratios_subcategory', 'click_wishlist_to_purchase_ratios_category', 'click_wishlist_to_purchase_ratios_subcategory', 'product_view_to_purchase_ratios_category', 'product_view_to_purchase_ratios_subcategory']
# defining the list of categorical cols to one-hot encode
ohe_cols = ["Loyalty", "user_path"]
# defining the list of categorical cols to target encode
target_encod_cols = ['Category', 'SubCategory']
# collecting the categorical and numerical cols (used below)
cat_cols = ohe_cols + target_encod_cols
num_cols = cols_to_impute_with_0 + cols_to_impute_with_minus_1

# defining the imputer object
imputer = ColumnTransformer([
    ('imputer_0', SimpleImputer(strategy='constant', fill_value=0), cols_to_impute_with_0),
    ('imputer_minus_1', SimpleImputer(strategy='constant', fill_value=-1), cols_to_impute_with_minus_1),
    ('imputer_categorical', SimpleImputer(strategy='most_frequent'), cat_cols),
])
imputer.set_output(transform="pandas")

Encoding

# getting the names of the imputer's (ColumnTransformer) transformers, because each name is added as a prefix to the corresponding columns of the dataframe the imputer returns
numerical_prefix = imputer.transformers[0][0] + "__"
purchase_prefix = imputer.transformers[1][0] + "__"
categorical_prefix = imputer.transformers[2][0] + "__"

# prefixing the cols
prefixed_ohe_cols = [categorical_prefix + col for col in ohe_cols]
prefixed_target_encod_cols = [categorical_prefix + col for col in target_encod_cols]

prefixed_num_cols = [numerical_prefix + col for col in num_cols]
# "Avg Purchase Gap" was imputed with -1, so its prefix must be imputer_minus_1__ instead of imputer_0__
prefixed_num_cols = [f"{purchase_prefix}Avg Purchase Gap" if i == f"{numerical_prefix}Avg Purchase Gap" else i for i in prefixed_num_cols]

# defining the encoder for the categorical and numerical cols
# (sparse_output replaces the deprecated sparse argument in recent scikit-learn)
encoder = ColumnTransformer([
    ('scaler_numeric', StandardScaler(), prefixed_num_cols),
    ('ohe_encoder', OneHotEncoder(sparse_output=False), prefixed_ohe_cols),
    ('target_encoders', TargetEncoder(), prefixed_target_encod_cols)
])
encoder.set_output(transform="pandas")

Preprocessor pipeline

# creating a pipeline which consists of the imputer followed by the encoder
preprocessor = Pipeline([
    ('imputer', imputer),
    ('encoder', encoder)
])
preprocessor.set_output(transform="pandas")

# applying the preprocessor pipeline to x_train and x_test
x_train = preprocessor.fit_transform(x_train, y_train)
x_test = preprocessor.transform(x_test)

Training

models = {'Logistic Regression': LogisticRegression(),
          'Decision Tree': DecisionTreeClassifier(),
          'KNN': KNeighborsClassifier(),
          'RF': RandomForestClassifier(),
          'GBDT': GradientBoostingClassifier(),
          'XGBoost': XGBClassifier()
          }

for name, model in models.items():
    # fitting the model
    model.fit(x_train, y_train)
    # predicting on the test data
    y_test_pred = model.predict(x_test)
    # printing the f1_score
    print(f"The f1 score of {name} is {f1_score(y_test, y_test_pred)}\n")
    # computing the confusion matrix
    conf_matrix = confusion_matrix(y_test, y_test_pred)
    # plotting the confusion matrix as a heatmap
    plt.figure(figsize=(5, 3))
    sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.title('Confusion Matrix')
    plt.show()

The best model after training is the Random Forest, with an F1 score of 0.91.

  • If the model predicts 0 for a true label of 1, it predicts that the user won’t buy within the next 2 hrs when they actually would; we would spend money on a person who was going to purchase anyway, which wastes our budget, so the FN rate should be very minimal
  • If the model predicts 1 for a true label of 0, it predicts that the user will buy within the next 2 hrs when they actually won’t; we would miss sending them offers and coupons and therefore miss a target, so the FP rate should also be minimal

Business Objective: it’s okay to miss some people, but we should not waste money pursuing the wrong customers, so the FN rate should be minimized much more heavily than the FP rate.

From the confusion matrices above, it is observed that Random Forest minimizes both FN and FP the most, so let’s choose it as the final model.
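To make that comparison concrete, the FN and FP rates can be read off each model’s confusion matrix; a small sketch reusing y_test and y_test_pred from the training loop above:

# computing FN and FP rates from the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
fn_rate = fn / (fn + tp)  # share of would-be buyers we would wrongly send coupons to
fp_rate = fp / (fp + tn)  # share of non-buyers we would wrongly skip
print(f"FN rate: {fn_rate:.3f}, FP rate: {fp_rate:.3f}")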

Check out the API at http://3.94.103.223:5000/

Check out the complete project with CI/CD pipelines at https://github.com/prxdyu/customer_propensity_modelling

Check out the notebooks for training and RFM modeling at https://github.com/prxdyu/customer_propensity_modelling/tree/main/notebooks
