How do the customers react to Black Friday Sale?

Tarun Kumar
8 min read · Feb 4, 2022


A Machine Learning approach to predicting sales!

In this project, we are going to predict how much customers will spend during Black Friday, using features such as age, gender, and marital status. The dataset we will use is the Black Friday dataset from Kaggle, which contains about 550,068 rows and 12 features and can be downloaded here. We will follow all the steps of a Data Science lifecycle, from data collection to model deployment.


Motivation:

Predicting customer behavior is one of the most popular applications of Machine Learning in fields like Finance, Sales, and Marketing. By building such predictive models, we can estimate the impact of the decisions we take on the growth of our organization.

Understanding the problem:

Before we start, it is important to understand the problem so that we can select the type of algorithm that can learn from the dataset. The dataset contains the label we have to predict, the dependent feature ‘Purchase’, and its data type is continuous. So the problem we have is a Supervised Regression problem.

Step 0: Import libraries and dataset

All the standard libraries (numpy, pandas, matplotlib, and seaborn) are imported in this step. We use numpy for linear algebra operations, pandas for data frames, and matplotlib and seaborn for plotting graphs. The dataset is imported using the pandas command read_csv().

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing dataset
train = pd.read_csv('train.csv')

Step 1: Descriptive analysis

# Preview dataset
train.head()
Dataset Preview
# Dataset dimensions - (rows, columns)
print('Rows: {} Columns: {}'.format(train.shape[0], train.shape[1]))

Output:
Rows: 550068 Columns: 12

# Features data-type
train.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 550068 non-null int64
1 Product_ID 550068 non-null object
2 Gender 550068 non-null object
3 Age 550068 non-null object
4 Occupation 550068 non-null int64
5 City_Category 550068 non-null object
6 Stay_In_Current_City_Years 550068 non-null object
7 Marital_Status 550068 non-null int64
8 Product_Category_1 550068 non-null int64
9 Product_Category_2 376430 non-null float64
10 Product_Category_3 166821 non-null float64
11 Purchase 550068 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB

# Statistical summary
train.describe().T
Dataset Description

# Checking for Null values
round((train.isnull().sum()/train.shape[0])*100,2).astype(str)+ ' %'

Output:

User_ID 0.0 %
Product_ID 0.0 %
Gender 0.0 %
Age 0.0 %
Occupation 0.0 %
City_Category 0.0 %
Stay_In_Current_City_Years 0.0 %
Marital_Status 0.0 %
Product_Category_1 0.0 %
Product_Category_2 31.57 %
Product_Category_3 69.67 %
Purchase 0.0 %
dtype: object

# Checking the counts of unique values
round((train['Age'].value_counts(normalize = True).mul(100)), 2).astype(str) + ' %'

Output:

26-35 39.92 %
36-45 20.0 %
18-25 18.12 %
46-50 8.31 %
51-55 7.0 %
55+ 3.91 %
0-17 2.75 %
Name: Age, dtype: object

# Checking the counts of unique values
round((train['Stay_In_Current_City_Years'].value_counts(normalize = True).mul(100)), 2).astype(str) + ' %'

Output:

1 35.24 %
2 18.51 %
3 17.32 %
4+ 15.4 %
0 13.53 %
Name: Stay_In_Current_City_Years, dtype: object

Observations:

1. The feature ‘Product_Category_2’ contains 31.57% null values, which can be imputed, whereas ‘Product_Category_3’ contains 69.67% null values, so we can drop this feature.

2. The features ‘Age’ and ‘Stay_In_Current_City_Years’ contain some values with a ‘+’ in them, which need to be replaced.

Step 2: Exploratory Data Analysis

2.1 Univariate Analysis:

2.2 Bivariate Analysis:

2.3 Multivariate Analysis:
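
The plots for these three subsections were shown as images in the original post. As a rough sketch (using the seaborn and matplotlib imports from Step 0, before any encoding is applied), plots of this kind can be produced as follows:

# Univariate analysis: distribution of a single feature, e.g. gender
sns.countplot(x = 'Gender', data = train)
plt.show()

# Bivariate analysis: purchase amount across age groups
sns.boxplot(x = 'Age', y = 'Purchase', data = train)
plt.show()

# Multivariate analysis: correlation heatmap of the numeric features
sns.heatmap(train.select_dtypes(include = np.number).corr(), annot = True, cmap = 'coolwarm')
plt.show()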

Observations:

1. An interesting observation from the gender distribution plot is that fewer women than men shopped during Black Friday.

2. From the correlation heatmap, we can observe that the dependent feature ‘Purchase’ is highly correlated with ‘Product_Category_1’ and ‘Product_Category_2’.

Step 3: Data preprocessing

The ‘+’ value in ‘Age’ and ‘Stay_In_Current_City_Years’ needs to be fixed which can be done by using the .replace() command.

train['Age'] = train['Age'].apply(lambda x: str(x).replace('55+', '55'))
train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].apply(lambda x: str(x).replace('4+', '4'))
train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].astype('int')

The features ‘User_ID’ and ‘Product_ID’ are identifiers that are not useful for prediction, so they are dropped. The feature ‘Product_Category_3’ contains 69.67% null values, so it is dropped as well.

train.drop(['User_ID', 'Product_ID', 'Product_Category_3'], axis = 1, inplace = True)

‘Age’, ‘Gender’, and ‘City_Category’ are the discrete object features in our dataset which need to be encoded for further use. This can be done using the Label Encoder from sklearn’s preprocessing library.

from sklearn.preprocessing import LabelEncoder

label_encoder_gender = LabelEncoder()
train['Gender'] = label_encoder_gender.fit_transform(train['Gender'])

label_encoder_age = LabelEncoder()
train['Age'] = label_encoder_age.fit_transform(train['Age'])

label_encoder_city = LabelEncoder()
train['City_Category'] = label_encoder_city.fit_transform(train['City_Category'])

The feature ‘Product_Category_2’ contains 31.57% null values, which can be fixed by filling them with the median value of the feature.

train['Product_Category_2'].fillna(train['Product_Category_2'].median(), inplace = True)

The dataset is then split into X which contains all the independent features and Y which contains the dependent feature ‘Purchase’.

X = train.drop("Purchase", axis = 1)
Y = train["Purchase"]

We can deal with multicollinearity by performing Feature Selection. The feature importances can be found using an ExtraTreesRegressor, which tells us that ‘Gender’, ‘City_Category’, and ‘Marital_Status’ are the least significant features in the dataset, so they are dropped.

from sklearn.ensemble import ExtraTreesRegressor

selector = ExtraTreesRegressor()
selector.fit(X, Y)
feature_imp = selector.feature_importances_

for index, val in enumerate(feature_imp):
    print(index, round((val * 100), 2))

Output:

0 0.54
1 2.16
2 5.03
3 0.76
4 2.7
5 0.63
6 75.79
7 12.37

X.drop(['Gender', 'City_Category', 'Marital_Status'], axis = 1, inplace = True)

For effective model building, we can standardize the dataset using Feature Scaling. This can be done with StandardScaler() from sklearn’s preprocessing library.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
for col in X.columns:
    X[col] = scaler.fit_transform(X[col].values.reshape(-1, 1))

The dataset is split into training data and testing data in the ratio 80:20 using the train_test_split() command.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

Output:
X_train shape: (440054, 5)
X_test shape: (110014, 5)
Y_train shape: (440054,)
Y_test shape: (110014,)

Step 4: Data Modelling

Extreme Gradient Boosting Regressor:

from xgboost import XGBRegressor 
xgb = XGBRegressor(random_state = 42)
xgb.fit(X_train, Y_train)
Y_pred_xgb = xgb.predict(X_test)

Understanding the Algorithm:

Extreme Gradient Boosting, or the XGBoost Regressor, is an ensemble learning technique in which trees are built sequentially so that each tree learns from the residuals of its predecessors. The basic idea is to start with a simple model, then repeatedly fit a new tree to the errors of the current ensemble and add it to the ensemble, gradually reducing those errors.
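
To make the residual-fitting idea concrete, here is a small standalone toy example (not part of the project pipeline) that boosts shallow decision trees by hand on synthetic data:

# Toy illustration of boosting: each new tree is fit to the residuals
# left by the ensemble built so far (synthetic data, not the Black Friday set)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 10, size = (200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale = 0.1, size = 200)

learning_rate = 0.5
prediction = np.full_like(y_toy, y_toy.mean())         # start from a constant model
for _ in range(10):
    residuals = y_toy - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth = 2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # correct part of the error

print("MSE after boosting:", np.mean((y_toy - prediction) ** 2))

With each round, the new tree chips away at the remaining error, which is exactly what XGBoost does at scale with additional regularization and optimizations.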

XGBoost has a number of tuning parameters that can be set before training. The most common ones are learning_rate, max_depth, subsample, colsample_bytree, n_estimators, objective, and regularization parameters such as gamma, alpha, and lambda. The learning rate defines the step size of each boosting round, max_depth determines how deep each tree is allowed to grow, subsample is the fraction of samples used per tree, colsample_bytree is the fraction of features used per tree, n_estimators is the number of trees, and objective determines the loss function. Gamma controls whether a given node will split, based on the expected reduction in loss after the split. Alpha controls L1 regularization and Lambda controls L2 regularization.
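
For illustration, these parameters can be passed directly to the regressor; the values below are arbitrary examples, not the tuned configuration used later in Step 6:

# Illustrative parameter values only - not the tuned configuration
from xgboost import XGBRegressor

xgb_example = XGBRegressor(
    n_estimators = 200,              # number of trees
    learning_rate = 0.1,             # step size of each boosting round
    max_depth = 6,                   # maximum depth of each tree
    subsample = 0.8,                 # fraction of rows sampled per tree
    colsample_bytree = 0.8,          # fraction of features sampled per tree
    objective = 'reg:squarederror',  # loss function for regression
    gamma = 0.1,                     # minimum loss reduction required to split a node
    reg_alpha = 0.01,                # L1 regularization term
    reg_lambda = 1.0,                # L2 regularization term
    random_state = 42
)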

XGBoost is a popular approach to gradient boosting due to features such as built-in regularization to prevent overfitting, handling of sparse data and missing values, parallel learning for faster computation, and cache awareness for optimal use of hardware. It is one of the most popular algorithms in competitions and hackathons due to its speed and strong performance.

Step 5: Model Evaluation

For evaluating the model we will use two metrics: root mean squared error (RMSE) and the R squared score (R2 score). The RMSE is the square root of the average squared error; the lower the RMSE, the better the model. R squared is a statistical measure of how close the data are to the fitted regression line; its value typically ranges from 0 to 1, and the higher the value, the better the model. The R2 score is popularly used to compare models.
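
For intuition, the same two metrics can also be computed directly with numpy; this is equivalent to the sklearn functions used below:

# Manual computation of RMSE and R2 (equivalent to the sklearn calls below)
errors = Y_test - Y_pred_xgb
rmse = np.sqrt(np.mean(errors ** 2))                                  # root mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((Y_test - Y_test.mean()) ** 2)  # R squared
print("RMSE:", rmse, "R2 score:", r2)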

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print("XGB regression:")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_xgb)))
print("R2 score:", r2_score(Y_test, Y_pred_xgb))

Output:
XGB regression:
RMSE: 3024.8703086442342
R2 score: 0.6358443502285505

Step 6: Hyperparameter tuning

Every machine learning model has a mathematical model at its core with a number of parameters that need to be learned from the data. Hyperparameters are a special kind of parameter that cannot be learned from data and are fixed before the training begins. In this step, we will select the right hyperparameters for our model which will give us a better prediction.

Hyperparameters can be tuned using either RandomizedSearchCV or GridSearchCV. We will use RandomizedSearchCV for this dataset. RandomizedSearchCV finds good hyperparameters by sampling the parameter space randomly, avoiding unnecessary computation.

from sklearn.model_selection import RandomizedSearchCV

max_depth = [int(x) for x in np.linspace(start = 5, stop = 20, num = 15)]
learning_rate = ['0.01', '0.05', '0.1', '0.25', '0.5', '0.75', '1.0']
min_child_weight = [int(x) for x in np.linspace(start = 45, stop = 70, num = 15)]

params = {
    "learning_rate" : learning_rate,
    "max_depth" : max_depth,
    "min_child_weight" : min_child_weight,
    "gamma" : [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree" : [0.3, 0.4, 0.5, 0.7]
}

xgb_tune = XGBRegressor(random_state = 42)
xgb_cv = RandomizedSearchCV(xgb_tune, param_distributions = params, cv = 5, verbose = 0, random_state = 42)
xgb_cv.fit(X_train, Y_train)

xgb_cv.best_score_

Output: 0.6512707227919969

xgb_cv.best_params_

Output:
{'colsample_bytree': 0.7, 'gamma': 0.3, 'learning_rate': '1.0', 'max_depth': 11, 'min_child_weight': 66}

We use RandomizedSearchCV to find the best values for the parameters ‘learning_rate’, ‘max_depth’, ‘colsample_bytree’, ‘gamma’, and ‘min_child_weight’.

xgb_best = XGBRegressor(colsample_bytree = 0.7, gamma = 0.3, learning_rate = 1.0, max_depth = 11, min_child_weight = 66, verbosity = 0, random_state = 42)
xgb_best.fit(X_train, Y_train)
Y_pred_xgb_best = xgb_best.predict(X_test)

print("XGB regression:")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_xgb_best)))
print("R2 score:", r2_score(Y_test, Y_pred_xgb_best))

Output:
XGB regression:
RMSE: 2985.7374358000807
R2 score: 0.6452055961121277

After hyperparameter tuning the XGBoost regressor, we find an RMSE of about 2985 and an R2 score of about 0.64.

Step 7: Model Deployment

For deploying our model we will first build a web application using the Flask micro-framework. This application can be deployed to the web using Heroku, a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. The application can be found here.
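
A minimal sketch of such a Flask app is shown below; the route names, form fields, and the model file name model.pkl are assumptions for illustration, not the deployed code.

# app.py - a minimal sketch; the form fields and 'model.pkl' are assumed, not the actual deployed app
import pickle
import numpy as np
from flask import Flask, request, render_template

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))   # the trained XGBoost regressor

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods = ['POST'])
def predict():
    # Collect the preprocessed feature values submitted through the HTML form
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))
    return render_template('index.html',
                           prediction_text = 'Predicted purchase amount: {:.2f}'.format(prediction[0]))

if __name__ == '__main__':
    app.run(debug = True)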

Flask WebApp deployed on Heroku
