Estimated Time Delivery Prediction (Study Case: Ecommerce Dataset)

Katarina Nimas Kusumawati
13 min readAug 22, 2021

--

Reported from the article Aulia, et al with respondents are students of the Faculty of Economics, Islamic University of Borneo Muhammad Arsyad Al-Banjari (UNISKA) Banjarmasin with a total of 100 respondents, in which the article contains a service quality factor as one of the determinants of someone buying goods in e-commerce or no. Based on the survey results, 68% of respondents agree with the services provided by the seller. Things that can be done to improve the quality of customer satisfaction are friendly and polite communication, fast response to consumers, and on-time delivery. One of the factors that we want to review is the timeliness of delivery. Based on the survey, as many as 66% of respondents agree with the services provided on time. Good seller service in sending goods and on time according to the specified delivery estimate can increase customer satisfaction in shopping¹.

Purpose

Knowing whether the product sent to the customer has arrived on time and how the customer responds in the form of a rating.

Hypothesis

· 80% orders sent on time.

· Late delivery will affect customer satisfaction

Is the product delivered on time? If not, does late delivery affect customer satisfaction?

By using the calculation assumption whether an order arrives extremely early, early, on time (normal), late, extremely late, or has not arrived.

If an order doesn’t have a delivered date key then the order has not arrived.

If the order is delivered more than 2 weeks to 2 weeks than estimated, the order is classified as extremely early.

If the order is delivered less than 2 weeks to 1 week earlier than estimated, the order is classified as early.

If the order is delivered less than 1 week earlier to the D day of the estimate, the order is classified as normal (on time).

If the order is delivered 1 day to 1 week late from the estimate, the order is classified as late.

If the order is delivered more than 1 week later than estimated, the order is classified as extremely late.

Then it is found that:

  1. Most orders were shipped very earlier than the estimated, 42,070 orders. Only 13,98% of orders that arrived on time (13796 out of 98666). This phenomenon is quite strange because it means the distance between the estimated delivery time and the actual delivery time is still too far.
Estimated Delivery Time

2. Most orders get a perfect rating (5), which is 58.91% of the total orders.

Only shows orders that have the order status Delivered and have arrived.

Rating Overall

3. Most of the orders were sent to customers, which was 97.78% (96478) of the total orders.

Overall Delivery Status

From the visualization, there are 2 categories of orders that need to be explored more deeply, late and very late.

Late Orders

On late orders found the fact that:

  1. total number of late orders was 3.72% (3.672) orders from all total orders.
Total Late Orders

2. All late orders have been delivered to customers.

Delivery Status of Orders Sent Late

3. Late orders that get a rating of 1 are 41.91% of the total late orders.

Only display orders that have the order status Delivered.

Rating of Orders Sent Late

4. The highest percentage of orders that are late orders from the Kepulauan Bangka Belitung North Maluku. With a delay percentage of 50% of the total orders sent.

Only display orders that have the order status Delivered.

The Province with the Most Percentage of Orders Delivered Late

Extremely Late Orders

On extremely late orders found the fact that:

  1. Total orders that are extremely late are 2.90% (2.683) orders of all total orders.
Total Orders Extremely Late

2. On extremely late orders, 99.97% (2.862) orders were delivered and 0.03% (1) order was canceled.

Delivery Status of Orders Sent Extremely Late

3. Extremely late orders that get a rating of 1 are 69.88% of the total very late orders.

Only display orders that have the order status Delivered.

Rating of Orders Sent Extremely Late

4. The highest percentage of orders that are extremely late orders from Aceh-Jambi, North Kalimantan-Bali, Kepulauan Bangka Belitung — Southeast Sulawesi, and West Papua-West Sulawesi where all orders are delivered extremely late.

Only display orders that have the order status Delivered.

The Province with the Most Orders Sent Very Late

Conclusion:

1. A total of 42,64% (42.070) orders arrived extremely early.

2. Only 13,98% of orders arrived on time (13.796 out of 98.666 orders)

3. A total of 97,78% (96.478) orders were successfully delivered.

4. Most orders get a perfect rating (5), which is 58.91% of the total orders.

5. For late orders, most of the rating is 1 with a total of 41.91%.

6. For orders that are extremely late, the highest rating obtained is rating 1 with a total of 69.88%, so orders that arrive late and very late affect customer satisfaction.

7. Orders that are late and extremely late happened in regions outside Java.

Suggestion:

· Adjust the estimated delivery time so that the distance is not too far from the original delivery date.

· Provide notification to customers that the ordered goods are expected to arrive late.

Adjust Estimated Delivery Time

In the visualization before, we can see that we can’t predict the delivery time properly and caused less customer satisfaction. In this case, we want to remodel the estimated time delivery using some models. I will using Python to predict the delivery Time.

Metrics

We use this metrics for regression modelling.

Directed Error

Metric that preservers the magnitude of the target and the direction of the errors made.

Directed Error

Mean Absolute Error

Average of the absolute difference between the actual and predicted values in the dataset.

Mean Absolute Error

Mean Squared Error (MSE)

Average the squared difference between original and predicted values in the dataset.

Mean Squared Error

Root Mean Squared Error (RMSE)

Square root of Mean Squared Error.

Root Mean Squared Error

Data Preparation

Import Library

First, we need to prepare the library that we will be used.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

numpy: library provided by Python to facilitate computational operations of numeric data types.

pandas: library in Python provide data structure and data analysis.

matplotlib: multiplatform data visualization library built on NumPy array arrays.

seaborn: Python data visualization library based on matplotlib.

Load and Overview the Data

I only used dimension customer, dimension seller, fact orders, and dimension products

Dimension Customer

Dimension Customer
Dimension Customer

Dimension Seller

Dimension Seller
Dimension Seller

Dimension Products

Dimension Products
Dimension Products

Fact Order

Fact Order

Preprocessing Data

This is flow of the preprocessing the data:

Preprocessing Data

Left Join Between Fact Order and Dimension Seller.

orders_sellers = pd.merge(df_orders, df_sellers, on = 'seller_key', how = 'left')

Left Join again with Dimension Customer

orders_join = pd.merge(orders_sellers, df_customers, on = 'customer_key', how = 'left')

Filter Out Orders with Multiple Sellers

Because there are orders that have more than one seller and are located in different locations, but there is only one delivered date. The total order that has one seller in each order is 98,163.

df = orders_join.groupby('order_id').nunique()single_order = pd.Series(df[df['seller_key'] == 1].index)filtered_orders = pd.merge(orders_join, single_order, how = 'inner')

Drop Rows with Missing Values

filtered_orders = filtered_orders.dropna()

Left Join Between Fact Order and Dimension Products

df_pz = df_orders[[‘order_id’, ‘product_key’]].merge(df_products[[‘product_key’, ‘product_weight_g’, ‘product_volume_cm3’]],
on = ‘product_key’,
how = ‘left’)

Aggregate with sum using order_id

orders_size_weight = df_pz.groupby(‘order_id’, as_index = False).sum()[[‘order_id’, ‘product_volume_cm3’, ‘product_weight_g’]]

Left Join with processed Dimension Products

filtered_orders = pd.merge(filtered_orders, orders_size_weight, on = ‘order_id’, how = ‘left’)

Build Feature & Target

Build Target

The target is waiting time. Since we want to predict proper waiting time. To calculate the wait time is delivered_date-orders_date.

get_time = ['order_date_key', 'delivered_date_key', 'estimated_time_delivery_date_key']for column in get_time:
filtered_orders.loc[:, column] = pd.to_datetime(filtered_orders[column], format='%Y%m%d')
filtered_orders.loc[:, 'wait_time'] = (filtered_orders['delivered_date_key'] - filtered_orders['order_date_key']).dt.days

This is the distribution of the target:

Wait Time

Build Feature

Select the possible feature and add an additional column that is the count for visualization.

data = filtered_orders[['order_id', 'order_date_key', 'delivered_date_key',  'customer_state', 'seller_state', 'product_volume_cm3', 'product_weight_g',  'wait_time','estimated_time_delivery_date_key']]data.loc[:, 'count'] = 1

Rename the columns

data.columns = ['order_id', 'order_date', 'delivered_date', 'customer_state', 'seller_state', 'product_volume_cm3', 'product_weight_g', 'wait_time', 'estimated_time_delivery', 'count']

Feature Engineering

day_of_week : The day of the week based on order date (0–6, 0 = Monday)

month : The month of that date based on order date (1–12, 1 = January)

year : The year of that date based on order date

data.loc[:, ‘day_of_week’] = data.order_date.dt.dayofweek
data.loc[:, ‘month’] = data.order_date.dt.month
data.loc[:, ‘year’] = data.order_date.dt.year

Final Feature

Rows marked in orange are features obtained from feature engineering.

Feature

Correlation Between Target and Feature

In this heatmap, the highest correlation feature with the target is product_weight_g.

Heatmap Correlation

Modelling

Preprocessing Categorical Features

Categorical features include customer_state, seller_state, year, order_month, order_day. Preprocessing is using get_dummies

get_dummies is to convert categorical variable into dummy/indicator variables.

categorical_features = ['customer_state', 'seller_state', 'year', 'month', 'day_of_week']X = pd.get_dummies(X, columns = categorical_features)

Preprocessing Numerical Features

Numerical features include product_volume_cm3 and product_weight_g. Preprocessing is using MinMaxScaler.

MinMax Scale
from sklearn.preprocessing import MinMaxScaler
numerical_features = [‘product_volume_cm3’, ‘product_weight_g’]
X.loc[:, numerical_features] = MinMaxScaler().fit_transform(X[numerical_features])

Linear Regression without Hyperparameter Tuning

First I used from the simplest model. Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable that you want to predict is called the dependent variable.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
reg = LinearRegression()y_reg_predcv = cross_val_predict(reg, X, y, cv=10)

Linear Regression with Hyperparameter Tuning

If you want to use another parameters, you can see the note in sklearn linear regression.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
reg = LinearRegression()# Define the hyperparametersreg_hyperparameters = dict()
reg_hyperparameters['fit_intercept'] = [True, False]
reg_hyperparameters['normalize'] = [True, False]
reg_hyperparameters['n_jobs'] = [-1, 1]
search_reg_hyperparameters = GridSearchCV(reg, reg_hyperparameters, cv=10)
reg_result = search_reg_hyperparameters.fit(X, y)# Resultprint('Best Score %s' % reg_result.best_score_)
print('Best Hyperparameters %s' % reg_result.best_params_)
Best Hyperparameters Linear Regression

Ridge Regression without Hyperparameter Tuning

Ridge regression is a model tuning method used to analyze any data that is subject to multicollinearity.

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
rdg = Ridge()y_rdg_predcv = cross_val_predict(rdg, X, y, cv=10)

Ridge Regression with Hyperparameter Tuning

If you want to use another parameters, you can see the note in sklearn ridge regression.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
rdg = Ridge()# Define the hyperparameters
rdg_hyperparameters = dict()
rdg_hyperparameters['solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
rdg_hyperparameters['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
rdg_hyperparameters['fit_intercept'] = [True, False]
rdg_hyperparameters['normalize'] = [True, False]
search_rdg_hyperparameters = GridSearchCV(rdg, rdg_hyperparameters, cv=10)
rdg_result = search_rdg_hyperparameters.fit(X,y)
#Resultprint('Best Score: %s' % rdg_result.best_score_)
print('Best Hyperparameters: %s' % rdg_result.best_params_)
Best Hyperparameters Ridge Regression

Lasso Regression without Hyperparameter Tuning

Usually used when we have more features because it automatically performs feature selection.
Lasso can set some coefficients to zero, thus performing variable selection, whereas ridge regression cannot.

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_predict
lsr = Lasso()y_lsr_predcv = cross_val_predict(lsr, X, y, cv=10)

Lasso Regression with Hyperparameter Tuning

If you want to use another parameters, you can see the note in sklearn lasso regression.

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lsr = Lasso()#Define the hyperparameters
lsr_hyperparameters = dict()
lsr_hyperparameters['selection'] = ['cyclic', 'random']
lsr_hyperparameters['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
lsr_hyperparameters['fit_intercept'] = [True, False]
lsr_hyperparameters['normalize'] = [True, False]
search_lsr_hyperparameters = GridSearchCV(lsr, lsr_hyperparameters, cv=10)
lsr_result = search_lsr_hyperparameters.fit(X,y)
#Resultprint('Best Score: %s' % lsr_result.best_score_)
print('Best Hyperparameters: %s' % lsr_result.best_params_)
Best Hyperparameters Lasso Regression

Elastic Net Regression without Hyperparameter Tuning

Elastic Net first emerged as a result of criticism of the lasso, whose variable selection can be data-dependent and therefore unstable. The solution is to combine ridge and lasso regression penalties to get the best of both worlds.

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_predict
enr = ElasticNet()y_enr_predcv = cross_val_predict(enr, X, y, cv=10)

Elastic Net Regression with Hyperparameter Tuning

If you want to use another parameters, you can see the note in sklearn elastic net regression.

from numpy import arange
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
enr = ElasticNet()# Define the hyperparameters
enr_hyperparameters = dict()
enr_hyperparameters['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
enr_hyperparameters['l1_ratio'] = arange(0, 1, 0.5)
search_enr_hyperparameters = GridSearchCV(enr, enr_hyperparameters, cv=10)
enr_result = search_enr_hyperparameters.fit(X,y)
#Result
print('Best Score: %s' % enr_result.best_score_)
print('Best Hyperparameters: %s' % enr_result.best_params_)
Best Hyperparameters Elastic Net Regression

Random Forest Regression without Hyperparameter Tuning

The decision tree classifies a data sample whose class is not yet known into existing classes. The use of decision trees in order to avoid overfitting a data set when achieving maximum accuracy.

Random forest is a combination of each good tree which is then combined into one model. Random Forest depends on a random vector value with the same distribution in all trees where each decision tree has a maximum depth.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
rfr = RandomForestRegressor()y_rfr_predcv = cross_val_predict(rfr, X, y, cv=10)

Random Forest Regression with Hyperparameter Tuning

Tuning using random forest is quite long, especially if the data used is quite large. Despite having many hyperparameters, we need to select which ones are important in determining the quality of the model. Here I want to increase the predictive power so I use the following parameters.

max_features: the maximum number of features a random forest considers dividing a node. If “sqrt”, then max_features=sqrt(n_features).

n_estimators: the number of trees the algorithm creates before taking the maximum vote or taking the average of the predictions. In general, a higher tree count improves performance and makes predictions more stable, but also slows down the computation.

min_samples_leaf: determines the minimum number of leaves required to divide the internal node.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr = RandomForestRegressor()# Define the hyperparameters
rfr_hyperparameters = dict()
rfr_hyperparameters['n_estimators'] = [100, 200]
rfr_hyperparameters['max_features'] = ['auto', 'sqrt']
rfr_hyperparameters['min_samples_leaf'] = [1, 5, 10]
search_rfr_hyperparameters = GridSearchCV(rfr, rfr_hyperparameters, cv=10)
rfr_result = search_rfr_hyperparameters.fit(X,y)
#Result
print('Best Score: %s' % rfr_result.best_score_)
print('Best Hyperparameters: %s' % rfr_result.best_params_)
Best Hyperparameters Random Forest Regression

Result

Compared to original data, MAE on original data has a bigger error than modeling using random forest with hyperparameter tuning. MAE is a more natural measure of mean error, and (unlike RMSE) is unambiguous. Therefore, the dimensionless evaluation and comparison between the mean error of the model performance should be based on MAE. RMSE does not describe the mean error alone and has other implications that are more difficult to parse and understand.

If it is based on directed error, the average prediction of the original data tends to be eleven days earlier, whereas if you use a random forest regressor, the average prediction result is only 0–1 day.

Result

Count Number of Shipment Status Difference

Prior to modelling, many shipments arrived too soon than estimated. After modelling, 49,5% of deliveries arrived on time. This means that on the new model, they can estimate arrival times better than the previous model.

Original Data
Random Forest Regression with Hyperparameter Tuning

Feature Importance

The random forest algorithm can measure the relative importance of each feature on the prediction. Sklearn provides a tool for this that measures a feature’s importance by looking at how much the tree nodes use that feature, is the feature contributes enough. It computes this score automatically for each feature after training and scales the results so the sum of all importance is equal to one.

Feature Importance

The most influential feature in this modelling is product_volume_cm3 or volume of the product. The feature importance that is much higher than other features.

Thank you for reading!

References:

[1] Aulia, G., H. & S., n.d. Faktor-faktor yang Mempengaruhi Keputusan Pembelian Online Melalui Aplikasi Shopee.

--

--

Katarina Nimas Kusumawati

Sometimes I struggle with data, sometimes I just wanna be a Pikachu