Shipping Duration Prediction

William Ong
Aug 18, 2021


👈 Part II-V | TOC | Part III-II 👉

Shipping Process Illustration

Objective

Problem

Estimating the delivery date is tricky: a balance needs to be maintained to ensure the order is not late while also quoting as competitive a lead time as possible.

Many problems can arise from a lack of precision when estimating delivery duration. Where Is My Order (WISMO) is one of the most common problems that e-commerce faces. Beyond that, a Return to Origin (RTO) problem can also occur if the estimated delivery duration doesn't match the actual delivery well.

Solution

To combat this problem, we need a more precise delivery duration estimate. So, I want to build a model that predicts shipping duration based on features available at the moment the user places an order. Hopefully the model can outperform the baseline estimator the e-commerce platform currently uses and help us gain more customers, thus increasing revenue. Beyond that, we can also use the model to measure our shipping performance by calculating the difference between actual and estimated duration in the future.

Data Overview

First of all, we need to understand the structure of our dataset. The dataset is composed of 95,127 rows (orders) and 44 columns (features), and was built using features from our data warehouse from Part I.

The features relate to different aspects of the order at 5 information levels:

  • package details : information about the package that was shipped, such as package weight, volume, etc.
  • order details : information about the order, such as pickup limit date, order date, approved date, order value, etc.
  • customer details : information about the customer's location
  • seller details : information about the seller's location
  • shipping details : information about the shipping process, such as shipping cost.

Exploratory Data Analysis (EDA)

Before modelling, we need to examine our data. Because we have so many features, we are only going to explore the important features and those that need further transformation to increase their value to the model.

Our target feature is called wd_actual_delivery_interval, which indicates the duration of the shipping process.

Target Feature (wd_actual_delivery_interval)

Since we already processed our data in Part I, the data doesn't contain any missing values.

Numerical Feature Attributes
Numerical Attributes (Sample)
Example of a categorical feature (order is weekend)

Feature Engineering

After exploring the data, we are going to create a few features that can help the modelling process. These features are explained below, with a code sketch after the list:

  • pickup point : cluster each origin and delivery point so they can be grouped. The intuition comes from delivery services, which usually collect packages from several areas and deliver them together.
Pickup Point
  • is same area : check whether the origin and destination are in the same state. The intuition for this feature comes from the difference in delivery duration between same-state and different-state shipments.
Same Area
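
To make the construction concrete, here is a minimal sketch of both features, assuming a pandas DataFrame with hypothetical coordinate columns (`seller_lat`, `seller_lng`) and state columns (`seller_state`, `customer_state`); the real column names come from the Part I warehouse.

```python
import pandas as pd
from sklearn.cluster import KMeans

def add_engineered_features(df: pd.DataFrame, n_clusters: int = 20) -> pd.DataFrame:
    # cluster origin coordinates into pickup points, mimicking couriers
    # that collect packages from one area and ship them together
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    df["pickup_point"] = kmeans.fit_predict(df[["seller_lat", "seller_lng"]])

    # flag orders where seller and customer are in the same state
    df["is_same_area"] = (df["seller_state"] == df["customer_state"]).astype(int)
    return df
```

The number of clusters is a free choice here; in practice it would be tuned to match the granularity of the courier pickup areas.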

Normalization

To deal with the skewness of many features, we adopt normalization strategies for the numerical features, such as:

  • Log Transformation
  • Sqrt Transformation

Both techniques perform well on our dataset, giving a much more normal distribution for our numerical features. Normalized numerical attributes help the model understand the impact of each feature on the target variable (shipping duration). Check out some of the transformation results on our dataset below, followed by a short code sketch.

Before vs After Transformation
Overall feature after transformation
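
A minimal sketch of the two transformations, assuming the numerical features live in a pandas DataFrame `df`; the column names are illustrative.

```python
import numpy as np

# log1p = log(x + 1); safe for zero-valued features, compresses heavy right tails
df["shipping_cost_log"] = np.log1p(df["shipping_cost"])

# square root is a milder transform for moderately skewed features
df["package_weight_sqrt"] = np.sqrt(df["package_weight"])
```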

Modelling

Before modelling, there is some preprocessing that we need to do (a code sketch follows the list):

  • Categorical encoding : before we can feed our data into a model, we need to convert categorical features into numerical ones so the model can process them.
  • Scaling : essential for machine learning algorithms that are sensitive to feature magnitudes, such as linear regression and SVM.
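
A minimal sketch of both steps with scikit-learn, assuming illustrative column lists; the actual lists come from the dataset.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# illustrative column lists; the real ones come from the dataset
categorical_cols = ["pickup_point", "is_same_area", "order_is_weekend"]
numerical_cols = ["shipping_cost_log", "package_weight_sqrt"]

preprocessor = ColumnTransformer([
    # one-hot encode categoricals; ignore categories unseen at train time
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # standardize numericals for scale-sensitive models (linear, SVM)
    ("num", StandardScaler(), numerical_cols),
])

# X_train / X_test are the raw feature frames from a train/test split (assumed)
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
```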

Baseline Modelling (Approach)

Linear Model Illustration

Linear regression has been studied at great length, and there is a lot of literature on how your data must be structured to make best use of the model. As such, there is a lot of sophistication when talking about these requirements and expectations, which can be intimidating. In practice, you can use these rules more as rules of thumb when using Ordinary Least Squares regression, the most common implementation of linear regression.

Here are some of the assumptions we need to keep in mind when modelling with linear regression:

  • Linear Assumption. Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).
  • Remove Noise. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable and you want to remove outliers in the output variable (y) if possible.
  • Remove Collinearity. Linear regression will over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated.
  • Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit using transforms (e.g. log or Box-Cox) on your variables to make their distribution more Gaussian looking.
  • Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
Linear vs Poisson Regression Result
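
The figure above compares ordinary linear regression against Poisson regression. Here is a minimal sketch of both fits, assuming the prepared matrices from the preprocessing step:

```python
from sklearn.linear_model import LinearRegression, PoissonRegressor

linear = LinearRegression().fit(X_train_prepared, y_train)

# Poisson regression suits a non-negative, count-like target such as a
# duration measured in whole working days
poisson = PoissonRegressor(alpha=1.0, max_iter=300).fit(X_train_prepared, y_train)

linear_pred = linear.predict(X_test_prepared)
poisson_pred = poisson.predict(X_test_prepared)
```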
Support Vector Machine Representation

The SVM algorithm is capable of regression tasks as well as outlier detection. SVM regression aims to fit as many data points as possible between the margins while limiting margin violations (data points that fall outside the margin around the predicted working-day duration). Related Scikit-Learn estimators are SVR and LinearSVR.
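
A minimal sketch of the linear variant, again assuming the prepared matrices from the preprocessing step; `epsilon` sets the width of the penalty-free margin:

```python
from sklearn.svm import LinearSVR

# epsilon is the half-width of the penalty-free margin around the prediction
svr = LinearSVR(epsilon=0.5, C=1.0, max_iter=10_000)
svr.fit(X_train_prepared, y_train)
svr_pred = svr.predict(X_test_prepared)
```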

SVR Result
Gradient Boosting Model Illustration

Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model in order to minimize the error.
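A minimal sketch of both boosting fits, assuming the catboost and lightgbm packages and the prepared matrices (CatBoost could alternatively consume raw categorical columns via its `cat_features` argument, skipped here for simplicity):

```python
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

cat_model = CatBoostRegressor(verbose=0, random_state=42)
cat_model.fit(X_train_prepared, y_train)

lgbm_model = LGBMRegressor(random_state=42)
lgbm_model.fit(X_train_prepared, y_train)
```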

Catboost vs LGBM Result

Evaluation

RMSLE and RMSE Formula
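
For reference, the standard definitions over n predictions ŷ and actuals y:

```latex
\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
\mathrm{RMSLE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}
```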

We use RMSLE in this case to penalize more heavily the errors that make an order late (predicted − actual < 0). This works because RMSLE applies a biased penalty: it incurs a larger penalty for underestimating the actual value than for overestimating it. The RMSE metric is used to show the difference between actual and predicted duration in the original units, so we have a picture of how our model would perform in the real world.
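
A small numeric check of that asymmetry, using plain NumPy (the helper names are ours, not the original code):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

actual = np.array([10.0])
# underestimate by 5 days (order arrives later than promised) -> ~0.61
print(rmsle(actual, np.array([5.0])))
# overestimate by the same 5 days -> only ~0.37
print(rmsle(actual, np.array([15.0])))
```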

Overview of the evaluation for each model

Tuning

From the previous results, we can see that the CatBoost regressor performs better than all the other models. So, in order to maximize performance, we will tune the model with Hyperopt.

Hyperopt Illustration

There are 3 hyperparameters that we will tune (a search sketch follows the list):

  • learning rate (0.05, 0.1) : controls how fast the model learns; important for controlling overfitting
  • depth (3, 5, 7) : tree depth for each tree in the gradient boosting model
  • l2_leaf_reg (1, 3, 5, 7) : L2 regularization coefficient
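
A minimal sketch of the search, assuming the prepared matrices; the objective and cross-validation wiring are our assumptions, not the exact original setup:

```python
from hyperopt import Trials, fmin, hp, space_eval, tpe
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score

space = {
    "learning_rate": hp.choice("learning_rate", [0.05, 0.1]),
    "depth": hp.choice("depth", [3, 5, 7]),
    "l2_leaf_reg": hp.choice("l2_leaf_reg", [1, 3, 5, 7]),
}

def objective(params):
    model = CatBoostRegressor(**params, verbose=0, random_state=42)
    # minimize cross-validated RMSE (sklearn negates error scores)
    score = cross_val_score(model, X_train_prepared, y_train, cv=3,
                            scoring="neg_root_mean_squared_error").mean()
    return -score

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=Trials())
# fmin reports choice indices; recover the actual parameter values
print(space_eval(space, best))
```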

Using all the data, we compare our baseline model against the tuned model. The result can be seen below:

Baseline vs Hyperopt Tuned Catboost Regressor

We can see that our tuned model is a little better than our baseline model. This is not surprising, considering CatBoost models usually perform very well even with default parameters.

Feature Importance for Tuned Catboost Regressor

From the feature importances, we can see that features such as shipping cost, distance, pickup limit, and package weight affect the estimated duration of a shipping process the most. This actually makes sense: most of the time, the higher the shipping cost for an order (still to be weighed against the distance), the faster the shipping process. Likewise, light packages can be shipped faster than heavy ones.
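
For reference, a minimal sketch of how the importances can be read off, assuming `tuned_model` is the CatBoost regressor refit with the best Hyperopt parameters and `feature_names` matches the prepared training columns:

```python
import pandas as pd

# CatBoost importances are reported per input feature (summing to 100 by default)
importances = pd.Series(tuned_model.get_feature_importance(), index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```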

Conclusion

This task represented a great challenge, especially because of the amount of data and the number of different features to consider. From the results, we can see that most of the time our model errs by around ~4 days from the actual duration. While this may be inflated by outliers (an RMSE problem), it still shows that our model isn't yet accurate enough to help our customers.

Still, the model we created can be used as a performance indicator for each shipping process, to determine how well a single order was shipped, so we can keep improving our shipping process.

Beyond that, there is plenty of room for improvement in the project, such as:

  • Tracking the shipping company that handles the shipping process (if a 3rd-party shipping service is used)
  • Adding more features that track the progress of the package while it is being shipped.
