Predicting Material Backorders in Inventory Management
This is a walkthrough of my case study and my approach, based on a research paper presented at the IEEE Conference, 2017. The authors of the paper considered a real-world imbalanced dataset available in the Kaggle competition Can you predict Product Backorders?.
We will delve deep into how material backorders can be minimized during inventory management. But before that, let us understand what exactly backorders are and how to handle them from a theoretical perspective. Below is the link to the original research paper.
What are backorders?
A backorder is an order placed for a product that is temporarily out of stock with the supplier. A backorder indicates that the demand for a particular product or service is greater than its supply.
Backorders are not to be confused with “out of stock”. In the case of “out of stock”, the supply or production of the product may be uncertain. On the other hand, backorders are placed for products that are in planned production but have encountered a lag due to several factors.
In this case study, we will explore those factors in detail and see which ones are the most important cause for a product going into backorder.
What are some general causes for backorders?
Backorders are not essentially bad for a company; it really depends on how the inventory and orders are being managed. Some general causes for backorders are listed below.
- Order not promptly placed
- Warehouse discrepancies
- Human error
- Factory shortages
- Inaccurate order points
- Abnormal demand
- Customer convenience
What is the goal of this project?
The goal of this project is to minimize backorders by identifying material at risk of backorder before the event occurs. This gives business management suitable time to react and make appropriate changes.
Overview of the dataset and the problem
The dataset for this problem has two classes: a positive class, meaning the product went into backorder, and a negative class indicating the opposite. This makes the problem a binary classification problem. The data is highly imbalanced, with a ratio of 1:148 for the positive and negative classes respectively in the train set. The majority of the data points are negative, i.e., most of the products did not go into backorder. The dataset has 22 features and 1 class label. All the features and their descriptions are listed below:
- sku: Stock Keeping Unit
- national_inv: Current inventory level of component
- lead_time: Registered transit time
- in_transit_qty: In transit quantity
- forecast_3_month: Forecast sales for the next 3 months
- forecast_6_month: Forecast sales for the next 6 months
- forecast_9_month: Forecast sales for the next 9 months
- sales_1_month: Sales quantity for the prior 1 month
- sales_3_month: Sales quantity for the prior 3 months
- sales_6_month: Sales quantity for the prior 6 months
- sales_9_month: Sales quantity for the prior 9 months
- min_bank: Minimum recommended amount in stock
- potential_issue: Indicator variable noting a potential issue with the item
- pieces_past_due: Parts overdue from source
- perf_6_month_avg: Source performance in last 6 months
- perf_12_month_avg: Source performance in last 12 months
- local_bo_qty: Amount of stock orders overdue
- deck_risk: General risk flag
- oe_constraint: General risk flag
- ppap_risk: General risk flag
- stop_auto_buy: General risk flag
- rev_stop: General risk flag
- went_on_backorder: Product went on backorder
Below is the link to the dataset:
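Before any modeling, it is worth confirming the class imbalance directly. A minimal sketch of that check with pandas is below; the real run would read the Kaggle CSV (the commented file name is an assumption), so a tiny toy frame with the same target column stands in here:

```python
import pandas as pd

# Hypothetical file name -- the actual Kaggle archive may name it differently.
# df = pd.read_csv("Kaggle_Training_Dataset_v2.csv")

# Toy stand-in with the same target column, built to mirror the 1:148 ratio:
df = pd.DataFrame({"went_on_backorder": ["No"] * 148 + ["Yes"] * 1})

counts = df["went_on_backorder"].value_counts()
ratio = counts["No"] / counts["Yes"]
print(f"negative:positive imbalance ratio = {ratio:.0f}:1")  # 148:1 here
```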
Existing Solutions
- The original research paper — Predicting Material Backorders in Inventory Management using Machine Learning:
This paper employs various under-sampling and over-sampling techniques before fitting a model to curb the imbalanced-dataset problem. Logistic regression performs the worst while gradient boosted decision trees perform the best. Some of the sampling techniques used prior to fitting the models are Random Under Sampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). In addition to these models, a special case of ensemble learning has been employed which combines random sampling with ensemble models. This is called Blagging, and its performance is very close to that of the gradient boosting model.
- Backorder Prediction by Srinivasa Raja | Analytics Vidhya:
The approach followed by the author is very similar to that of the original research paper. In addition to RUS and SMOTE, the blog shows additional sampling techniques like Adaptive Synthetic Sampling (ADASYN), Near Miss undersampling, Tomek links and more. Feature engineering techniques like log transforms and normalization have been applied to the data before fitting the models. In addition to AUC, the macro F1 score has been taken into account as an evaluation metric, since there is class imbalance in the data. https://medium.com/analytics-vidhya/backorder-prediction-d4f1c5362f18
- Prediction of probable backorder scenarios in the supply chain using Distributed Random Forest and Gradient Boosting Machine learning techniques — Journal of Big Data:
The authors use the Distributed Random Forest (DRF) and Gradient Boosted Machine (GBM) algorithms from H2O.ai. The random forest model is selected as the baseline model and gradient boosting is the second model. AUC and the confusion matrix are chosen as the evaluation metrics. Sampling techniques like RUS and SMOTE are applied prior to fitting the H2O models. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00345-2
- Product backorder prediction using deep neural network on imbalanced data:
This paper shows that neural networks, especially fully connected networks, can be used in combination with the same sampling techniques like RUS and SMOTE to predict backorders. The authors state that the deep learning model outperformed some of the prominent classification models in terms of standard evaluation metrics. https://www.tandfonline.com/doi/full/10.1080/00207543.2021.1901153?scroll=top&needAccess=true
Which performance metrics should be used?
We are going to report accuracy for this case study. However, accuracy is not a good measure for a highly imbalanced dataset, so we will employ additional metrics like the ROC AUC, which is well suited to binary classification. We are also going to use the confusion matrix, along with precision and recall, for a better understanding of the model predictions.
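All four metrics are available in scikit-learn. A minimal sketch on toy labels and scores (the numbers below are illustrative, not model outputs from the case study):

```python
from sklearn.metrics import (roc_auc_score, confusion_matrix,
                             precision_score, recall_score)

# Toy ground truth and predicted scores; in the case study these come
# from the fitted model on the test set.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.15, 0.3, 0.4, 0.1, 0.75, 0.2, 0.7, 0.8]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

auc = roc_auc_score(y_true, y_score)                      # threshold-free
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel() # at threshold 0.5
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(auc, (tn, fp, fn, tp), precision, recall)
```

Note that AUC uses the raw scores while the confusion matrix, precision and recall depend on the chosen decision threshold.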
Exploratory Data Analysis
We are given a dataset which is already split into train and test set. The train set consists of 1,687,860 data points and the test set consists of 242,075 data points. Both the datasets have 23 features of which went_on_backorder is the target label.
The feature sku is the identifier and therefore, must be dropped from the dataset. Out of the other 21 features (excluding the class label went_on_backorder) we see that 15 of them are numerical and 6 of them are categorical. All the categorical features have either Yes or No which shows that they are binary in nature. The numerical features include national_inv, lead_time, in_transit_qty, forecast_3_month, forecast_6_month, forecast_9_month, sales_1_month, sales_3_month, sales_6_month, sales_9_month, min_bank, pieces_past_due, perf_6_month_avg, perf_12_month_avg, local_bo_qty and the categorical features include potential_issue, deck_risk, oe_constraint, ppap_risk, stop_auto_buy and rev_stop.
Furthermore, we see that lead_time is the only feature with missing values: 5.98% of its values in the train set are null. In addition, the last row of the dataset is null across every feature and can therefore be removed, as dropping one row has no impact on the whole.
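Both checks are one-liners in pandas. A minimal sketch on a toy frame that mimics the situation (an all-null trailing row plus NaNs in lead_time):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the train set: lead_time has NaNs, last row all-null.
df = pd.DataFrame({
    "national_inv": [10, 5, 0, 3, np.nan],
    "lead_time":    [8, np.nan, 2, np.nan, np.nan],
})

# Drop rows that are null across every column, then measure missingness.
df = df.dropna(how="all")
pct_missing = df["lead_time"].isna().mean() * 100
print(f"lead_time missing: {pct_missing:.1f}%")  # 50.0% on this toy frame
```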
From the univariate analysis of each feature with respect to the class label, we see that most of the numerical features are highly skewed towards the positive side.
We see from the above plots that the distribution of in_transit_qty is somewhat similar to that of national_inv. Both features are positively skewed.
We know there are missing values for lead_time, and therefore have removed them and plotted the probability density functions. We see that the PDF of lead_time also shows positive skewness.
The bar plots represent an estimate of central tendency (in this case, the mean). From the set of bar plots, we can say that over a span of 3, 6 and 9 months, the mean forecast sales is decreasing as a whole for the positive class, while it seems to be constant for the negative class.
To understand the distributions and IQRs, we have plotted box plots and violin plots (violin plots are shown above). The IQRs are not clearly visible, and there are a lot of outliers, especially for the negative class, for all three features. The range of the outliers only increases for the later months, which is expected, as the number of orders increases with time. Furthermore, the distributions of all three features are similar, with each being extremely positively skewed. We can assume that the data points located towards the tail may not actually be outliers, and this trend is observed across all the numerical features.
When we look at the sales features, we see that the PDFs, box plots and violin plots are very similar to those of the forecast features. To understand the data better, we have removed the entire Q4 for all four sales features and plotted count plots. We quickly see that there are a lot of products with no units sold in any of the prior months. For sales_1_month, data points with at least one unit sold outnumber data points with at least three units sold, and the same pattern holds for all the other sales features.
From the count plot of min_bank, we can deduce that most of the values tend to be zero and there are very few data points with a min_bank value of three or more.
Furthermore, the features pieces_past_due and local_bo_qty are very similar to national_inv. The PDFs and the box plots show that their distribution is also skewed like that of the earlier features.
From the above figure, we can see that the PDFs of perf_6_month_avg and perf_12_month_avg are very similar. We see a Gaussian-like distribution for both features around the zero point on the scale. However, the curves extend far towards the negative axis, indicating negative skewness. From the bar plots, we see that the average source performance over 6 and 12 months is around -3 for the orders that went into backorder and around -6 to -7 for the orders which did not. The box and violin plots also indicate that the distribution is negatively skewed and that there are a few outliers for both classes. The median value of perf_6_month_avg and perf_12_month_avg is 0.82 and 0.81 respectively, and 90% of the points are less than 0.99 for both features.
From the count plots above, we can clearly see that there are very few data points with the risk flags oe_constraint and rev_stop. There are a decent number of data points with deck_risk as Yes and a considerable number with ppap_risk and stop_auto_buy as Yes. The majority of the data points in the train set do not have any risk flags.
Spearman Rank Correlation Coefficient
From the above heat map, we see that the in_transit_qty, forecast_3_month, forecast_6_month, forecast_9_month, sales_1_month, sales_3_month, sales_6_month, sales_9_month and min_bank are highly correlated with each other. Among them, forecast_3_month, forecast_6_month and forecast_9_month are more correlated with each other compared to the rest. Similarly, sales_1_month, sales_3_month, sales_6_month and sales_9_month are more correlated with each other than any other feature. We also see that perf_6_month_avg and perf_12_month_avg are highly correlated with each other.
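The heat map behind these observations is a Spearman correlation matrix, which pandas computes directly. A minimal sketch on toy columns built to mimic the correlated forecast features (the synthetic data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy columns: two features derived from a shared base (hence correlated)
# and one independent feature, standing in for forecast/sales columns.
base = rng.gamma(2.0, 5.0, size=1000)
df = pd.DataFrame({
    "forecast_3_month": base + rng.normal(0, 1, 1000),
    "forecast_6_month": 2 * base + rng.normal(0, 1, 1000),
    "sales_1_month":    rng.gamma(2.0, 5.0, size=1000),
})

# Spearman works on ranks, so it is robust to the heavy skew seen in the EDA.
corr = df.corr(method="spearman")
# In the notebook: seaborn.heatmap(corr) renders the heat map.
print(corr.round(2))
```

Spearman (rank) correlation is a sensible choice here precisely because the numerical features are so skewed; Pearson correlation would be dominated by the extreme tail values.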
Kolmogorov–Smirnov test for numerical features
We are using the KS test to check whether the distribution of each numerical feature differs between the two classes. The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution, or between the empirical distribution functions of two samples.
We can see that most of the features have a very high number of data points at 0. From the KS test on all the numerical features, most features do not have good p-values, and thus we have to reject the null hypothesis: their per-class distributions are not similar and they do not show much association with the target variable. However, some features like lead_time, perf_6_month_avg and perf_12_month_avg show good enough association with the target variable.
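The two-sample KS test is available as scipy.stats.ks_2samp. A minimal sketch on synthetic stand-ins: one feature whose per-class distributions differ (like lead_time) and one whose distributions are identical between classes (the distributions chosen are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Feature whose distribution shifts between the two classes:
informative_pos = rng.normal(10, 2, 500)
informative_neg = rng.normal(12, 2, 5000)
# Feature drawn from the same distribution for both classes:
uninformative_pos = rng.exponential(1.0, 500)
uninformative_neg = rng.exponential(1.0, 5000)

stat1, p1 = ks_2samp(informative_pos, informative_neg)
stat2, p2 = ks_2samp(uninformative_pos, uninformative_neg)
print(f"informative:   KS={stat1:.3f}, p={p1:.3g}")
print(f"uninformative: KS={stat2:.3f}, p={p2:.3g}")
```

A large KS statistic with a tiny p-value says the positive- and negative-class samples come from different distributions, i.e. the feature carries signal about the target.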
Stochastic/Probability Matrix for categorical features
From the above set of probability matrices for all the categorical features, we see that most of these features have a very high probability of a negative flag when the product did not go into backorder. Therefore, we can say that when a product does not go into backorder, most of the general risk flags are negative.
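Such a probability matrix can be built with a row-normalized crosstab. A minimal sketch for one risk flag against the target, on toy counts chosen for illustration:

```python
import pandas as pd

# Toy data mimicking one risk flag against the target label
# (counts are illustrative, not from the real train set).
df = pd.DataFrame({
    "deck_risk":         ["No"] * 90 + ["Yes"] * 10 + ["No"] * 3 + ["Yes"] * 2,
    "went_on_backorder": ["No"] * 100 + ["Yes"] * 5,
})

# Row-normalized crosstab: each row gives P(flag value | class).
prob_matrix = pd.crosstab(df["went_on_backorder"], df["deck_risk"],
                          normalize="index")
print(prob_matrix)
```

On this toy frame, the "No" row reads 0.9 / 0.1, i.e. a 90% chance of a negative flag given the product did not go into backorder, which mirrors the pattern described above.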
Principal Component Analysis
We have used a dimensionality reduction technique, in this case Principal Component Analysis, to capture the essence of the data. From the above plot, we see that most of the data points lie close to 0. This matches our EDA, where we saw many features with mostly zero values. There are outliers in the data, but those data points do not have to be outliers per se.
Furthermore, these potential outliers belong mostly to the negative class. For the positive class, almost all of the data points lie close to 0.
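A minimal sketch of the projection, on a synthetic matrix built to mimic the correlated, zero-heavy numerical features (the toy data is an assumption; the real run uses the scaled train features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Five correlated columns from a shared skewed base, like the inventory data.
base = rng.gamma(2.0, 5.0, size=(1000, 1))
X = np.hstack([base + rng.normal(0, 1, (1000, 1)) for _ in range(5)])
X[rng.random(1000) < 0.7] = 0.0  # many all-zero rows, as seen in the EDA

# Standardize before PCA so no single feature's scale dominates.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print(X_2d.shape)  # (1000, 2) -- ready for a 2-D scatter plot per class
```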
Feature Engineering
As there are missing values in the feature lead_time, we have performed mean imputation. Furthermore, we saw that the features pieces_past_due and local_bo_qty have more than 95% of their values as 0. Therefore, as a feature engineering step, we have added indicator features which show whether each data point in these two features is zero or non-zero.
In addition to these features, we have encoded all the categorical features with the probabilities from the probability matrices. We have only considered the zero-class probability and imputed its values across all the categorical features.
We have performed all the above feature engineering steps on both the train and test datasets. The final data has the shape 1,687,860 rows × 24 columns for the train set and 242,075 rows × 24 columns for the test set.
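The imputation and indicator steps can be sketched in a few lines of pandas. The toy frame and the `*_nonzero` column names below are my own illustrative choices, not the exact names used in the case study:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the train set (column names from the dataset).
df = pd.DataFrame({
    "lead_time":       [8.0, np.nan, 2.0, np.nan],
    "pieces_past_due": [0.0, 0.0, 5.0, 0.0],
    "local_bo_qty":    [0.0, 1.0, 0.0, 0.0],
})

# 1. Mean imputation for lead_time (in practice, compute the mean on the
#    train set and reuse it for the test set to avoid leakage).
df["lead_time"] = df["lead_time"].fillna(df["lead_time"].mean())

# 2. Binary indicator features for the two zero-heavy columns.
df["pieces_past_due_nonzero"] = (df["pieces_past_due"] != 0).astype(int)
df["local_bo_qty_nonzero"] = (df["local_bo_qty"] != 0).astype(int)
print(df)
```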
Building the model and evaluation
In this approach, we are not going to use any of the sampling techniques applied in the original research paper and a few of the existing solutions. The idea is neither to corrupt the data by adding synthetic data points nor to reduce the size of the dataset for the model. Therefore, to curb the imbalanced-dataset problem, we are using class weights.
As a baseline model, I have chosen Logistic Regression with hyperparameter tuning using scikit-learn’s GridSearchCV. The parameters that were tuned are the penalty and the learning rate. The best logistic regression model uses a learning rate of 0.001 with an L1 penalty. The accuracy score achieved on the test set is 0.798, while the AUC on the same test set is 0.809.
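For reference, scikit-learn's LogisticRegression does not expose a learning rate; the knobs GridSearchCV can tune for it are the penalty and the inverse regularization strength C. A minimal sketch of this tuning setup with balanced class weights, on toy imbalanced data (the grid values and dataset are illustrative assumptions, not the actual run):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy imbalanced data (~5% positives); the real run uses the engineered
# train set described above.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 1.0], "penalty": ["l1", "l2"]}
grid = GridSearchCV(
    # liblinear supports both l1 and l2 penalties;
    # class_weight="balanced" reweights classes inversely to frequency.
    LogisticRegression(class_weight="balanced", solver="liblinear",
                       max_iter=1000),
    param_grid, scoring="roc_auc", cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring on roc_auc rather than accuracy keeps the tuning aligned with the imbalance-aware metrics chosen earlier.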
In this case study, I have tried four machine learning models, including the baseline logistic regression model, each with balanced class weights. Hyperparameter tuning has been performed to get the best out of each model, using techniques like grid search and random search.
We observe that tree-based models perform better than linear models. Please check the summary below of all the metrics across the four models, i.e., Logistic Regression, Decision Tree, Random Forest and Gradient Boosted Decision Trees.
From the above AUC plots and summary, Random Forest is the best-performing model with an AUC of 92.6, which is very close to the AUC (94.7) achieved by the authors of the original research paper. We were able to achieve this score without any of the oversampling or undersampling techniques used in the paper. Instead, the Random Forest model used balanced_subsample as the class weight to curb the imbalanced-data problem.
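A minimal sketch of this setup: scikit-learn's class_weight="balanced_subsample" recomputes the class weights per bootstrap sample of each tree, so no resampling of the data itself is needed. The toy dataset and hyperparameters below are illustrative, not the tuned values from the case study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data (~5% positives) standing in for the engineered set.
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced_subsample" reweights classes within each tree's bootstrap
# sample, handling the imbalance without over/under-sampling the data.
rf = RandomForestClassifier(n_estimators=200,
                            class_weight="balanced_subsample",
                            random_state=0)
rf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"test AUC = {auc:.3f}")
```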
The accuracy score for the best model is 93.8, which is good compared to Logistic Regression and the Decision Tree. The Gradient Boosted Decision Tree model has the highest accuracy, but we have not selected it, as we have established that accuracy is not the best metric for highly imbalanced data.
Deployment on AWS using Streamlit
After finalizing the model, I have deployed it on an EC2 instance using the Streamlit API. You can follow the link below to view the app.
Link to the streamlit app: http://34.238.245.11:8501/
Future Work
We can always employ sampling techniques to improve the performance of the model; techniques like SMOTE and ADASYN have proven effective in curbing the imbalanced-dataset problem. Future work includes using other machine learning models like Support Vector Machines, and even neural networks, in combination with the mentioned sampling techniques.
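To make the SMOTE idea concrete: each synthetic minority point is an interpolation between a minority point and one of its k nearest minority neighbours. The NumPy sketch below is a simplified illustration of that idea, not the imbalanced-learn implementation (which is what one would actually use):

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, rng=None):
    """Simplified sketch of SMOTE: interpolate between a random minority
    point and one of its k nearest minority-class neighbours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # k nearest neighbours by Euclidean distance (index 0 is x itself).
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        x_nn = X_minority[rng.choice(neighbours)]
        gap = rng.random()                    # uniform in [0, 1)
        new_points.append(x + gap * (x_nn - x))
    return np.array(new_points)

rng = np.random.default_rng(1)
X_min = rng.normal(0, 1, size=(20, 3))        # toy minority class
X_synth = smote_like(X_min, n_new=50, rng=rng)
print(X_synth.shape)  # (50, 3)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the minority class's region of feature space rather than being arbitrary noise.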
All code related to this case study is hosted on my GitHub profile.
Contact me on my LinkedIn profile
References
- https://www.appliedaicourse.com/
- https://www.researchgate.net/publication/319553365_Predicting_Material_Backorders_in_Inventory_Management_using_Machine_Learning
- https://medium.com/analytics-vidhya/backorder-prediction-d4f1c5362f18
- https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00345-2
- https://www.tandfonline.com/doi/full/10.1080/00207543.2021.1901153?scroll=top&needAccess=true