M5 Forecasting — Accuracy | Kaggle

Apurv Palekar
7 min read · Sep 9, 2021


A case study for predicting unit sales of Walmart goods.

Table of Contents:

  1. Business Problem
  2. Source of Data and Data overview
  3. ML formulation of Business Problem
  4. Custom Performance Metric
  5. Exploratory Data analysis
  6. Feature Engineering
  7. First cut ML models
  8. Custom stacking regressor model
  9. Model performances and Kaggle scores
  10. Model deployment
  11. Future work
  12. References

1. Business Problem

If retail businesses know in advance how many units will be sold in the coming days, they can plan accordingly and maximize their profits. This forecasting can be done using machine learning techniques.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. In the 5th iteration of the competition, hosted on Kaggle, we have to predict unit sales of various products sold in the USA by Walmart over a forecasting horizon of 28 days, using historical sales data.

2. Source of Data and Data overview

The data is available on the competition's Kaggle page in CSV format.

The data consists of unit sales of various products sold at Walmart in 3 US states (California, Texas and Wisconsin) over a period of around 5 years, from 2011-01-29 to 2016-06-19. It is organized as grouped time series in the following format:

A total of 42,840 time series can be formed from combinations of state, store, category, department and item. These are listed in the table below:

calendar.csv: Contains information about the dates and any special events.

sell_prices.csv: Contains store-wise information about products and their prices.

sales_train_validation.csv: Contains historical sales data for products along with department, category, store and state details for d_1-d_1913.

sales_train_evaluation.csv: Contains historical sales data for products along with department, category, store and state details for d_1-d_1941.

sample_submission.csv: The correct format for submissions.
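For reference, a minimal pandas snippet for loading these files and reshaping the sales table into long format (the files are assumed to sit in the working directory):

```python
import pandas as pd

# File names as provided on the Kaggle competition page.
calendar = pd.read_csv('calendar.csv')
sell_prices = pd.read_csv('sell_prices.csv')
sales = pd.read_csv('sales_train_evaluation.csv')

# Wide to long: one row per (series, day) instead of one column per day.
id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
sales_long = sales.melt(id_vars=id_cols, var_name='d', value_name='sales')
```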

3. ML formulation of Business Problem

The given forecasting task can be posed as a machine learning problem and solved using ML algorithms.

To do this, we first need to re-frame the data as a supervised dataset and predict the units sold on a particular day using ML regression models.

4. Custom Performance Metric

I used a custom metric, ARMSE (Asymmetric RMSE), based on root mean squared error, motivated by the following scenario:

Under-prediction of unit sales translates directly into lost revenue, whereas over-prediction mainly ties up inventory. Hence under-prediction needs to be punished more heavily.

where x = units_sold_pred - units_sold;

x is positive for over-prediction and negative for under-prediction.

1. For under-prediction, x/abs(x) is -1, so the (1 - x/abs(x))/2 term is selected; this term carries the higher penalty weight.

2. For over-prediction, x/abs(x) is +1, so the (1 + x/abs(x))/2 term is selected; this term carries the lower penalty weight.
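A minimal NumPy sketch of such a metric; the exact penalty weights used in the study are not stated in the post, so the 2.0/1.0 pair below is only illustrative:

```python
import numpy as np

def armse(y_true, y_pred, under_w=2.0, over_w=1.0):
    """Asymmetric RMSE: under-prediction is penalized more heavily.

    under_w/over_w are illustrative placeholders, not the exact
    weights from the original study.
    """
    x = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    s = np.sign(x)                              # -1 under, +1 over, 0 exact
    w = under_w * (1 - s) / 2 + over_w * (1 + s) / 2
    return np.sqrt(np.mean(w * x ** 2))

# Under-predicting by 1 unit costs more than over-predicting by 1:
print(armse([5], [4]), armse([5], [6]))         # ~1.414 vs 1.0
```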

5. Exploratory Data analysis

  1. California has recorded the highest sales, followed by Texas and Wisconsin.
  2. The FOODS category has the highest sales at all stores, followed by HOUSEHOLD; HOBBIES has the lowest sales.
  3. At all stores, the FOODS_3 department has the highest sales within the FOODS category, followed by FOODS_2; FOODS_1 has the lowest.
  4. The HOUSEHOLD_1 department has more sales than HOUSEHOLD_2.
  5. The HOBBIES_1 department has more sales than HOBBIES_2.
  6. Total sales steadily increased up to 2013. From 2013 to 2014 there was a slight drop in sales, followed by an increase in 2015.
  7. Total sales for 2016 are the lowest because the data for 2016 runs only up to June 19.
  8. For 2011, January sales are very low compared to other months. For the remaining months, sales are roughly equal, with some months slightly higher and some slightly lower.
  9. The 3rd quarter, i.e. July to September, has the highest sales of all quarters.
  10. March has recorded the highest sales in the 1st quarter of every year.
  11. California sales have increased steadily, whereas Texas sales are almost constant over the years. Wisconsin sales show two sharp increases, with sales close to Texas levels for most of the period.
  12. No two stores in California have a similar sales pattern. Sales at CA_3 are the highest, followed by CA_1 and CA_2; CA_4 sales are the lowest. The difference may be because CA_3 is located in a more urban or faster-growing area than the other stores.
  13. TX_2 initially had higher sales than the other two Texas stores, while TX_1 and TX_3 had almost identical sales patterns. Later, TX_3 sales increased and TX_2 sales decreased, and the two show similar trends towards the end.
  14. There is little difference between the sales patterns of the Wisconsin stores, suggesting they are located in similarly populated areas or similar-tier cities.
  15. Average daily units sold on a SNAP day are higher than on a non-SNAP day in all 3 states, with the gap being largest in Wisconsin, followed by Texas and California.
  16. Average daily units sold follow a weekly pattern, with sales highest on the weekend, i.e. Saturday and Sunday.
  17. Sales drop during the working days of the week and pick up towards the end of the work week, i.e. Friday.
  18. Sporting events have the highest average sales, followed by Cultural and Religious events; National events have the lowest. This may be because people like to watch a game with friends while snacking on their favorite foods, and preparing special foods and decorating the house are part of cultural and religious celebrations.

6. Feature Engineering

Lag features: For M5, when predicting sales for d_1942 onwards, previous sales values, e.g. those from d_1941 and d_1940, can be used as features. These are called lag features, with lag values 1 and 2 respectively.

Rolling window features: In the rolling window technique, we consider the previous values within a fixed window size and take their mean, median, sum, max or min; this value then becomes one of the features. For M5, the average number of units sold over the last 7 days for a given product can be a rolling window feature. Both feature types are sketched below.
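A small pandas sketch of both feature types on a toy long-format frame (the column names are assumptions; in M5 such a frame comes from melting the sales file):

```python
import pandas as pd

# Toy long-format frame: one row per (id, day) with unit sales,
# sorted by day within each series id.
df = pd.DataFrame({
    'id':    ['A'] * 10 + ['B'] * 10,
    'day':   list(range(1, 11)) * 2,
    'sales': [3, 0, 1, 4, 2, 5, 1, 0, 2, 3,
              7, 6, 8, 5, 9, 7, 6, 8, 7, 5],
})

# Lag features: units sold 1 and 2 days earlier for the same series.
for lag in (1, 2):
    df[f'lag_{lag}'] = df.groupby('id')['sales'].shift(lag)

# Rolling window feature: mean sales over the previous 7 days.
# shift(1) first so the current day's own sales never leak in.
df['rolling_mean_7'] = df.groupby('id')['sales'].transform(
    lambda s: s.shift(1).rolling(7).mean())
```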

7. First cut ML models

I considered data from d_1000 onwards for training, to speed up data processing and reduce memory consumption.

Decision Tree Regressor

A decision tree regressor builds a regression model in the form of a tree structure. The internal nodes of the tree are conditions applied to features from the data, and the leaf nodes contain values of the target variable.

Best hyperparameters: max_depth = 11, min_samples_split = 3
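The post reports only the winning values; a hedged sketch of how such a search might look with scikit-learn (X_train and y_train are the engineered features and targets from section 6, assumed here):

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# X_train, y_train: engineered features / unit sales (assumed names).
# The search ranges are illustrative; only the best values are reported
# in the post (max_depth=11, min_samples_split=3).
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={
        'max_depth': randint(5, 20),
        'min_samples_split': randint(2, 10),
    },
    n_iter=20,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_train, y_train)
print(search.best_params_)
```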

Random Forest Regressor

Random forest is an ensemble technique. The idea is to use multiple decision trees as base models and combine their outputs, rather than relying on a single decision tree.

Best hyperparameters: n_estimators = 64, max_depth = 12, min_samples_split = 6

XGBoost Regressor

XGBoost is a gradient-boosting ensemble technique with decision trees as base models.

Best hyperparameters: learning_rate = 0.140999404993131, n_estimators = 56, max_depth = 9

CatBoost Regressor

CatBoost is a gradient boosting framework that uses tree-based learning algorithms and natively handles categorical features.

Best hyperparameters: learning_rate = 0.55, depth = 3

LightGBM Regressor

LightGBM is a gradient boosting framework that uses tree-based learning algorithms, designed for speed and low memory usage.

Best hyperparameters: learning_rate = 0.26661528358747144, n_estimators = 90, max_depth = 8
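Fitting the LightGBM regressor with the reported best hyperparameters might look like this (X_train, y_train, X_val are assumed names, as before):

```python
from lightgbm import LGBMRegressor

# Reported best hyperparameters; data names are assumptions.
lgbm = LGBMRegressor(
    learning_rate=0.26661528358747144,
    n_estimators=90,
    max_depth=8,
)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_val).clip(min=0)  # unit sales cannot be negative
```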

8. Custom stacking regressor model

Stacking regression is an ensemble learning technique to combine multiple regression models via a meta-regressor. The individual regression models are trained based on the complete training set; then, the meta-regressor is fitted based on the outputs — meta-features — of the individual regression models in the ensemble.

The base regression models I used are the first cut ML models described above; the meta-model is a linear regressor. A scikit-learn approximation is sketched below.
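The post implements its own stacker; scikit-learn's built-in StackingRegressor gives an approximate equivalent (reported hyperparameters, assumed data names):

```python
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

# Base models are the first-cut regressors with their reported best
# hyperparameters; the meta-model is a plain linear regressor.
stack = StackingRegressor(
    estimators=[
        ('dt', DecisionTreeRegressor(max_depth=11, min_samples_split=3)),
        ('rf', RandomForestRegressor(n_estimators=64, max_depth=12,
                                     min_samples_split=6)),
        ('xgb', XGBRegressor(learning_rate=0.140999404993131,
                             n_estimators=56, max_depth=9)),
        ('cat', CatBoostRegressor(learning_rate=0.55, depth=3, verbose=0)),
        ('lgbm', LGBMRegressor(learning_rate=0.26661528358747144,
                               n_estimators=90, max_depth=8)),
    ],
    final_estimator=LinearRegression(),
    cv=3,
)
stack.fit(X_train, y_train)
```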

9. Model performances and Kaggle scores

The LightGBM regressor has the lowest RMSE and ARMSE (custom metric) scores. Still, I submitted all model outputs, since the competition's evaluation metric is different. I got the best score with the custom stacking regressor.

With the custom stacking regressor I got a score of 0.66622, which would place at rank 380 on the private leaderboard, i.e. within the top 7 percent.

10. Model deployment

I have created a Streamlit app which takes (ITEM_ID)_(STORE_ID) as input and outputs the predictions.
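A minimal sketch of the app's shape; the model file and the build_features helper below are hypothetical placeholders, not the actual deployed code:

```python
import pickle

import streamlit as st

st.title('M5 Forecasting - Unit Sales Prediction')

series_id = st.text_input('Enter (ITEM_ID)_(STORE_ID), e.g. FOODS_3_090_CA_3')
if series_id:
    with open('model.pkl', 'rb') as f:      # trained regressor (placeholder)
        model = pickle.load(f)
    X = build_features(series_id)           # hypothetical feature builder
    st.line_chart(model.predict(X))         # 28-day forecast
```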

I have hosted this Streamlit app on Heroku at the link below:

http://m5-streamlit.herokuapp.com/

Here’s a video explaining working of model:

11. Future work

The following approaches could be explored for this problem:

  1. A neural network with an appropriate custom loss function.
  2. Hierarchical time series reconciliation techniques, since the input data form a grouped time series hierarchy.

12. References

The full case study is available on GitHub. You can connect with me on my LinkedIn profile.
