M5-Forecasting-Accuracy

Lokesh Kumar Gupta
Published in TheCyPhy · Sep 28, 2020

Estimate the unit sales of Walmart retail goods.

Table of contents:

  1. Business Problem.
  2. Source of data.
  3. How the problem is solved through Machine Learning.
  4. Exploratory Data Analysis and its observations.
  5. Existing approaches to the problem.
  6. My first cut approach to solve the problem.
  7. Models explanation.
  8. Comparison of the models in tabular format.
  9. Kaggle submission.
  10. Final pipeline of the problem.
  11. Future work.
  12. References.

1. Business problem:

1.1 Overview:

In today’s ultra-competitive business landscape, everyone wants to increase their revenue. We have all heard about strategic planning, but do you really know exactly when you need to expand your team, start your next promotional campaign, or launch your next item? Revenue forecasting is done to answer all of these questions.

So are we predicting the future? Yes, though not perfectly accurately. It is done using machine learning.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate levels of uncertainty, avoid costly mistakes, and apply best forecasting practices.

The MOFC is well known for its Makridakis Competitions. In the fifth iteration (M5 Forecasting), hierarchical sales data from Walmart, the world’s largest company by revenue, is given, and the task is to forecast daily sales for the next 28 days.

1.2 Objective :

The main objective is to estimate or predict the unit sales of Walmart retail goods at stores in various locations for the next 28 days. Such estimates certainly help companies plan better and increase their revenue.

2. Source of Data:

The MOFC posted the above problem on the Kaggle platform. The data is provided in CSV format. The link below can be used to download the datasets.

Link: M5 Forecasting-Accuracy

2.1 Data overview:

The M5 dataset involves the unit sales of various products sold in the USA, organized in the form of grouped time series.

An overview of how the M5 series are organized is shown in the figure below.

An overview of how the M5 series is organized.

We have been provided with the following CSV files:

  1. calendar.csv: Contains information about the dates on which the products are sold.
  2. sales_train_validation.csv: Contains the historical daily unit sales data per product and store.
  3. sell_prices.csv: Contains information about the price of the products sold per store and date.
  4. sample_submission.csv: The correct format for submissions.
  5. sales_train_evaluation.csv: Contains sales for days d_1 to d_1941.

3. How the problem is solved through Machine Learning:

The above problem is a time-series forecasting problem and can be solved using classical machine learning techniques to estimate the unit sales on a particular day from historical sales data.

● The given time series data can be re-framed as a supervised learning dataset using feature engineering techniques, and machine learning algorithms can then be applied to it.

● It can be solved with a machine learning regression model, with the generated features as the input variables and the number of items sold on a particular date (a real number) as the output variable.
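As a quick illustration of this re-framing, here is a minimal pandas sketch (toy numbers and illustrative column names, not the competition data):

```python
import pandas as pd

# A tiny daily sales series with an illustrative "units_sold" column.
sales = pd.DataFrame({
    "d": range(1, 11),
    "units_sold": [3, 0, 2, 5, 4, 1, 0, 6, 3, 2],
})

# Lagged copies of the target become the input features ...
for lag in (1, 7):
    sales[f"lag_{lag}"] = sales["units_sold"].shift(lag)

# ... and the current day's units_sold is the regression target.
supervised = sales.dropna()
X, y = supervised[["lag_1", "lag_7"]], supervised["units_sold"]
print(supervised)
```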

3.1 Performance Metrics:

The custom performance metric chosen for this problem is the Root Mean Squared Scaled Error (RMSSE), a variant of the well-known Mean Absolute Scaled Error (MASE). The measure is calculated for each series as follows:

RMSSE = \sqrt{\dfrac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t-\hat{Y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(Y_t-Y_{t-1}\right)^2}}

where Yₜ is the actual future value of the examined time series at point t, Ŷₜ the generated forecast, n the length of the training sample (number of historical observations), and h the forecasting horizon.

After estimating the RMSSE for all 42,840 time series, the Weighted RMSSE (WRMSSE) is calculated, as described in the M5 Competition guide from the MOFC, using the following formula:

WRMSSE = \sum_{i=1}^{42{,}840} w_i \cdot \text{RMSSE}_i

where wᵢ is the weight of the ith series. A lower WRMSSE score is better.
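For concreteness, here is a small NumPy sketch of RMSSE and WRMSSE under the definitions above; the toy series and weights are illustrative, and this is not the official evaluation code:

```python
import numpy as np

def rmsse(y_train, y_true, y_pred):
    """RMSSE for one series: forecast MSE over the horizon divided by the
    mean squared one-step naive error on the training part, square-rooted."""
    numerator = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    denominator = np.mean(np.diff(np.asarray(y_train)) ** 2)
    return np.sqrt(numerator / denominator)

def wrmsse(series, weights):
    """Weighted sum of per-series RMSSE values (weights sum to 1)."""
    return sum(w * rmsse(tr, yt, yp) for (tr, yt, yp), w in zip(series, weights))

# Toy example: two series, 5 training days, a 3-day horizon.
s1 = ([2, 3, 2, 4, 3], [3, 2, 4], [2.5, 2.5, 3.5])
s2 = ([0, 1, 0, 1, 1], [1, 0, 1], [1, 1, 1])
print(wrmsse([s1, s2], weights=[0.7, 0.3]))
```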

4. Exploratory Data Analysis and its observations:

To solve the problem through machine learning, we first have to understand the data; this important step is known as exploratory data analysis. It provides insights into the data, and exploring the data through different visualizations helps in feature engineering to create appropriate features.

4.1 Reading the data

4.1.1 Basic information about the data

Observations:

  1. In the data, there are 3 states, 10 stores, 3 categories, 7 departments, and 3,049 items.
  2. a. The states are CA, TX, and WI.
    b. The stores are CA_1, CA_2, CA_3, CA_4 (CA); TX_1, TX_2, TX_3 (TX); and WI_1, WI_2, WI_3 (WI).
    c. The categories are FOODS, HOUSEHOLD, and HOBBIES.
    d. The departments are FOODS_1, FOODS_2, FOODS_3 (FOODS); HOUSEHOLD_1, HOUSEHOLD_2 (HOUSEHOLD); and HOBBIES_1, HOBBIES_2 (HOBBIES).

4.2 Downcasting the data

Downcasting means casting the data to smaller types to reduce the amount of storage it uses. As pandas automatically creates int32, int64, float32, or float64 columns for numeric data, we can convert them to int8/int16 or float16/float32 where the value ranges allow it. Also, pandas stores categorical columns as objects, which take more storage compared to the category datatype, so we convert object columns to the category datatype.
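A minimal sketch of such a downcasting helper; the file paths in the usage comments are assumptions about where the competition files were downloaded:

```python
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest safe dtype and convert
    object (string) columns to the memory-friendly 'category' dtype."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype("category")
    return df

# Illustrative usage on the competition files (paths are assumptions):
# calendar = downcast(pd.read_csv("calendar.csv"))
# sales = downcast(pd.read_csv("sales_train_validation.csv"))
# prices = downcast(pd.read_csv("sell_prices.csv"))
```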

Observations:

  1. The size of each dataframe was reduced to approximately one-fourth of its original size.
  2. This reduces the chance of a ‘RAM crashed’ error.

4.3 EDA of Sales Data

The first five rows of the sales table

4.3.1 Mean sales

Observations:

  1. CA stores have the highest mean sales among all the stores, and CA_3 has the highest sales of all stores.
  2. CA stores also have the highest variance in mean sales among all the stores.

4.3.2 Rolling Averages sales

Rolling averages for window sizes 7, 10, 13, 30, and 90 are taken, and their scatter plots and box plots are drawn.
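A minimal sketch of how these rolling means could be computed and plotted with pandas; the file path and the aggregation over all items are assumptions, not the original notebook code:

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales_train_validation.csv")  # path is an assumption
day_cols = [c for c in sales.columns if c.startswith("d_")]

# Total units sold per day, summed over all items and stores.
daily_total = sales[day_cols].sum(axis=0).reset_index(drop=True)

# Rolling means for the window sizes used in this section.
for window in (7, 10, 13, 30, 90):
    daily_total.rolling(window).mean().plot(label=f"window={window}")

plt.title("Rolling average of total daily sales")
plt.xlabel("day index")
plt.ylabel("units sold")
plt.legend()
plt.show()
```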

Observations:

  1. Every rolling mean sales curve oscillates around a line and has an overall upward linear trend.
  2. CA_3 has the highest sales.
  3. CA stores have the highest variance in mean sales, which may imply that CA has a development disparity, i.e., some places are growing faster than others.
  4. WI and TX do not have much variance, which may imply that development is uniform in these states.

4.3.3 Rolling Medians sales

Observations:

  1. Took rolling medians for different window sizes, i.e., 7, 10, 13, 30, and 90.
  2. The observations are quite similar to rolling averages.

4.3.4 Rolling Minimums sales

Observations:

  1. Took rolling minimums for different window sizes, i.e., 7, 10, 13, 30, and 90.
  2. The sharp ups and downs indicate that there are days where unit sales are zero; these may be holidays.
  3. The other observations are quite similar to rolling averages.

4.3.5 Rolling Maximums sales

Observations:

  1. Took rolling maximums for different window sizes, i.e., 7, 10, 13, 30, and 90.
  2. The observations are quite similar to rolling averages.

4.3.6 Rolling 25th Percentiles

Observations:

  1. Took rolling 25th percentiles for different window sizes, i.e., 7, 10, 13, 30, and 90.
  2. The observations are quite similar to rolling averages.

4.3.7 Sales by state (California)

Observations:

  • The average sales in descending order are CA_3, CA_1, CA_2, CA_4.
  • CA stores have a large disparity in sales, and the sales curves never meet. This may show that there are “hubs” of development that do not change over time.

Observations:

  • Store CA_3 has maximum sales.
  • Store CA_4 has minimum sales

4.3.8 Sales by state (Wisconsin)

Observations:

  • The average sales in descending order are WI_2, WI_3, WI_1.
  • WI stores have a low disparity in sales and sales curves meet which may show that WI does not have specific ‘hubs’ of development and there is equity in development across the state.

Observations:

  • Store WI_2 has maximum sales.
  • Store WI_1 has minimum sales.

4.3.9 Sales by state (Texas)

Observations

  • The average sales in descending order are TX_2, TX_3, TX_1.
  • TX stores have a very low disparity in sales, and even though the sales curves meet, they do not do so as much as in WI.
  • The variance in TX stores is higher compared to WI, which shows that there might be ‘hubs’ of development, but not as pronounced as in CA stores.

Observations:

  • Store TX_2 has maximum sales
  • Store TX_1 has minimum sales.

4.3.10 Items by category

Observations:

  • The FOODS category has the highest number of items.
  • The number of items in descending order is FOODS, HOUSEHOLD, and HOBBIES.

Observations:

  • FOODS category items have the maximum number of sales.
  • On some days sales are nearly zero (the long spikes in the sales curves), which may show that there is a holiday or that items were unavailable on those days.

4.3.11 Take a random item that sells a lot

Observations

  • The sales curve is very erratic, which implies that many factors affect the sales on a given day.
  • On some days the sales curve is a flat line, which means the item was unavailable.
  • The item is unavailable for many periods of time.

4.4 EDA of sell price data

The first five rows of sell price data

Observations:

  • Item price distributions are almost uniform across all states (CA, WI, and TX).
  • Item HOBBIES_1_060 is the most expensive item sold at Walmart, priced at around 30.53 dollars.
  • In CA_2, CA_3, and CA_4 the most expensive item sold is HOBBIES_1_361, while for CA_1 the most expensive item sold is HOBBIES_1_060.
  • Item HOUSEHOLD_1_060 is the most expensive item sold in TX, priced at around 29.87 dollars.
  • Item HOBBIES_1_225, priced at around 30.51 dollars, is the most expensive item sold at TX_1 and TX_2, while item HOBBIES_1_361, priced at around 30.51 dollars, is the most expensive item at TX_3.

Observations:

  • Food category items are the cheapest items among Food, Hobbies, and Household items.
  • Household and Hobbies items have nearly the same price range.

4.5 EDA of calendar data

The first five rows of calendar data

Observations:

  • SNAP follows a different pattern in each state.
  • SNAP is allowed on the first ten days of the month in CA; it follows the pattern 101–011 in TX and the pattern 011 in WI.
  • There are a total of 30 unique events belonging to 5 unique types.
  • There are a total of 162 non-null values in event_name_1 and 5 non-null values in event_name_2 over approximately 4.5 years, which suggests that these events occur every year.

5. Existing approaches to the problem

Solution 1:

https://www.kaggle.com/anshuls235/time-series-forecasting-eda-fe-modelling

  1. The given files are quite large and take a lot of memory, so downcasting, i.e., type casting to smaller datatypes, is done first. It reduces the amount of storage used by the data.
  2. All the data is combined into one, i.e., the sales, price, and calendar data are merged into a single dataframe.
  3. After combining into a single dataframe, feature engineering is applied and various features are introduced, such as label encoding, mean encoding, lags, rolling window means, expanding window means, and trends.
  4. After adding all these features, time-based splitting is done.
  5. The LGBMRegressor model is applied to the data; hyperparameter tuning to select the best parameters is not shown.
  6. From the above solution, I learnt different types of feature engineering techniques and how an ensemble model can be applied to this problem, which helps to solve it better.

Solution 2:

https://www.kaggle.com/qcw171717/other-naive-forecasts-submission-score/

The above solution uses naive logic to make forecasts with the following approaches and checks the local WRMSSE score for each (a sketch of two of these baselines follows the list).

  1. The all-zeros approach forecasts all future sales as 0.
  2. The average-through-all-history approach takes the average of all historical observations and uses it as the forecast value.
  3. The mean-of-previous-10, 20, 30, 40, 50, or 60-days approaches take the respective means and use them for forecasting.
  4. The same-as-previous-28-days approach uses the previous 28 days’ values as the forecasts.
  5. The average-of-same-day-for-all-previous-weeks approach forecasts each day as the average of the same weekday over all previous weeks.
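A minimal sketch of two of these naive baselines, the same-as-previous-28-days forecast and a recent-window mean; the file path is an assumption, and this is not the notebook’s code:

```python
import numpy as np
import pandas as pd

HORIZON = 28
sales = pd.read_csv("sales_train_validation.csv")  # path is an assumption
day_cols = [c for c in sales.columns if c.startswith("d_")]
history = sales[day_cols].to_numpy()

# "Same as previous 28 days": repeat the last 28 observed days as the forecast.
same_as_last_28 = history[:, -HORIZON:]

# A recent-window mean (here 28 days), in the spirit of the mean-of-previous-N-days
# approaches above: every horizon day gets the same recent average.
mean_of_last_28 = np.tile(
    history[:, -HORIZON:].mean(axis=1, keepdims=True), (1, HORIZON))

# Shape the first baseline into the submission layout (F1 ... F28 per series id).
submission = pd.DataFrame(same_as_last_28,
                          columns=[f"F{i}" for i in range(1, HORIZON + 1)])
submission.insert(0, "id", sales["id"])
```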

The local WRMSSE scores for each approach are as follows:

The local WRMSSE score for the same-as-previous-28-days approach is the best, which will help to get a better score for this problem.

From the above naive approaches, it can be concluded that closer time steps give a better score for this problem compared to faraway time steps. A moving averages technique can also be applied to this problem.

Improvements:

  1. Neither of the above solutions handles missing values. Missing values come from the sell price data, where the prices of some items are not given, so we handle them with a mean imputation technique.
  2. Convert the data into a single dataframe in such a way that no item is missing from the data.
  3. After converting the data into a single dataframe, the number of rows is quite large, so take only the relevant data.
  4. Add rolling median features instead of rolling mean features because the mean is sensitive to outliers.
  5. Remove redundant features.

6. My First Cut Approach to the above problem:

  1. As this is a time-series forecasting problem that can be solved with machine learning techniques, the given data is first re-framed as a single supervised learning dataset. The sales dataset is melted, i.e., converted from wide format to long format, and all the datasets (sales, calendar, and sell prices) are merged into a single dataframe on “d_” (the day number).
  2. Missing values, which come from the sell price data, are handled with mean imputation.
  3. Feature engineering is applied and various features are introduced, such as lags, label encoding, and rolling medians.
  4. Some redundant features, such as date and weekday, are removed.
  5. As the resultant data has a very large number of rows, only rows with a “d_” (day number) greater than 1050 are kept. This reduces computation time and gives better results. (A condensed code sketch of steps 1–5 follows this list.)
  6. Time-based splitting is done, with the generated features as the input variables and the number of items sold on a particular date (a real number) as the output variable.
  7. We did hyperparameter tuning and trained Simple Moving Averages, ExtraTreesRegressor, RandomForestRegressor, and LGBMRegressor; LGBMRegressor turned out to be the best.
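A condensed sketch of steps 1–5, assuming the competition CSV files and their standard column names; the exact features are simplified and the code is not memory-optimized (downcasting as in section 4.2 helps in practice):

```python
import pandas as pd

# Load the competition files (paths are assumptions).
calendar = pd.read_csv("calendar.csv")
prices = pd.read_csv("sell_prices.csv")
sales = pd.read_csv("sales_train_validation.csv")

# 1. Melt the wide sales table into long format (one row per item, store, day)
#    and merge the calendar (on the day number) and prices (on store/item/week).
id_cols = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]
long = sales.melt(id_vars=id_cols, var_name="d", value_name="sold")
long = long.merge(calendar, on="d", how="left")
long = long.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")

# 2. Mean-impute the missing sell prices per item.
long["sell_price"] = long.groupby("item_id")["sell_price"].transform(
    lambda s: s.fillna(s.mean()))

# 3. Lag and rolling-median features per item/store series, plus label encoding.
grp = long.groupby(["item_id", "store_id"])["sold"]
long["lag_28"] = grp.shift(28)
long["rolling_median_7"] = grp.transform(lambda s: s.shift(28).rolling(7).median())
for col in ["item_id", "dept_id", "cat_id", "store_id", "state_id"]:
    long[col] = long[col].astype("category").cat.codes

# 4. Drop redundant columns and 5. keep only the later days (d_ > 1050).
long = long.drop(columns=["date", "weekday"])
long = long[long["d"].str[2:].astype(int) > 1050]
```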

7. Models Explanation

Here we used one simple model and three ensemble models:

  1. Simple Moving Averages.
  2. Random Forest Regression.
  3. Extra Trees Regression.
  4. LGBM Regression.

1. Simple Moving Averages:

  • Moving averages is a naive yet effective technique in time series forecasting. It can be used for data preparation, feature engineering, and even directly for making predictions.
  • On hyperparameter tuning, we found that a window size of 28 is the best.
  • Simple Moving Averages model training & prediction:
  • The local WRMSSE score we got for Simple Moving Averages is 1.06.
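A minimal sketch of a moving-average forecaster with window size 28; the recursive feed-back of predictions is one possible design, and the original implementation may differ:

```python
import numpy as np

def moving_average_forecast(history: np.ndarray, horizon: int = 28,
                            window: int = 28) -> np.ndarray:
    """Forecast each future day as the mean of the trailing `window` observations,
    feeding each prediction back into the series before forecasting the next day."""
    series = list(history.astype(float))
    preds = []
    for _ in range(horizon):
        preds.append(float(np.mean(series[-window:])))
        series.append(preds[-1])
    return np.array(preds)

# Illustrative usage on one item's daily unit sales (toy numbers).
print(moving_average_forecast(np.array([0, 2, 1, 3, 2, 2, 4] * 8)))
```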

2. Random Forest Regression:

  • The Random Forest model is a bagging technique that trains each tree on a bootstrap sample (sampling with replacement) of the data; this reduces variance during training and helps prevent overfitting.
  • On hyperparameter tuning, we found max_depth=26 and n_estimators=31 to be the best.
  • RandomForestRegressor model training & prediction:
  • The local WRMSSE score we got for RandomForestRegressor is 0.85.
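A minimal training-and-prediction sketch with the tuned values reported above; the random feature matrix is only a stand-in so the snippet runs, whereas in practice the time-based split of the engineered features is used:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data so the snippet runs end to end.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 10)), rng.integers(0, 10, 500)
X_valid = rng.random((100, 10))

rf = RandomForestRegressor(n_estimators=31, max_depth=26,
                           n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
valid_preds = rf.predict(X_valid)
```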

3. Extra Trees Regression:

The Extra Trees algorithm fits each decision tree on the whole training dataset, whereas random forest builds each decision tree from a bootstrap sample of the training dataset.

  • On hyperparameter tuning, we found max_depth=20 and n_estimators=22 to be the best.
  • ExtraTreesRegressor model training & prediction:
  • The local WRMSSE score we got for ExtraTreesRegressor is 0.82.
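A corresponding sketch for Extra Trees with the tuned values reported above, again with stand-in data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Stand-in data again; only the estimator and its tuned values change.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 10)), rng.integers(0, 10, 500)
X_valid = rng.random((100, 10))

et = ExtraTreesRegressor(n_estimators=22, max_depth=20,
                         n_jobs=-1, random_state=42)
et.fit(X_train, y_train)
valid_preds = et.predict(X_valid)
```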

4. LGBM Regression:

  • LGBM Regression is a boosting technique that reduces bias while training the model. It has faster training speed and higher efficiency, and it bins continuous values into discrete bins, which results in lower memory usage.
  • On hyperparameter tuning, we found learning_rate = 0.071, num_leaves = 12, and min_data_in_leaf = 147 to be the best.
  • LGBMRegressor model training & prediction:
  • The local WRMSSE score we got for LGBMRegressor is 0.70.
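A corresponding sketch for LightGBM with the tuned values reported above (min_child_samples is the scikit-learn-API name for min_data_in_leaf; stand-in data again):

```python
import numpy as np
from lightgbm import LGBMRegressor

# Stand-in data so the snippet runs; the real features come from section 6.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 10)), rng.integers(0, 10, 500)
X_valid = rng.random((100, 10))

lgbm = LGBMRegressor(learning_rate=0.071, num_leaves=12,
                     min_child_samples=147, random_state=42)
lgbm.fit(X_train, y_train)
valid_preds = lgbm.predict(X_valid)
```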

8. Comparison of the models in tabular format

  • Comparison of all models (local WRMSSE) is as follows:

Model | Local WRMSSE
--- | ---
Simple Moving Averages | 1.06
RandomForestRegressor | 0.85
ExtraTreesRegressor | 0.82
LGBMRegressor | 0.70

  • From the above table, we can conclude that LGBMRegressor is the best model.

9. Kaggle submission

  • Kaggle public & private scores for each model are as follows:
Kaggle Submission
  • From Kaggle’s public and private scores also, we can conclude that the best-performing model for this problem is LGBMRegressor.

There were about 5,558 participants in the competition, and my submission ranked 159th, which places it within the top 3 percent.

Kaggle Leaderboard

10. Final pipeline of the problem:

Here we are going to see model productionization.

We selected the LGBM model as the best model. In the real-time process, a single query data point (raw data) is sent to the trained model, i.e., the LGBM model, which forecasts the query point in seconds.

It happens in the following way:

  1. Select a random data point of size 1×1947 from the sales data as the query point; the next 28 forecasting days are initially filled with zeros and appended to the query point.
  2. The query data point is munged with the other data and converted into machine-learning-ready data. It is then sent through feature engineering to get the desired features for the model.
  3. Finally, the data is sent to the trained model for prediction. (A skeleton of this flow is sketched below.)
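A skeleton of this real-time flow; build_features is a hypothetical placeholder for the feature engineering described earlier, and the recursive one-day-at-a-time loop is one possible design:

```python
import numpy as np

def build_features(history: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for the feature engineering described above:
    here it simply uses the last 28 days of sales as the feature vector."""
    return history[-28:].reshape(1, -1)

def forecast_query_point(raw_point: np.ndarray, model, horizon: int = 28) -> np.ndarray:
    """Take one raw sales history, forecast the next `horizon` days one at a time,
    feeding each prediction back into the history before the next step."""
    history = raw_point.astype(float).copy()
    preds = []
    for _ in range(horizon):
        next_day = float(model.predict(build_features(history))[0])
        preds.append(next_day)
        history = np.append(history, next_day)
    return np.array(preds)

# Illustrative usage with any fitted regressor that accepts 28 lag features:
# preds = forecast_query_point(sales_row_values, trained_lgbm_model)
```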

Input:

Output:

11. Future Work

  1. Adding more features may result in a better score.
  2. Implementing a neural network with properly defined layers and hyperparameters may result in a better score.

Github Repo

  • If you are interested in this case study or want to improve it further, a Jupyter Notebook with all the code is available at my repo:
