M5 Forecasting - Accuracy

Estimate the unit sales of Walmart retail goods

Bharanish Kumar
Analytics Vidhya
16 min read · Jan 26, 2021


Table of Contents:

  1. Business/Real-world problem
  2. Problem Statement
  3. Source of data and data Overview
  4. Use of ML on the problem
  5. Performance Metrics
  6. Exploratory Data Analysis
  7. Existing Approach
  8. First-cut Solution
  9. Feature Engineering & Data preprocessing
  10. Models explanation
  11. Models Comparison
  12. Kaggle Submission
  13. Deployment
  14. Future work
  15. References

1. Business problem:

Sales forecasting is the process of estimating future sales, either short-term or long-term, from historical sales figures. Forecasting sales is essential for companies: for example, products that are expected to see low sales in the future can be supported through promotions, discounts, and other measures to increase revenue. Hence, for established businesses, it becomes important to forecast future sales accurately.

2. Problem Statement:

Using historical sales data from Walmart, the world’s largest company by revenue, we need to forecast daily sales for the next 28 days.

This problem is hosted on Kaggle by the Makridakis Open Forecasting Center (MOFC) at the University of Nicosia, which conducts cutting-edge forecasting research. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s. M5 Forecasting is the fifth iteration, in which we use hierarchical sales data from Walmart to forecast daily sales for the next 28 days.

3. Source of Data and Data overview:

Data is downloaded from Kaggle. Source

3.1 Data overview:

The sales data provided by Walmart covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events.

The flowchart below gives a high-level overview of how the M5 series are organized.

M5 series Organization overview

The data is mainly divided into 3 Files:

  1. calendar.csv — contains information about the dates on which the products are sold, along with attributes of those dates (e.g., month, year, day of week, and any special event on that date).
  2. sales_train_evaluation.csv — contains the historical daily unit sales per product and store (days d_1 to d_1941, i.e., 2011-01-29 to 2016-05-22).
  3. sell_prices.csv — contains information about the price of the products sold per store and week.
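
A minimal sketch of loading the three files with pandas (the data/ folder path is an assumption, not part of the original post):

```python
import pandas as pd

# Paths are assumptions; adjust to wherever the Kaggle files were downloaded.
calendar = pd.read_csv("data/calendar.csv")              # dates, events, SNAP flags
sales = pd.read_csv("data/sales_train_evaluation.csv")   # one row per item/store, columns d_1..d_1941
prices = pd.read_csv("data/sell_prices.csv")             # weekly sell_price per store_id/item_id

print(calendar.shape, sales.shape, prices.shape)
print(sales.filter(like="d_").columns[:5].tolist())      # first few daily-sales columns
```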

4. Use of ML on the problem:

The problem we are solving is a time-series problem that we can transform into a supervised learning problem by performing feature engineering on the raw time-series data. We can then use these features as input variables and the sales of the item (a real number) as the output variable, and solve the forecasting problem using machine learning regression models.

5. Performance Metrics:

An often-used metric in regression-based problems is RMSE. Here we use the custom metric WRMSSE (Weighted Root Mean Squared Scaled Error) as the evaluation metric for this problem. For each individual series, the RMSSE is defined as:

where Y_t is the actual sales value of the examined time series at point t, Ŷ_t is the forecast generated by the model, n is the length of the training sample (number of historical observations), and h is the forecasting horizon (28 days). The WRMSSE is the weighted sum of the RMSSE values of all series, with weights computed from each series’ dollar sales over the last 28 days of the training data.
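
For reference, a minimal sketch of how the RMSSE of a single series and the weighted aggregate WRMSSE can be computed, assuming the per-series weights w_i are already given:

```python
import numpy as np

def rmsse(y_train, y_true, y_pred):
    """Root Mean Squared Scaled Error for a single series.

    y_train : historical observations (length n)
    y_true  : actual sales over the forecast horizon (length h)
    y_pred  : forecast sales over the horizon (length h)
    """
    numerator = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    # scale: mean squared one-step naive error over the training history
    denominator = np.mean(np.diff(np.asarray(y_train)) ** 2)
    return np.sqrt(numerator / denominator)

def wrmsse(series, weights):
    """Weighted sum of per-series RMSSE values.

    series  : iterable of (y_train, y_true, y_pred) tuples
    weights : per-series weights w_i (they sum to 1 in the competition)
    """
    return sum(w * rmsse(tr, yt, yp) for (tr, yt, yp), w in zip(series, weights))
```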

6. Exploratory Data Analysis:

The first step in approaching a machine learning problem is to get insights from the data. Let’s start by analyzing the data, checking whether there are any missing values to be preprocessed, and drawing meaningful conclusions about how each feature impacts our target, sales.

6.1. Data overview:

Sample data points of each file

6.2. Total sales across all stores daily and yearly pattern:

Observations:

  1. There is a slight upward trend in total sales from 2011 to 2016, and on 1st January of every year we observe zero sales (the stores are closed on New Year’s Day).
  2. Overall sales show a similar pattern across years, i.e., there is yearly seasonality.
  3. From 2012 to 2015, overall sales show a similar pattern in each two-month window, from which we can assume that each year’s time series in 2012–2015 is stationary.

6.3. Sales proportion:

Observations:

  1. Out of the three states, California accounts for the highest percentage of total sales, while Texas’s and Wisconsin’s total sales are almost equal.
  2. Even though California accounts for 43.6% of total sales, interestingly two of its stores are in the bottom four by total sales; the CA_3 store has the highest sales and the CA_4 store the lowest.
  3. The Foods category has the highest sales proportion and the Hobbies category the lowest. People shop more for household needs than for their hobbies, and food products are essential, so their sales are also high.
  4. The highest sales come from the ‘FOODS_3’ department and the lowest from the ‘HOBBIES_2’ department. The ‘FOODS_3’ department alone accounts for 50% of sales, while the other six departments account for the remaining 50%. Even though total sales from the ‘HOBBIES’ category are very low (9.3%), interestingly its ‘HOBBIES_1’ department has more sales than the FOODS category’s ‘FOODS_1’ department.

6.4. States and Product category Vs sales proportion:

Observations:

  1. All three states have a similar proportion of sales for all three product categories; the Foods category’s sales proportion is the same for Texas and Wisconsin.
  2. The Foods category accounts for the highest sales and Hobbies the lowest in all three states. Texas and Wisconsin have a similar proportion of sales for the Hobbies category.

6.5. Year and Product category Vs sales proportion:

Observations:

  1. 2015 has the highest number of sales and 2016 the lowest (only four months of sales are provided).
  2. Of the three categories, Foods sales are the highest in every year. Foods, Household, and Hobbies sales are each about the same in 2013 and 2014.
  3. There is not much difference in Foods-category sales across the years 2012–2015.

6.6. Daily total sales per item category:

Observations:

  1. Daily total sales of the Foods category are the highest and have trended upwards (sales increase over the years).
  2. Daily total sales of the Household category have also trended upwards, while the Hobbies category has the lowest sales and remains almost constant over the years.

6.7. Top 10 products with the highest sales:

Observations:

  1. We observe that the top 10 products with the highest number of sales are all food products (food products are essential).
  2. The highest sales are for product ‘FOODS_3_090_CA_3’, belonging to the ‘CA_3’ store in California; the second highest are for ‘FOODS_3_586_TX_2’, belonging to the ‘TX_2’ store in Texas; and the third highest are for ‘FOODS_3_586_TX_3’, belonging to the ‘TX_3’ store in Texas.
  3. Wisconsin’s ‘FOODS_3_090_WI_3’ product, belonging to the ‘WI_3’ store, has the sixth highest sales.
  4. Interestingly, even though California has a higher share of Walmart’s sales, six of the top 10 products with the highest sales are from Texas.

6.8. Missing value analysis:

Observations:

  1. There are no missing values in the prices data itself.
  2. However, sell_price is the price of the product for a given week/store, provided per week (averaged across seven days). If it is not available, the product was not sold during the examined week. We need to check for products with such missing prices.
  3. The CA_2 store has the highest share of items with missing prices; the TX_1 and TX_2 stores have equal percentages of items with missing prices.
  4. In every store, more than 17% of item prices are missing (a quick sketch of this check follows below).
  5. In the calendar data there are no missing values in any column except the event and event-type columns; if event_name_1/event_name_2 is NaN, that day is simply a non-event day.
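
One way to reproduce the missing-price check is to cross the full (store, item, week) grid with sell_prices and count the gaps; a rough sketch (assuming pandas ≥ 1.2 for the cross merge):

```python
import pandas as pd

prices = pd.read_csv("data/sell_prices.csv")
calendar = pd.read_csv("data/calendar.csv")

# Full grid of every (store, item) pair crossed with every week in the calendar
weeks = pd.DataFrame({"wm_yr_wk": calendar["wm_yr_wk"].unique()})
pairs = prices[["store_id", "item_id"]].drop_duplicates()
grid = pairs.merge(weeks, how="cross")

# Weeks with no price row mean the item was not sold in that store that week
merged = grid.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
missing_pct = (merged.groupby("store_id")["sell_price"]
                     .apply(lambda s: s.isna().mean() * 100)
                     .sort_values(ascending=False))
print(missing_pct)
```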

6.9. SNAP Vs Non-SNAP day Sales proportion on each product category and each state:

Observations:

  1. Sales on SNAP days outperform sales on non-SNAP days for the Foods product category. The number of sales is similar on SNAP and non-SNAP days for the Hobbies and Household categories.
  2. SNAP sales are highest in Wisconsin and lowest in California; Wisconsin’s SNAP sales are significantly higher than its non-SNAP sales.

6.10. Impact of events on sales:

Observations:

  1. There is an increase in sales on sporting-event days, whereas sales decrease on national-event days.
  2. The increase in sales is highest on sporting-event days and second highest on cultural-event days, while on religious-event days sales remain consistent (no increase or decrease relative to the average sales on event days).
  3. Sales on sporting events are 110% of the average sales on event days, i.e., a 10% increase relative to the average.
  4. Sales on cultural events are 108% of the average sales on event days, i.e., an 8% increase relative to the average.
  5. Sales on national events are 89% of the average sales on event days, i.e., an 11% decrease relative to the average.
  6. Sales are higher on non-event days than on event days; on the whole, sales decrease on event days (people shop less on holidays), but not by much.

6.11. Sales trend from start Vs end over a month period:

Observations:

  1. Clearly, sales are high at the start of the month and remain consistent for the first 15 days.
  2. After the 15th, sales decrease gradually to about 75% of the sales at the start of the month.

6.12. Sales over weekend Vs weekdays in each month of 2016:

Observations:

  1. Sales are higher on the starting days of each month.
  2. Weekend sales are high both at the start and at the end of the month.
  3. Weekday sales decrease towards the end of the month.
  4. Sales over weekends, however, are consistent (high at both the start and the end of the month).
  5. From this we can say that sales are impacted by whether a day is a weekend or a weekday.

6.13. Relative change in sales from prior non-snap days to snap days:

Observations:

  1. Each SNAP event lasts for 15 days, so we compare the sales of the 15 SNAP days with the sales of the 15 days preceding the first SNAP day.
  2. We observe that there is always an increase in Walmart’s sales over SNAP events (across all three states).
  3. Here we plotted the relative change/increase in sales for each of the 60 SNAP events that occur during 2011–2016.
  4. The 37th SNAP event, from 2014-05-01 to 2014-05-15, sees an increase in sales of 30.55%, and the 49th SNAP event, from 2015-03-01 to 2015-03-15, sees an increase of 29.96%.
  5. The smallest increase in sales over a SNAP event is 3.31%, for the 57th SNAP event from 2015-11-01 to 2015-11-15.
  6. Another interesting observation is that SNAP events occur at the beginning of the month.

6.14. Revenue generated for each product category, each state, each store, each year:

Observations:

  1. Food products generate the highest revenue, followed by Household products and lastly Hobbies products; 88% of revenue is generated by Foods and Household products alone, similar to each category’s sales proportion.
  2. Revenue from California is the highest, and Texas and Wisconsin have similar shares of the revenue. Even though Wisconsin’s sales proportion is 27.6%, its revenue share is only 26.3%; on the contrary, California’s sales proportion is 43.6% but its revenue share rises to 45%. For Texas, revenue and sales proportion are the same, 28.8%.
  3. The CA_3 store generates the highest revenue and CA_4 the lowest, matching their sales proportions. Interestingly, the TX_3 store is sixth in sales proportion but fourth in revenue, while the WI_3 store is fifth in sales proportion but only seventh in revenue.
  4. In Texas and California, even though some stores have fewer sales, the revenue generated is high (perhaps higher-priced products are sold in these states); in Wisconsin, on the contrary, even though sales are high, the revenue generated is comparatively low (perhaps more food products, which are lower priced, are sold here).
  5. The year 2015 has the highest revenue. We observe that revenue increases year by year; 2016 has less revenue since only four months of data are available.

6.15. State, Store, Year’s weekend Vs weekday sales:

Observations:

  1. All three states have higher sales on weekends than on weekdays.
  2. In California and Texas, Sunday sales are higher than Saturday’s, but in Wisconsin, Sunday sales are lower than Saturday’s.
  3. In all stores of California and Texas, weekend sales are high and Sunday sales exceed Saturday’s. The CA_3 store has the highest weekend sales; as we saw, CA_3 also has the highest revenue and the highest sales proportion.
  4. In all Wisconsin stores, weekend sales are high and Sunday’s sales are lower than Saturday’s.
  5. In all years, weekend sales are higher than weekday sales; in 2011 and 2013, Saturday’s sales are higher than Sunday’s.

7. Existing approach:

7.1. Winner’s solution:

https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163216

In this solution, the model used for training is LightGBM with a Tweedie objective. The validation scheme is time-based splitting into 5 folds, where in each fold 28 days of sales are forecast using the prior historical sales data: 1st fold — train on d1–d1577 and predict d1578–d1605; 2nd fold — train on d1–d1829 and predict d1830–d1857; 3rd fold — train on d1–d1857 and predict d1858–d1885; 4th fold — train on d1–d1885 and predict d1886–d1913; 5th fold — train on d1–d1913 and predict d1914–d1941. The WRMSSE error is then averaged over the folds. The final model is trained on d1–d1941 and predicts the 28 days d1942–d1969.

The model-split strategy is one model per store per forecast week: for each of the 10 stores, one model is trained for each week of the horizon, e.g., model m1 predicts the first 7 days (F1–F7), m2 the second 7 days (F8–F14), m3 the third 7 days (F15–F21), and m4 the fourth 7 days (F22–F28). Altogether 40 models are trained (10 stores × 4 weeks of the 28-day forecast horizon).

The features extracted are lag features and rolling mean/std features with different window sizes. The lag features used for model m1 are {lag_7, lag_8, …, lag_21}, for m2 {lag_14, lag_15, …, lag_28}, for m3 {lag_21, lag_22, …, lag_35}, and for m4 {lag_28, lag_29, …, lag_42}.
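
A rough sketch of this store × forecast-week model split, assuming a long-format feature DataFrame with a store_id column, a numeric day index d, a sales target, and the lag columns already built (all names and the LightGBM settings besides the Tweedie objective are illustrative, not the winner’s actual code):

```python
import lightgbm as lgb

# Lag ranges per forecast week, as described above
HORIZON_WEEKS = {1: range(7, 22), 2: range(14, 29), 3: range(21, 36), 4: range(28, 43)}
TRAIN_END = 1941   # last training day, d_1941

def train_store_week_models(df, stores, base_features):
    """Train one LightGBM model per (store, forecast week): 10 stores x 4 weeks = 40 models."""
    models = {}
    for store in stores:
        store_df = df[df["store_id"] == store]
        for week, lag_range in HORIZON_WEEKS.items():
            lag_cols = [f"lag_{l}" for l in lag_range]          # e.g. lag_7..lag_21 for week 1
            cols = base_features + lag_cols
            train = store_df[store_df["d"] <= TRAIN_END]
            model = lgb.LGBMRegressor(objective="tweedie", tweedie_variance_power=1.1,
                                      n_estimators=1000, learning_rate=0.03)
            model.fit(train[cols], train["sales"])
            models[(store, week)] = model
    return models
```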

7.2. Improvements:

As we saw in the EDA, the sell_price value (the price of the product for a given week/store, provided per week as an average across seven days) is sometimes missing. For all products with missing sell_price values, we impute the missing values with the mean sell_price of that product id. We also create a new set of features on the sales data: rolling median features alongside the rolling mean features, since median values are less prone to outliers.
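
A minimal sketch of both improvements, assuming a long-format DataFrame df with columns id, d, sales, and sell_price (the column names and window sizes are assumptions based on the merged data described in section 9):

```python
import pandas as pd

def add_improvements(df):
    # Impute missing weekly prices with the mean sell_price of that product id
    df["sell_price"] = df.groupby("id")["sell_price"].transform(
        lambda s: s.fillna(s.mean()))

    # Rolling median features alongside rolling means; shifted 28 days so that
    # only information available at forecast time is used
    for window in (7, 28):
        df[f"rolling_mean_{window}"] = df.groupby("id")["sales"].transform(
            lambda s: s.shift(28).rolling(window).mean())
        df[f"rolling_median_{window}"] = df.groupby("id")["sales"].transform(
            lambda s: s.shift(28).rolling(window).median())
    return df
```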

8. First-cut Solution:

8.1 Forecast sales as the mean of the past 4 weeks’ sales for that weekday

From the EDA we observed that there is a weekly pattern in sales, i.e., sales follow a trend within the week, with weekend sales higher than weekday sales. Using the most recent 4 weeks of training data, we compute the average sales for each weekday and then forecast the sales for the next 4 weeks with this average value based on the weekday.
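
A minimal sketch of this baseline, assuming sales is the wide sales_train_evaluation DataFrame with daily columns d_1 … d_1941 and the forecast horizon d_1942 … d_1969:

```python
import numpy as np
import pandas as pd

sales = pd.read_csv("data/sales_train_evaluation.csv")

# Last 4 weeks of training data: d_1914 ... d_1941 (28 columns)
last_4_weeks = sales[[f"d_{d}" for d in range(1914, 1942)]].to_numpy()

# Average the four observations of each weekday; because the horizon is exactly
# 4 weeks, d_1942 falls on the same weekday as d_1914, d_1921, d_1928, d_1935.
weekday_mean = last_4_weeks.reshape(len(sales), 4, 7).mean(axis=1)   # (items, 7)

# Repeat the weekly pattern over the 28-day horizon d_1942 ... d_1969
forecast = np.tile(weekday_mean, (1, 4))                             # (items, 28)
```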

8.2. Moving averages

As the name suggests, sales are forecast as the average of the last N days of sales, where N is a hyperparameter. After tuning with various values of N, we concluded that N = 35 (i.e., forecasting sales on a given day as the mean sales of the previous 35 days) gives the best performance.
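
A minimal sketch of the moving-average baseline, applied per item series:

```python
import numpy as np

def moving_average_forecast(history, horizon=28, n=35):
    """Forecast each future day as the mean of the previous n days,
    appending each forecast to the history before predicting the next day."""
    history = list(history)
    forecasts = []
    for _ in range(horizon):
        forecasts.append(float(np.mean(history[-n:])))
        history.append(forecasts[-1])
    return np.array(forecasts)

# Example: forecast the next 28 days for the first item from its d_1 ... d_1941 history
# preds = moving_average_forecast(sales.filter(like="d_").iloc[0].to_numpy(), n=35)
```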

9. Feature Engineering & Data preprocessing:

  1. First we create a single DataFrame by combining the sales, calendar, and prices data, and then preprocess the missing prices by imputing them with the mean price of that id.
  2. Extracting features from the raw time series plays an important role in how well the model is able to forecast sales. Here we create some important features on the historical sales, such as lag features and rolling features.
  3. Lag features are created so that, for a product on the current day, we use its sales from up to 3 months prior, with a shift of 28 days to account for the forecast horizon. Rolling-aggregate mean and median features with a constant shift are also created.
  4. Some other features extracted from the calendar data are the day of the month; a year feature taking value 0 for 2011, 1 for 2012, …, 5 for 2016; the week number of a day within the month (e.g., the 29th of January corresponds to the 5th week of January); and a flag indicating whether the day is a weekend.
  5. Categorical features are transformed into label-encoded features using LabelEncoder, and the ‘d’ column values are split to keep only the day number. With this, all features are either numerical or label encoded before applying regression models to the final dataset (see the sketch after this list).
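
A condensed sketch of this pipeline, assuming the three raw DataFrames from section 3 (column names follow the M5 files; the exact lag and window choices here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def build_features(sales, calendar, prices):
    """Illustrative feature-engineering pipeline on the M5 data."""
    # 1. melt wide daily columns to long format and join calendar + prices
    df = sales.melt(id_vars=["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"],
                    var_name="d", value_name="sales")
    df = df.merge(calendar, on="d", how="left")
    df = df.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
    df["sell_price"] = df.groupby("id")["sell_price"].transform(lambda s: s.fillna(s.mean()))

    # 2. lag and rolling features, shifted 28 days so they are known at forecast time
    for lag in (28, 35, 42, 56, 84):                       # up to ~3 months back
        df[f"lag_{lag}"] = df.groupby("id")["sales"].shift(lag)
    for window in (7, 28):
        df[f"roll_mean_{window}"] = df.groupby("id")["sales"].transform(
            lambda s: s.shift(28).rolling(window).mean())

    # 3. calendar-derived features
    df["date"] = pd.to_datetime(df["date"])
    df["day_of_month"] = df["date"].dt.day
    df["year_idx"] = df["date"].dt.year - 2011             # 0 for 2011 ... 5 for 2016
    df["week_of_month"] = (df["day_of_month"] - 1) // 7 + 1
    df["is_weekend"] = df["date"].dt.dayofweek.isin([5, 6]).astype(int)

    # 4. label-encode categoricals and keep only the day number from 'd'
    for col in ["item_id", "dept_id", "cat_id", "store_id", "state_id", "event_name_1"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    df["d"] = df["d"].str.replace("d_", "").astype(int)
    return df
```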

10. Models explanation:

10.1. Linear and Tree-Based models:

  1. After feature engineering and data preprocessing we are left with 58 million data points, which is a massive amount of data. Instead of trying to build ML models on this huge data scale, we use the data from the last two years, i.e., from Jan 1st, 2014. Then we split the data into train, CV, and test sets using time-based splitting as follows (see the sketch below):
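
A minimal sketch of this time-based split, assuming the merged feature DataFrame df from the previous section (the exact cut-off days shown are illustrative):

```python
# Keep data from 2014 onwards, hold out the last 28 days as test and the
# previous 28 days as cross-validation; train on everything before that.
df = df[df["date"] >= "2014-01-01"]

train = df[df["d"] <= 1885]                          # train: up to d_1885
cv    = df[(df["d"] > 1885) & (df["d"] <= 1913)]     # CV:    d_1886 - d_1913
test  = df[(df["d"] > 1913) & (df["d"] <= 1941)]     # test:  d_1914 - d_1941
```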

2. SGDRegressor with squared loss, a linear model, is fitted on the train data, and cross-validation with different values of alpha (the multiplier of the regularization term) is performed on the CV data; alpha = 0.1 gives the lowest WRMSSE score. Sales are then forecast on the test data with alpha = 0.1, giving a local WRMSSE score of 0.8931.

3. Other tree-based models such as RandomForest, LGBMRegressor, and AdaBoost regressor are applied. RandomForest works on the technique of bootstrap aggregation, training each base model (a decision tree) on a different data sample drawn with replacement; this way it combines multiple decision trees in determining the final output rather than relying on individual trees. LGBMRegressor is a gradient boosting technique, prefixed ‘Light’ for its high speed and low memory usage. AdaBoost is also a boosting algorithm, which combines multiple weak learners into a single strong learner.

4. On hyper-parameter tuning of all these models, it is observed that RandomForestRegressor with best hyper-parameters n_estimators = 50 and max_depth = 10 gives a local WRMSSE score of 0.7487.

LGBMRegressor with best hyper-parameters num_leaves = 125 and learning_rate = 0.075 gives a local WRMSSE score of 0.6087.

AdaBoostRegressor with best hyper-parameters n_estimators = 50 and learning_rate = 0.05 gives a local WRMSSE score of 0.7944.
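
For reference, a sketch of fitting these models with the best hyper-parameters reported above (feature selection, NaN handling, and the unspecified defaults are assumptions):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from lightgbm import LGBMRegressor

# Lag/rolling features are NaN at the start of each series; fill them for the sketch
train, cv, test = (s.fillna(0) for s in (train, cv, test))

# Keep only numeric columns as inputs; 'sales' is the target
features = [c for c in train.select_dtypes("number").columns if c != "sales"]

models = {
    "sgd":  SGDRegressor(loss="squared_error", alpha=0.1),
    "rf":   RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1),
    "lgbm": LGBMRegressor(num_leaves=125, learning_rate=0.075),
    "ada":  AdaBoostRegressor(n_estimators=50, learning_rate=0.05),
}
for name, model in models.items():
    model.fit(train[features], train["sales"])
    test_preds = model.predict(test[features])   # scored offline with the local WRMSSE evaluator
```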

10.2. Custom Ensemble:

  1. As we observe that the ensemble tree-based models perform better than the linear model, we build a custom ensemble model as follows (a sketch follows after this list):
  2. First, split the entire dataset into train and test sets using time-based splitting; then split the train set into two halves, D1 and D2, again by time.
  3. From dataset D1 we sample with replacement to get N datasets and train N base learners, where N (the number of base learners) is a hyper-parameter. For each trained base learner, we get its predictions on D2 and on the test data and use them as features.
  4. We train the meta-model with X_i as the N features, where each feature is the prediction of one trained base learner on D2, and Y_i being the actual target values of D2.
  5. After training the meta-model, we forecast sales from it using the N features of the test data extracted from the N trained base learners.
  6. Hyper-parameter tuning is performed over different base learners, meta-models, and N, and we find that XGBRegressor as base learner and LGBMRegressor as meta-model with N = 18 gives the lowest local WRMSSE of 0.6995.
custom ensemble model
custom ensemble predictions
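
A minimal sketch of this stacking scheme, assuming the D1, D2, and test splits are already prepared as feature/target DataFrames (the base-learner hyper-parameters shown are illustrative):

```python
import numpy as np
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

def fit_custom_ensemble(X_d1, y_d1, X_d2, y_d2, n_learners=18, seed=42):
    """N bootstrapped XGB base learners on D1, an LGBM meta-model on their D2 predictions."""
    rng = np.random.default_rng(seed)
    base_learners, d2_features = [], []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X_d1), len(X_d1))        # sample D1 with replacement
        model = XGBRegressor(n_estimators=200, max_depth=8)
        model.fit(X_d1.iloc[idx], y_d1.iloc[idx])
        base_learners.append(model)
        d2_features.append(model.predict(X_d2))            # one meta-feature per base learner
    meta_X = np.column_stack(d2_features)
    meta_model = LGBMRegressor().fit(meta_X, y_d2)
    return base_learners, meta_model

def predict_custom_ensemble(base_learners, meta_model, X_test):
    meta_X = np.column_stack([m.predict(X_test) for m in base_learners])
    return meta_model.predict(meta_X)
```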

11. Models Comparison:

Performance of the models tried on our problem:

It is observed that the moving-average and linear models perform worse than the naïve model; the best-performing models are the ensemble models.

12. Kaggle Submission:

Kaggle’s score on each of the models:

  1. From the private leaderboard scores above, we observe that of all the models, the custom ensemble model performs best, forecasting sales with the lowest private (WRMSSE) score of 0.66282.
  2. Once we generate the validation days’ (1914–1941) and evaluation days’ (1942–1969) sales from the final best custom ensemble model and submit them in the given format, we get a leaderboard score of 0.66282, which ranks 352 out of 5,558 participants, i.e., in the top 7%.
Kaggle leaderboard rank of best performing model

13. Deployment:

  1. After training the custom ensemble model with the best hyper-parameters, we store the base models and the meta-model in pickle files and deploy them to AWS along with a Flask API built around the final pipeline, which returns the forecasted sales (a minimal API sketch follows below).
  2. A web page is created that takes a csv or txt file as input and returns the forecasted sales for the next 28 days, i.e., May 23rd, 2016 to June 19th, 2016.
  3. The forecasted sales for a sample product from a store, given its historical sales in a csv file, are shown below.
html page taking csv file as input
Forecasted sales
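
For reference, a minimal sketch of what such a Flask API can look like (the route name, pickle file names, and the build_features_for_horizon helper are assumptions, not the deployed code):

```python
import pickle
import numpy as np
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained base learners and meta-model saved as pickle files
with open("base_models.pkl", "rb") as f:
    base_models = pickle.load(f)
with open("meta_model.pkl", "rb") as f:
    meta_model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    history = pd.read_csv(request.files["file"])                # csv/txt with historical sales
    features = build_features_for_horizon(history)              # assumed feature pipeline for the 28 days
    meta_x = np.column_stack([m.predict(features) for m in base_models])
    forecast = meta_model.predict(meta_x)                        # 28 daily forecasts
    return jsonify({"forecast": forecast.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```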

Demo video explaining the results

14. Future Work:

  1. Trying a neural network approach, as it gives the flexibility of various loss functions.
  2. Trying LSTM models for time-series forecasting.

15. References:

  1. https://www.analyticsvidhya.com/blog/2019/12/6-powerful-feature-engineering-techniques-time-series/
  2. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/144067
  3. https://www.kaggle.com/dhananjay3/wrmsse-evaluator-with-extra-features
  4. https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda
  5. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163216
  6. https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/3194/mapping-to-ml-problem-time-series-forecastingregression/7/module-6-machine-learning-real-world-case-studies

The complete project can be found here on github. For any queries regarding this project, you can contact me on Linkedin.
