M5 Forecasting — Accuracy
Estimate the unit sales of Walmart retail goods
Table of Contents:
- Business problem
- Source of data
- Machine Learning Problem
- Exploratory Data Analysis
- Feature Engineering & Data pre-processing
- Models approach
- Kaggle Submission
- Future work
1. Business problem:
In today’s ultra-competitive business landscape, every company wants to increase its revenue and sales. We have all heard about strategic planning for hiring, but do you really know when exactly you need to expand your team, start your next promotional campaign, or launch your new item? Sales forecasting answers these questions.
So are we predicting the future? Yes!! Though not perfectly. We do it by employing machine learning techniques.
Forecasting provides knowledge about the nature of future conditions. A sales forecast helps every business make better decisions: it supports overall business planning, budgeting, and risk management; it allows companies to allocate resources efficiently for future growth and to manage cash flow; and it helps businesses estimate their costs and revenue accurately, based on which they can predict their short-term and long-term performance.
Past sales performance is a good leading indicator of future sales performance. The MOFC is well known for its Makridakis Competitions, and in this problem the fifth iteration (M5 Forecasting) provides historical sales data from Walmart, the world’s largest retail company.
1.1 Objective:
The main objective is to estimate, as precisely as possible, point forecasts of the unit sales of various products sold by Walmart in the USA, which in turn helps individual Walmart stores increase their revenue.
2. Source of Data
The hierarchical sales data of Walmart is provided in the Kaggle competition M5 Forecasting - Accuracy.
2.1 Data overview:
The sales data provided by Walmart covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events.
- calendar.csv: contains information about the dates on which the products are sold and the events held on those days.
- sales_train_evaluation.csv: contains the historical daily unit sales of each product at each store from day 1 to day 1941.
- sell_prices.csv: contains the weekly price of each product at each store.
3. Machine Learning Problem
The above problem is a time-series problem that can be solved using classical machine learning techniques to estimate the unit sales on a particular day from historical sales data. The given time series can be re-framed as a supervised learning dataset through feature engineering, after which machine learning algorithms can be applied. It becomes a regression problem: the input variables are the engineered features, and the output variable is the number of items sold on a particular date, which belongs to the real numbers.
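As a minimal sketch of this re-framing (on an invented toy series, not the competition data), lagged copies of the series become the input features and the current day's sales the regression target:

```python
import pandas as pd

# Toy daily sales series; in the competition each (item, store) pair
# has its own series of unit sales.
sales = pd.DataFrame({"d": range(1, 11),
                      "units": [3, 2, 0, 4, 1, 5, 2, 3, 0, 6]})

# Re-frame as supervised learning: lagged sales become input features.
sales["lag_1"] = sales["units"].shift(1)
sales["lag_7"] = sales["units"].shift(7)

supervised = sales.dropna()           # keep rows where all lags exist
X = supervised[["lag_1", "lag_7"]]    # input features
y = supervised["units"]               # regression target
```

Any regression model can then be fitted on `X` and `y`; the lag choices here are illustrative.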
4. Exploratory Data Analysis
Among the 3049 unique products, around 47% (1437) fall under the FOODS category, 34% (1047) under HOUSEHOLD, and the remaining 18.5% (565) under HOBBIES.
Overall sales show smooth growth (an upward trend) with repeating seasonal patterns, and a drop in December of each year.
We can clearly notice that stores located in CA perform well, with an upward trend (29.1 million total sales). Until October 2012, the monthly sales of WI stores were low compared to the others, but afterwards they improved to a level similar to the TX stores. WI has total sales of around 18.5 million, the lowest of the three states.
CA_3 has the highest store-level sales, with 11.3 million units overall. WI_2 and WI_3 have reasonably good sales even though WI has a smaller population. CA_4 has the least sales, with 4.1 million.
It is clear that FOODS products are in the highest demand, with a gradual increase in sales over the years: around 68% of all sales come from the FOODS category. HOBBIES products have the smallest share of sales, around 9%.
The FOODS_3 department stands out as the highest-demand department, with an upward sales trend and repeating seasonality. From June 2012, HOUSEHOLD_1 sales increased, showing a slight upward trend. FOODS_3 products account for around 49% of total sales, with 32.9 million units sold. HOBBIES_2 products were the least sold, with total sales of 0.5 million.
FOODS_3_090 has the highest demand among the 3049 products, with total sales of about 1.018 million units. The second highest is FOODS_3_586, with about 932k units. HOUSEHOLD_2_101 has the lowest total sales, at 593 units.
In California, FOODS_3_090 has the highest demand (493k units) and HOBBIES_1_052 the lowest (239 units). In Wisconsin, FOODS_3_226 has the highest demand (250.7k units) and HOUSEHOLD_2_130 the lowest (109 units).
We can clearly notice that average sales on Saturday and Sunday are higher than on other days, at around 41.7k units. August has the highest average monthly sales, around 36k, while December has the lowest, at 32.9k.
Monthly sales usually increased every year, though comparing 2013 and 2014, the 2014 monthly sales dropped slightly.
Average sales on Labor Day are the highest, at 42,154 units, while sales on Christmas are very low, with an average of just 15.
We can clearly notice that the selling price is not constant over time: for a given product it differs in some weeks and between stores, although within a week each store has a fixed price. Sales in CA_3 fluctuate with these price changes. Prices are usually similar for CA_1 and CA_2, and likewise across the stores located in TX. There is no change in selling prices over the past 25 weeks.
We can observe that ‘wm_yr_wk’ and ‘year’ are the features most strongly correlated with sales. The ‘month’ feature also has a small positive correlation of 0.022.
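A correlation check of this kind can be reproduced with pandas; the frame below is a toy stand-in for the merged training data, with invented values:

```python
import pandas as pd

# Toy frame standing in for the merged training data; the real one has
# 'wm_yr_wk', 'year', 'month', etc. coming from calendar.csv.
df = pd.DataFrame({
    "wm_yr_wk": [11101, 11101, 11102, 11102, 11103, 11103],
    "year":     [2011, 2011, 2011, 2011, 2012, 2012],
    "month":    [1, 1, 2, 2, 1, 1],
    "units":    [2, 3, 3, 4, 5, 6],
})

# Pearson correlation of each feature with the sales target,
# sorted from most to least positively correlated.
corr = df.corr()["units"].drop("units").sort_values(ascending=False)
print(corr)
```

On the real data this is what surfaces `wm_yr_wk` and `year` as the strongest features; the toy values here just make the calculation runnable.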
5. Feature Engineering and Data Pre-Processing
- As the dataset consists of time-series data, it is re-framed as a single supervised learning dataset: the sales dataset is melted (converted from wide format to long format) and merged with all the other datasets.
- Missing values, which come from the sell-price data, are handled through mean imputation.
- All categorical features are converted by replacing them with their cat codes.
- Feature engineering is used to extract features such as lag features and rolling-median features.
- Only the last 15 months of data (after day 1500) are kept, to make processing faster.
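The melt/merge/encode/lag pipeline above can be sketched as follows, using invented toy frames that mimic the competition schema (the real pipeline also merges sell_prices.csv on `['store_id', 'item_id', 'wm_yr_wk']` and mean-imputes its missing prices):

```python
import pandas as pd

# Toy frames mimicking the competition schema (ids and values invented).
sales_wide = pd.DataFrame({
    "id": ["FOODS_1_001_CA_1", "FOODS_1_002_CA_1"],
    "item_id": ["FOODS_1_001", "FOODS_1_002"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [3, 0], "d_2": [2, 1], "d_3": [4, 2], "d_4": [1, 3],
})
calendar = pd.DataFrame({"d": ["d_1", "d_2", "d_3", "d_4"],
                         "wm_yr_wk": [11101, 11101, 11101, 11101]})

# 1) Melt the wide day columns into one long (row per item-store-day) frame.
long = sales_wide.melt(id_vars=["id", "item_id", "store_id"],
                       var_name="d", value_name="units")

# 2) Merge the calendar information onto the long frame.
long = long.merge(calendar, on="d", how="left")

# 3) Encode categorical features as their cat codes.
for col in ["item_id", "store_id"]:
    long[col] = long[col].astype("category").cat.codes

# 4) Lag and rolling-median features, computed per series.
grp = long.groupby("id")["units"]
long["lag_1"] = grp.shift(1)
long["rmed_2"] = grp.transform(lambda s: s.shift(1).rolling(2).median())
```

The lag of 1 and window of 2 are only illustrative; the actual pipeline uses a range of lags and rolling windows.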
6. Models Approach
6.1 First Cut Models
- Before applying different algorithms to create models, we need to split the data into train, validation, and test sets.
- After training a model and measuring its performance metric, we predict the sales for the unseen data and finally generate a CSV file eligible for Kaggle submission.
- With functions for data splitting and submission-file generation in place, we can try different algorithms and compare the built models on the performance metric.
- As first-cut models, I tried basic regression models: Linear Regression, Ridge Regression, Bayesian Ridge Regression, and Elastic Net.
- Input features excluding the rolling-median features usually performed better.
- Boosting algorithms such as the LightGBM Regressor and CatBoost Regressor performed better than the basic models.
- The final scores received for all the models are as follows:
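The first-cut training loop can be sketched as below, on synthetic data, with two of the basic models shown and a simple time-ordered split standing in for the actual splitting function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the engineered sales
# features; in the real pipeline X holds lag/calendar/price features
# and y the unit sales.
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=300)

# Time-ordered split: earlier rows train, later rows validate,
# so the model never sees the future.
X_tr, X_val = X[:240], X[240:]
y_tr, y_val = y[:240], y[240:]

# Fit each candidate model and record its validation RMSE.
scores = {}
for name, model in [("linear", LinearRegression()), ("ridge", Ridge())]:
    model.fit(X_tr, y_tr)
    scores[name] = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print(scores)
```

The same loop extends naturally to Bayesian Ridge, Elastic Net, LightGBM, and CatBoost; the competition itself is scored on WRMSSE rather than plain RMSE.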
6.2 Kaggle Submissions
6.3 Best Model
Custom Ensemble Model
- Split the dataset into train and test sets.
- Split the train dataset into two equal non-intersecting subsets, D1 and D2.
- Create k subsamples of D1 and build a base model for each subsampled dataset, training each base model on its corresponding subsample of D1.
- Feed D2 as test data to all the base models to make predictions. Horizontally concatenate the predictions of the k base models to create a new dataset, the meta-dataset, with the labels of D2 as its target variable.
- Train a meta-model on this meta-dataset.
- Check the performance of the meta-model on the test set by feeding the test data to the base models and stacking their predictions to form the meta-dataset that is input to the meta-model.
- This combination of base models and meta-model is our custom ensemble model, implemented through a set of custom functions.
The optimal number of base learners is around 10, but we don't know the exact value, so we again search for the optimal number of base learners between 6 and 14.
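The custom ensemble and the search over the number of base learners can be sketched as follows, on synthetic data, with decision trees as stand-in base models and a linear meta-model (the actual learners used in the project may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic regression data standing in for the engineered sales features.
X = rng.normal(size=(600, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.2, size=600)

# Train/test split, then split train into D1 (for the base models) and
# D2 (for building the meta-dataset), as in the steps above.
X_train, X_test, y_train, y_test = X[:480], X[480:], y[:480], y[480:]
X_d1, X_d2 = X_train[:240], X_train[240:]
y_d1, y_d2 = y_train[:240], y_train[240:]

def ensemble_rmse(k):
    """Build k subsampled base models on D1, stack their D2 predictions
    into a meta-dataset, train a meta-model, and score it on the test set."""
    bases = []
    for i in range(k):
        # Row subsample of D1 for this base model.
        idx = rng.choice(len(X_d1), size=len(X_d1), replace=True)
        m = DecisionTreeRegressor(max_depth=4, random_state=i)
        m.fit(X_d1[idx], y_d1[idx])
        bases.append(m)
    # Horizontally concatenate base-model predictions on D2 -> meta-dataset.
    meta_X = np.column_stack([m.predict(X_d2) for m in bases])
    meta = LinearRegression().fit(meta_X, y_d2)
    # Evaluate: test data -> base models -> stacked predictions -> meta-model.
    test_meta_X = np.column_stack([m.predict(X_test) for m in bases])
    return mean_squared_error(y_test, meta.predict(test_meta_X)) ** 0.5

# Search the number of base learners between 6 and 14, as in the post.
scores = {k: ensemble_rmse(k) for k in range(6, 15)}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

On the real data this search is what identified 8 as the best number of base learners; here the winner depends on the synthetic draw.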
7. Kaggle Submission
After training the model with the optimal number of base learners, i.e. 8, I got the best score of 0.65766.
- From the private leaderboard scores above, we observe that of all the models the custom ensemble model performs best, forecasting sales with the lowest WRMSSE of 0.65766 (private score).
- Obtaining the validation-day (1914–1941) and evaluation-day (1942–1969) sales from the final custom ensemble model and submitting them in the given format yields a leaderboard score of 0.65766, which ranks 331 out of 5558 participants, in the top 6%.
- After training the custom ensemble model with the best hyper-parameters, the base models and meta-model are stored in pickle files and deployed to AWS EC2, along with a Flask API built around the final pipeline that takes an ITEM_ID and STORE_ID as input and returns the forecasted sales as a plot.
- An HTML page is created that takes ITEM_ID and STORE_ID as input and shows the plotted forecasted sales of that item for the next 28 days, i.e., May 23rd 2016 to June 19th 2016.
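A minimal sketch of such a Flask endpoint, with a placeholder standing in for the real pickled models (the route name and JSON response format here are assumptions, not the deployed API):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def forecast_28_days(item_id, store_id):
    """Placeholder: the deployed pipeline loads the pickled base models
    and meta-model and runs them through the feature pipeline; this stub
    just returns a dummy 28-day forecast."""
    return [0.0] * 28

@app.route("/forecast")
def forecast():
    # Read ITEM_ID and STORE_ID from the query string.
    item_id = request.args.get("ITEM_ID")
    store_id = request.args.get("STORE_ID")
    return jsonify({"item_id": item_id, "store_id": store_id,
                    "forecast": forecast_28_days(item_id, store_id)})
```

The deployed version returns a plot rather than JSON; the sketch only shows the request/response wiring around the pipeline.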
Deployed application link is here.
8. Future Work:
- Trying a neural network approach, as it gives flexibility in the choice of loss functions.
- Trying LSTM models for time-series forecasting.
The complete code and model implementation are available on my GitHub repository. You can check it out here.
You can connect with me on LinkedIn here