Forecasting Walmart Sales with Machine Learning
In this machine learning project, we use historical Walmart sales data to predict store sales. The dataset comes from the Walmart Recruiting - Store Sales Forecasting competition on Kaggle.
Walmart is one of the world's largest retailers and a go-to store for household shopping. Known for its low prices and cost savings across product categories, a visit to one of its physical stores is an experience in itself. The company generates roughly USD 567 billion in sales volume. Walmart has a Data Science and Analytics department dedicated to improving the customer experience by forecasting sales, recommending products based on customer buying trends, segmenting products and customers, and several other data science use cases.
Problem Statement
In 2014, Walmart hosted a recruiting competition on Kaggle. Job-seekers were provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales for each store. In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
Understand Business Requirements & Nature of Data
The business problem I am trying to solve is to forecast future sales by training machine learning models on historical data. Sales forecasting helps every business make better decisions: it supports overall planning, budgeting, and risk management; it allows companies to allocate resources efficiently and manage their cash flow; and it helps businesses estimate their costs and revenue accurately, so they can project both short-term and long-term performance.
A glimpse into the data
Type and size of 45 stores
Data related to the store, department, and regional activity like fuel price, customer price index, unemployment rate, etc. for the given dates
The historical training data covers dates from 2010-02-05 to 2012-11-01.
The target to predict is Weekly_Sales, given Store ID, Dept ID, Date, and IsHoliday.
Problem Identification
Weekly_Sales is a labeled, continuous numeric target. Hence this is an application of supervised ML; specifically, a regression problem.
Project Outline
- Data Preprocessing
- Exploratory Data Analysis
- Preparing to Train Models
- Implement and Train Models
- Model Evaluation
- Hyperparameter Tuning
- Model Selection
- Prediction and Analysis
1. Data Preprocessing
Prepare the historical store sales data by cleaning, transforming, and encoding categorical variables as necessary, then perform feature engineering and feature selection.
The following preprocessing tasks have been performed:
- Merging stores and features data frames with train and test data with pandas merge.
- Converting boolean features to 0s and 1s
- Checking for missing/nan values
- The only features with missing values are MarkDown(1–5) which are anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011 and is not available for all stores all the time. Any missing value is marked with an NA. I have replaced the nan values in these columns with 0.
- The cleaned data looks like this —
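The merging and cleaning steps above can be sketched as follows. This is a minimal illustration on a tiny hand-made frame standing in for the actual Kaggle files (train.csv, stores.csv, features.csv); the column names match the competition data, but the values are made up.

```python
import pandas as pd

# Tiny stand-ins for the real train, stores, and features files
train = pd.DataFrame({
    "Store": [1, 1], "Dept": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "Weekly_Sales": [24924.50, 50605.27],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})
features = pd.DataFrame({
    "Store": [1], "Date": ["2010-02-05"],
    "Fuel_Price": [2.572], "MarkDown1": [None], "IsHoliday": [False],
})

# Merge store metadata and regional features onto the training rows
merged = train.merge(stores, on="Store", how="left")
merged = merged.merge(features.drop(columns="IsHoliday"),
                      on=["Store", "Date"], how="left")

# Convert the boolean holiday flag to 0/1 and fill missing markdowns with 0
merged["IsHoliday"] = merged["IsHoliday"].astype(int)
merged["MarkDown1"] = merged["MarkDown1"].fillna(0)
```

Dropping IsHoliday from the features frame before the second merge avoids a duplicated column, since the same flag is already present in the training data.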
Feature Engineering
- From the Date column, I have extracted DayOfMonth, Month, Year, DayOfWeek, WeekOfYear, and Quarter and dropped the Date column
- Created a new column called MarkDown which is the sum of all MarkDown(1–5) columns after which those 5 columns were dropped
- In the IsHoliday column, I have changed the value to 1 for national and federal holidays as weeks with holidays might have higher sales than the weeks without holidays. Walmart could have even offered promotions during holidays.
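The date decomposition and markdown aggregation described above can be sketched like this (a minimal example on a two-row frame; the real data has the same column names):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-12-31"]),
    "MarkDown1": [100.0, 0.0], "MarkDown2": [50.0, 0.0],
    "MarkDown3": [0.0, 0.0], "MarkDown4": [0.0, 25.0], "MarkDown5": [0.0, 0.0],
})

# Extract calendar components, then drop the raw Date column
df["DayOfMonth"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df["DayOfWeek"] = df["Date"].dt.dayofweek
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)
df["Quarter"] = df["Date"].dt.quarter
df = df.drop(columns="Date")

# Collapse the five anonymized markdown columns into a single total
md_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]
df["MarkDown"] = df[md_cols].sum(axis=1)
df = df.drop(columns=md_cols)
```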
Summary Statistics
2. Exploratory Data Analysis
Heatmap of the correlation matrix
Weekly Sales vs. Week of Year
Weekly Sales vs. Month of Year
Weekly Sales vs. Day of Month
Weekly Sales vs. Quarter
Average Weekly Sales vs. Store
Weekly Sales vs. Store Type
Temperature vs. Weekly Sales
Weekly Sales vs IsHoliday
Weekly Sales vs Store Size and Store Type
Weekly Sales vs. Department
3. Preparing to Train Models
In this section, I have performed the following tasks
- Splitting into train and validation sets (75–25 split)
- Identifying numeric and categorical columns
- Identifying input and target columns
- Imputation, Scaling, and Encoding
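The steps above can be sketched as follows. This is a toy version on a four-row frame (column names assumed from the Walmart data); the real pipeline applies the same split, imputer, scaler, and encoder to the full feature set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame standing in for the merged Walmart data
df = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Size": [151315, 202307, 37392, 205863],
    "Type": ["A", "B", "A", "B"],
    "Weekly_Sales": [24924.5, 50605.3, 13740.1, 39954.0],
})

# Identify input and target columns
input_cols = ["Store", "Size", "Type"]
target_col = "Weekly_Sales"

# 75-25 split into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.25, random_state=42)

# Identify numeric and categorical columns
numeric_cols = train_df[input_cols].select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_df[input_cols].select_dtypes(include="object").columns.tolist()

# Fit the imputer, scaler, and encoder on the training split only
imputer = SimpleImputer(strategy="mean").fit(train_df[numeric_cols])
scaler = MinMaxScaler().fit(train_df[numeric_cols])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_df[categorical_cols])
```

Fitting the transformers on the training split alone avoids leaking validation statistics into the model.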
4. Implement and Train ML models
Baseline ML model — Linear models
Since this is a regression problem, I set the LinearRegression() model as my baseline. I additionally trained Ridge, Lasso, ElasticNet, and SGDRegressor in order to conduct a comprehensive comparison of the linear models offered by scikit-learn.
I have created a function try_linear_models(model) that takes in the model as input, performs training on training data, and returns the training and validation root mean squared errors.
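A minimal sketch of what try_linear_models could look like, using synthetic regression data in place of the preprocessed Walmart features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed inputs and Weekly_Sales target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def try_linear_models(model):
    """Fit a linear model; return training and validation RMSE."""
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return train_rmse, val_rmse

# Compare several linear models with the same helper
for m in (LinearRegression(), Ridge(), Lasso()):
    train_rmse, val_rmse = try_linear_models(m)
```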
We see that the linear models return RMSE scores of roughly $20,000 on average. This is very poor performance for our baseline models, since ~$20,000 is around the 75th percentile of weekly sales. The R² scores are also very low, so I did not find it necessary to investigate the linear models further.
Ensemble Models
I have trained ensemble models like random forest, gradient boosting, adaboost, XGBoost, and LightGBM with default parameters as an initial check to see which models have better evaluation score. I have created a function try_ensemble_methods() which takes in the model as input and returns the evaluation metrics.
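A sketch of the helper, again on synthetic data. XGBoost's XGBRegressor and LightGBM's LGBMRegressor would slot in the same way if those libraries are installed; only the scikit-learn ensembles are shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def try_ensemble_methods(model):
    """Fit an ensemble model with its given parameters; return train/val RMSE."""
    model.fit(X_train, y_train)
    return (np.sqrt(mean_squared_error(y_train, model.predict(X_train))),
            np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

# Initial check with default (or near-default) parameters
for m in (RandomForestRegressor(n_estimators=10, random_state=0),
          GradientBoostingRegressor(random_state=0),
          AdaBoostRegressor(random_state=0)):
    train_rmse, val_rmse = try_ensemble_methods(m)
```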
5. Model Evaluation
Here are the evaluation scores —
The top three performing models are Random Forest, XGBoost, and LightGBM, so I proceeded with hyperparameter tuning for these models. AdaBoost performs as poorly as the linear models.
6. Hyperparameter Tuning
Hyperparameter tuning is evaluated using the weighted mean absolute error (WMAE), the competition's metric, in which the weeks including the four major holidays carry a weight of 5 and all other weeks a weight of 1.
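As a sketch, WMAE can be computed with a few lines of NumPy (assuming a boolean holiday flag per row):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: holiday weeks weight 5, other weeks weight 1."""
    weights = np.where(is_holiday, 5, 1)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)
```

For example, two predictions each off by 10, one in a holiday week, give (1*10 + 5*10) / (1 + 5) = 10.0.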
I created a dictionary called models mapping each hyperparameter to the values to test.
Functions test_params() and test_param_and_plot() trained the models for each of these hyperparameters and returned the evaluation scores for each run.
The overfitting curves for random forest hyperparameter tuning are —
The overfitting curves for XGBoost hyperparameter tuning are —
The overfitting curves for LightGBM hyperparameter tuning are —
I find that after hyperparameter tuning, the model that works best with this data is Random Forest.
7. Model Selection
From hyperparameter tuning, random forest works best, with validation WMAE scores around 2000. I then used GridSearchCV to obtain the best-fit random forest model.
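A minimal sketch of such a grid search, on synthetic data and with a deliberately tiny grid (the actual search covered values such as n_estimators = 200 and max_depth = 25):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=2)

# Small illustrative grid; widen the value lists for a real search
param_grid = {"n_estimators": [10, 25], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=2),
    param_grid,
    scoring="neg_mean_absolute_error",  # GridSearchCV maximizes, so errors are negated
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_
```

search.best_params_ reports the winning combination, and best_estimator_ is already refit on the full training data.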
The best validation WMAE achieved was ~1985.
8. Test Predictions and Submission
Across the different submissions, Random Forest with n_estimators = 200 and max_depth = 25 is the best performing model. A public score of ~4000 puts me in the top 50% of the competition submissions.
Am I satisfied/confident with this model? NO. Did I learn how to implement an end-to-end ML project? YES.
9. My final thoughts
Better sales forecasting can be achieved with time-series modeling techniques such as ARIMA; the top submissions on Kaggle all employed some form of time-series technique, and this project could be improved by doing the same. Applications of time-series forecasting include web traffic forecasting, sales and demand forecasting, weather prediction, and stock price forecasting, among others. It is an important branch of machine learning, and more and more companies employ time-series modeling in some form, so it is crucial for a budding data scientist to be familiar with it.
10. Inspiration
I would like to thank jovian.ai, especially Aakash, for providing excellent learning materials as part of the data science and machine learning bootcamp. Most of the code is inspired by the bootcamp tutorials, and ChatGPT of course.