Forecasting Walmart Sales with Machine Learning

Anjali Ramesh
7 min read · Jun 22, 2023


In this machine learning project, we use historical Walmart sales data to predict weekly store sales. The dataset comes from the 2014 Kaggle recruiting competition described below.

Walmart is one of the biggest retailers in the world and a go-to store for household shopping. Known for low prices and cost savings across product categories, a visit to one of its physical stores is an experience in itself. The company generates around USD 567 billion in annual sales. Walmart has a Data Science and Analytics department dedicated to improving customer, client, and employee relationships through sales forecasting, product recommendations based on customer buying trends, product/customer segmentation, and several other applications of data science.

Problem Statement

In 2014, Walmart hosted a recruiting competition on Kaggle. Job-seekers were provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales for each store. In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

Understand Business Requirements & Nature of Data

The business problem I am trying to solve is forecasting future sales by training machine learning models on historical data. Sales forecasting helps every business make better decisions: it supports overall business planning, budgeting, and risk management, and it allows companies to allocate resources efficiently for future growth and manage their cash flow. Sales forecasting also helps businesses estimate their costs and revenue more accurately, which in turn lets them project both short-term and long-term performance.

A glimpse into the data

stores.csv

Type and size of 45 stores

features.csv

Data related to the store, department, and regional activity, such as fuel price, consumer price index (CPI), and unemployment rate, for the given dates

train.csv

The historical training data covers the period 2010-02-05 to 2012-11-01.

The target I have to predict is Weekly_Sales, given Store, Dept, Date, and IsHoliday.

Problem Identification

Weekly_Sales is a labeled, continuous numeric target. Hence this is an application of supervised ML; specifically, a regression problem.

Project Outline

  1. Data Preprocessing
  2. Exploratory Data Analysis
  3. Preparing to Train Models
  4. Implement and Train Models
  5. Model Evaluation
  6. Hyperparameter Tuning
  7. Model Selection
  8. Prediction and Analysis

1. Data Preprocessing

In this step, I prepare the historical store sales data by cleaning it, transforming it, and encoding categorical variables as necessary, followed by feature engineering and feature selection.

The following preprocessing tasks have been performed:

  • Merging the stores and features data frames with the train and test data using pandas merge.
  • Converting boolean features to 0s and 1s.
  • Checking for missing/NaN values with train_df.isna().sum().
  • The only features with missing values are MarkDown(1–5), which are anonymized data related to the promotional markdowns Walmart runs. MarkDown data is only available after November 2011, and not for all stores at all times; any missing value is marked NA. I have replaced the NaN values in these columns with 0.
  • The cleaned data frame, train_df, looks like this:
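A minimal sketch of these preprocessing steps, assuming the standard Kaggle file names (the exact notebook code may differ):

    import pandas as pd

    # Load the competition files
    stores = pd.read_csv("stores.csv")
    features = pd.read_csv("features.csv")
    train = pd.read_csv("train.csv")

    # Merge store metadata and regional features onto the training rows
    train_df = train.merge(stores, on="Store", how="left")
    train_df = train_df.merge(features, on=["Store", "Date", "IsHoliday"], how="left")

    # Convert the boolean holiday flag to 0/1
    train_df["IsHoliday"] = train_df["IsHoliday"].astype(int)

    # Check for missing values, then zero-fill the MarkDown columns
    print(train_df.isna().sum())
    markdown_cols = [f"MarkDown{i}" for i in range(1, 6)]
    train_df[markdown_cols] = train_df[markdown_cols].fillna(0)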

Feature Engineering

  • From the Date column, I have extracted DayOfMonth, Month, Year, DayOfWeek, WeekOfYear, and Quarter, and dropped the Date column.
  • Created a new column called MarkDown, which is the sum of all five MarkDown(1–5) columns, after which those five columns were dropped.
  • In the IsHoliday column, I have set the value to 1 for national and federal holidays, since weeks with holidays might have higher sales than weeks without; Walmart may even have offered promotions during holidays. (A sketch of these steps follows the list.)
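A sketch of this feature engineering, continuing from the merged train_df above:

    import pandas as pd

    # Extract calendar features from Date, then drop the original column
    train_df["Date"] = pd.to_datetime(train_df["Date"])
    train_df["DayOfMonth"] = train_df["Date"].dt.day
    train_df["Month"] = train_df["Date"].dt.month
    train_df["Year"] = train_df["Date"].dt.year
    train_df["DayOfWeek"] = train_df["Date"].dt.dayofweek
    train_df["WeekOfYear"] = train_df["Date"].dt.isocalendar().week.astype(int)
    train_df["Quarter"] = train_df["Date"].dt.quarter
    train_df = train_df.drop(columns=["Date"])

    # Sum the five anonymized markdown columns into a single MarkDown feature
    markdown_cols = [f"MarkDown{i}" for i in range(1, 6)]
    train_df["MarkDown"] = train_df[markdown_cols].sum(axis=1)
    train_df = train_df.drop(columns=markdown_cols)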

Summary Statistics

2. Exploratory Data Analysis

Heatmap of the correlation matrix

From this heatmap, Size and Dept appear to have the strongest correlation with Weekly_Sales.
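A plot like this can be produced with seaborn; a sketch assuming the engineered train_df from above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlation between all numeric columns, including Weekly_Sales
    corr = train_df.select_dtypes("number").corr()
    plt.figure(figsize=(12, 9))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Correlation matrix")
    plt.show()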

Weekly Sales vs. Week of Year

Huge spike during the holidays for all three years

Weekly Sales vs. Month of Year

As expected, Months 11 and 12 seem to have the highest recorded weekly sales.

Weekly Sales vs. Day of Month

There is a huge spike around Christmas

Weekly Sales vs. Quarter

As expected, the 4th quarter has the highest recorded average sales

Average Weekly Sales vs. Store

The top 5 stores by average weekly sales are 20, 4, 14, 13, and 2

Weekly Sales vs. Store Type

Store Type A seems to have higher weekly sales, followed by B and then C

Temperature vs. Weekly Sales

High weekly sales seem to be recorded when temperatures are pleasant, in the range of 30–70 °F

Weekly Sales vs IsHoliday

Holiday weeks have more high weekly sales outliers compared to non-holiday weeks

Weekly Sales vs Store Size and Store Type

Type A stores are the largest Walmarts, so it makes sense that they record the highest weekly sales

Weekly Sales vs. Department

The top 5 performing departments are 95, 38, 40, 92, and 90

3. Preparing to Train Models

In this section, I have performed the following tasks (sketched in code after the list):

  • Splitting into train and validation sets (75–25 split)
  • Identifying numeric and categorical columns
  • Identifying input and target columns
  • Imputation, Scaling, and Encoding
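A condensed sketch of these steps with scikit-learn; the split ratio and the imputer/scaler/encoder choices are as described, while the variable names are mine:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Separate inputs from the target
    target_col = "Weekly_Sales"
    input_cols = [c for c in train_df.columns if c != target_col]

    # 75-25 train/validation split
    train_inputs, val_inputs, train_targets, val_targets = train_test_split(
        train_df[input_cols], train_df[target_col], test_size=0.25, random_state=42)

    # Identify numeric and categorical columns
    numeric_cols = train_inputs.select_dtypes("number").columns.tolist()
    categorical_cols = train_inputs.select_dtypes("object").columns.tolist()

    # Fit the imputer, scaler, and encoder on the training split only
    imputer = SimpleImputer(strategy="mean").fit(train_inputs[numeric_cols])
    scaler = MinMaxScaler().fit(imputer.transform(train_inputs[numeric_cols]))
    encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(
        train_inputs[categorical_cols])

    def preprocess(df):
        """Impute and scale numerics, one-hot encode categoricals."""
        num = scaler.transform(imputer.transform(df[numeric_cols]))
        cat = encoder.transform(df[categorical_cols])
        return np.hstack([num, cat])

    X_train, X_val = preprocess(train_inputs), preprocess(val_inputs)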

4. Implement and Train ML models

Baseline ML model — Linear models

Since this is a regression problem, I set the LinearRegression() model as my baseline. I have additionally trained Ridge, Lasso, ElasticNet, and SGDRegressor to comprehensively compare the performance of the linear models offered by scikit-learn.

I have created a function try_linear_models(model) that takes a model as input, trains it on the training data, and returns the training and validation root mean squared errors (RMSE).
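The helper might look roughly like this; a sketch assuming the X_train/X_val matrices from the previous section:

    import numpy as np
    from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                      ElasticNet, SGDRegressor)
    from sklearn.metrics import mean_squared_error

    def try_linear_models(model):
        """Train a linear model and return (train RMSE, validation RMSE)."""
        model.fit(X_train, train_targets)
        train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(X_train)))
        val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(X_val)))
        return train_rmse, val_rmse

    for model in [LinearRegression(), Ridge(), Lasso(), ElasticNet(), SGDRegressor()]:
        print(type(model).__name__, try_linear_models(model))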

Results: [training RMSE, validation RMSE] for each linear model

We see that the linear models return RMSE scores of roughly $20,000 on average. This is very poor performance from our baseline models, since ~$20,000 is around the 75th percentile of weekly sales. The R² score is also very low, so I do not find it necessary to investigate the linear models further.

Ensemble Models

I have trained ensemble models (random forest, gradient boosting, AdaBoost, XGBoost, and LightGBM) with default parameters as an initial check to see which models have better evaluation scores. I have created a function try_ensemble_methods() that takes a model as input and returns the evaluation metrics.
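A sketch of the helper, mirroring try_linear_models() and continuing from the previous sketches:

    from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                                  AdaBoostRegressor)
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor

    def try_ensemble_methods(model):
        """Train an ensemble model with default parameters and return RMSEs."""
        model.fit(X_train, train_targets)
        train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(X_train)))
        val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(X_val)))
        return train_rmse, val_rmse

    for model in [RandomForestRegressor(n_jobs=-1), GradientBoostingRegressor(),
                  AdaBoostRegressor(), XGBRegressor(), LGBMRegressor()]:
        print(type(model).__name__, try_ensemble_methods(model))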

5. Model Evaluation

These are the evaluation scores:

The top three performing models are Random Forest, XGBoost, and LightGBM, so I will go ahead with hyperparameter tuning for these models. AdaBoost performs as poorly as the linear models.

Dept and Size are the two most important features that determine weekly sales
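For context, such a ranking can be read off a fitted tree ensemble; a sketch using the random forest and the preprocessing objects from above:

    import pandas as pd

    # Rank features by impurity-based importance from a fitted random forest
    rf = RandomForestRegressor(n_jobs=-1, random_state=42).fit(X_train, train_targets)
    feature_names = numeric_cols + list(encoder.get_feature_names_out(categorical_cols))
    importance = pd.Series(rf.feature_importances_, index=feature_names)
    print(importance.sort_values(ascending=False).head(10))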

6. Hyperparameter Tuning

The hyperparameter tuning is evaluated based on weighted mean absolute error (WMAE), the competition's metric: holiday weeks carry a weight of 5 and all other weeks a weight of 1.
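A direct implementation of the metric:

    import numpy as np

    def wmae(y_true, y_pred, is_holiday):
        """Weighted MAE: holiday weeks get weight 5, all other weeks weight 1."""
        weights = np.where(is_holiday, 5, 1)
        return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)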

I created a dictionary called models with the hyperparameters and values to test.

The functions test_params() and test_param_and_plot() train the models for each of these hyperparameter values and return the evaluation scores for each run.
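A sketch of what test_param_and_plot() might look like, assuming the wmae() helper above and the imports from the earlier sketches; the actual notebook code may differ:

    import matplotlib.pyplot as plt

    # IsHoliday flags for each split (converted to 0/1 during preprocessing)
    train_holiday = train_inputs["IsHoliday"].values
    val_holiday = val_inputs["IsHoliday"].values

    def test_param_and_plot(model_class, param_name, param_values, **fixed_params):
        """Train one model per parameter value and plot train/validation WMAE."""
        train_errors, val_errors = [], []
        for value in param_values:
            model = model_class(**{param_name: value}, **fixed_params)
            model.fit(X_train, train_targets)
            train_errors.append(wmae(train_targets, model.predict(X_train), train_holiday))
            val_errors.append(wmae(val_targets, model.predict(X_val), val_holiday))
        plt.plot(param_values, train_errors, "b-o", label="training")
        plt.plot(param_values, val_errors, "r-o", label="validation")
        plt.xlabel(param_name)
        plt.ylabel("WMAE")
        plt.legend()
        plt.show()

    # Example: overfitting curve for random forest max_depth
    test_param_and_plot(RandomForestRegressor, "max_depth", [5, 10, 15, 20, 25])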

The overfitting curves for random forest hyperparameter tuning are —

tuning for max_depth
tuning for n_estimators
tuning for min_samples_split

The overfitting curves for XGBoost hyperparameter tuning are —

tuning for max_depth
tuning for n_estimators
tuning for learning_rate

The overfitting curves for LightGBM hyperparameter tuning are —

tuning for max_depth
tuning for n_estimators
tuning for learning rate

I find that after hyperparameter tuning, the model that works best with this data is Random Forest.

7. Model Selection

From hyperparameter tuning, random forest seems to work best: it provides the best validation scores, around $2,000 WMAE. Further, I used GridSearchCV to obtain the best-fit random forest model.

Validation WMAE: $1985
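The grid search might be set up along these lines; a sketch in which the parameter grid is illustrative and plain MAE stands in for WMAE, since WMAE is not a built-in scikit-learn scorer:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = {"n_estimators": [100, 200], "max_depth": [20, 25, 30]}
    grid = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1),
                        param_grid, cv=3, scoring="neg_mean_absolute_error")
    grid.fit(X_train, train_targets)
    print(grid.best_params_, -grid.best_score_)
    best_model = grid.best_estimator_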

8. Test Predictions and Submission

Across the different submissions, Random Forest is the best performing model, with n_estimators = 200 and max_depth = 25. A public score of ~4000 puts me in the top 50% of the competition submissions.
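Generating the submission file looks roughly like this; the Id format Store_Dept_Date is the competition's required format, and X_test is assumed to be the test set passed through the same preprocessing as the training data:

    import pandas as pd

    # Assumes test.csv has been merged, feature-engineered, and transformed
    # exactly like the training data, yielding X_test
    test = pd.read_csv("test.csv")

    test_preds = best_model.predict(X_test)
    submission = pd.DataFrame({
        "Id": test["Store"].astype(str) + "_" + test["Dept"].astype(str)
              + "_" + test["Date"],
        "Weekly_Sales": test_preds,
    })
    submission.to_csv("submission.csv", index=False)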


Am I satisfied/confident with this model? NO. Did I learn how to implement an end-to-end ML project? YES.

9. My final thoughts

Better sales forecasting can be achieved by employing time-series modeling techniques like ARIMA; the top submissions on Kaggle all employed some form of time-series technique, and this project could be improved by doing the same. Applications of time-series forecasting include web traffic forecasting, sales and demand forecasting, weather prediction, and stock price forecasting, among others. It is an important branch of machine learning, and more and more companies employ time-series modeling in some form. It is therefore crucial for a budding data scientist to be familiar with this concept.

10. Inspiration

I would like to thank jovian.ai, especially Aakash, for providing excellent learning materials as part of the data science and machine learning bootcamp. Most of the code is inspired by the bootcamp tutorials, and ChatGPT of course.


Anjali Ramesh

A data scientist with a diverse background in mathematics, applied mathematics, quantum physics, astrophysics, education, dance, and the sky is the limit