Retail Analysis with Walmart Data — Part-2

Dhruval Patel · Published in CodeX · May 10, 2022 · 4 min read

Building a Machine Learning model for 45 Walmart stores

Photo by Markus Winkler on Unsplash

It’s very difficult to predict demand for any retail store, as certain events and holidays impact sales each day. We have sales data available for 45 Walmart stores. In Part 1, I covered how to approach the project step by step, and you learned the basic statistical tasks to perform to gain insight. Now, in Part 2, you’ll learn how to tackle unforeseen demand with the help of machine learning algorithms. Let’s get started. Find Part 1 below.

Dataset Description

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions is available. Find my Kaggle notebook here.

In Part 1, we imported the libraries, analyzed the data, and answered the statistical tasks. Now it’s time to build a model and predict the weekly sales.

Most machine learning algorithms do not work well in the presence of outliers, so it is desirable to detect and remove them.

Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can help your results become statistically significant.
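As an illustration, here is a minimal sketch of outlier removal using the common 1.5 × IQR rule. The DataFrame name df and the Weekly_Sales column are assumptions based on the dataset description, not necessarily the notebook’s exact code.

```python
import pandas as pd

# Assumed: df is the Walmart sales DataFrame with a 'Weekly_Sales' column.
q1 = df['Weekly_Sales'].quantile(0.25)
q3 = df['Weekly_Sales'].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles (a common outlier rule).
mask = df['Weekly_Sales'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
```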

Model Building

First, define the dependent and independent variables. Here, store, fuel price, CPI, unemployment, day, month, and year are the independent variables, and weekly sales is the dependent variable. Now it’s time to train the model. Import train_test_split from sklearn.model_selection, train on 80% of the data, and test on the remaining 20%.

We need to standardize the data because we want to bring all the features down to a common scale without distorting the differences in the ranges of the values.

Train Test Split and Standardization
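A minimal sketch of this step is below; the exact column names are assumptions based on the variables described above.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed feature/target column names based on the description above.
X = df[['Store', 'Fuel_Price', 'CPI', 'Unemployment', 'Day', 'Month', 'Year']]
y = df['Weekly_Sales']

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardize features: fit on training data only, then apply to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```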

We have used four different algorithms to determine which model best predicts the weekly sales.

(1) Linear Regression

Linear Regression (Output)
Regression Plot
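A minimal sketch of this step, using the variables from the split-and-scale snippet above; the plotting details are assumptions rather than the notebook’s exact settings.

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit a linear regression and evaluate on the held-out test set.
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print('R2 score:', r2_score(y_test, y_pred))

# Regression plot of actual vs. predicted weekly sales.
sns.regplot(x=y_test, y=y_pred, line_kws={'color': 'red'})
plt.xlabel('Actual weekly sales')
plt.ylabel('Predicted weekly sales')
plt.show()
```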

(2) Random Forest Regressor

Random Forest Regressor (Output)
Regression Plot
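A comparable sketch for the random forest; n_estimators=100 and random_state=0 are assumed hyperparameters, not necessarily the notebook’s.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print('R2 score:', rf.score(X_test, y_test))  # .score() returns R² for regressors
```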

(3) Decision Tree Regressor

Decision Tree Regressor (Output)
Regression Plot
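The same pattern applies to the decision tree; random_state=0 is again an assumption for reproducibility.

```python
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train, y_train)
print('R2 score:', dt.score(X_test, y_test))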

(4) K-Nearest Neighbors

K-Nearest Neighbors (Output)
Regression Plot
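And a sketch for K-Nearest Neighbors; the scikit-learn default of k = 5 is assumed here, and standardization (done above) matters especially for this distance-based model.

```python
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)  # assumed k = 5 (sklearn default)
knn.fit(X_train, y_train)
print('R2 score:', knn.score(X_test, y_test))
```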

I have also done cross-validation, which is a method of assessing ML models that involves training multiple models on subsets of the available input data and evaluating them on the complementary subsets. Cross-validation can be used to detect overfitting, i.e., a failure to generalize a pattern.
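A sketch of how this might be run with scikit-learn’s cross_val_score, assuming the four models from the sketches above; the fold count of 5 is an assumption.

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation for each candidate model (defined above).
models = {'Linear Regression': lr, 'Random Forest': rf,
          'Decision Tree': dt, 'KNN': knn}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})')
```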

Linear Regression CV
Random Forest Regression CV
Decision Tree Regression CV
KNearest Neighbor CV

Here, we have used four different algorithms to determine which model best predicts the weekly sales. Linear Regression is not an appropriate model, as its accuracy is very low. Random Forest Regression, however, gives an accuracy of almost 95%, so it is the best model for forecasting weekly sales.

Thank you for reading! I would appreciate it if you followed me or shared this article with someone. Best wishes.

Your support would be awesome❤️
