Walmart — Store Sales Forecasting

Sergio Alves
5 min read · Aug 7, 2021


Forecasting based on Decision Trees & Random Forest

For this Machine Learning project, we will use the “Walmart Recruiting — Store Sales Forecasting” dataset, from Kaggle.

The goal is to predict the Weekly Sales for specific stores, departments and dates.

Download Data

First, we install the opendatasets library.
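In a notebook cell, something like:

    !pip install opendatasets --upgrade --quiet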

Now, we import some libraries that we will use.
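A reasonable set of imports for the steps below (the exact list in the original notebook may differ):

    import os
    import numpy as np
    import pandas as pd
    import opendatasets as od
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
    from sklearn.tree import DecisionTreeRegressor, plot_tree, export_text
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error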

Here we download the datasets from Kaggle.
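For example, using the competition URL (opendatasets will prompt for your Kaggle username and API key):

    dataset_url = 'https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting'
    od.download(dataset_url)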

Let’s review the downloaded files.
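Assuming opendatasets placed the files in a folder named after the competition:

    data_dir = './walmart-recruiting-store-sales-forecasting'
    os.listdir(data_dir)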

stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv: This is the historical training data, which covers 2010–02–05 to 2012–11–01. Within this file we will find the following fields:

  • Store — the store number
  • Dept — the department number
  • Date — the week
  • Weekly_Sales — sales for the given department in the given store
  • IsHoliday — whether the week is a special holiday week

test.csv: This file is identical to train.csv, except we have withheld the weekly sales. We must predict the sales for each triplet of store, department, and date in this file.

features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store — the store number.
  • Date — the week.
  • Temperature — average temperature in the region.
  • Fuel_Price — cost of fuel in the region.
  • MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI — the consumer price index.
  • Unemployment — the unemployment rate.
  • IsHoliday — whether the week is a special holiday week.

Now, we get the zip files and create the datasets that we will use.
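pandas can read the zipped CSVs directly, so this step could look roughly like the following (file names assumed from the competition page):

    train_df = pd.read_csv(os.path.join(data_dir, 'train.csv.zip'))
    test_df = pd.read_csv(os.path.join(data_dir, 'test.csv.zip'))
    stores_df = pd.read_csv(os.path.join(data_dir, 'stores.csv'))
    features_df = pd.read_csv(os.path.join(data_dir, 'features.csv.zip'))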

Before starting to work with the data, we can merge the train, stores and features files in order to increase the number of input variables.
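A minimal sketch of the merge, joining on the keys the files share:

    merged_df = train_df.merge(stores_df, how='left', on='Store') \
                        .merge(features_df, how='left', on=['Store', 'Date', 'IsHoliday'])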

Data exploration

Let's have a look at our dataset.
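For instance:

    merged_df.head()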

We can identify the following input variables:

  • Store
  • Dept
  • Date
  • IsHoliday
  • Type
  • Size
  • Temperature
  • Fuel_Price
  • MarkDown1
  • MarkDown2
  • MarkDown3
  • MarkDown4
  • MarkDown5
  • CPI
  • Unemployment

The target variable is Weekly_Sales.

Now let’s review some measures of central tendency.
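For the numeric columns, describe gives a quick summary:

    merged_df.describe()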

Null Values

There are some null values; let's review them.
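A quick count per column:

    merged_df.isna().sum()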

We will drop the null values in the Data Manipulation section.

Input Variables Correlation with the output feature Weekly_Sales
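One way to compute these correlations (the original likely showed them as a heatmap; numeric_only requires a recent pandas):

    merged_df.corr(numeric_only=True)['Weekly_Sales'].sort_values(ascending=False)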

Looking at the correlation matrix, we can see that Weekly_Sales has a higher correlation with Store, Dept and Size.

We will drop the variables with lower correlation in the Data Manipulation section.

Data Manipulation

Now, we will do the following steps (sketched in the code after the list):

  • Remove null values from the markdown variables.
  • Create variables for year, month and week, based on the date field.
  • Remove the variables with low correlation.
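A sketch of these three steps; the fill strategy and the set of dropped columns are assumptions based on the correlations discussed above:

    # 1. remove nulls from the markdown columns (filled with 0 here, i.e. "no markdown")
    markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
    merged_df[markdown_cols] = merged_df[markdown_cols].fillna(0)

    # 2. derive year, month and week from the Date column
    merged_df['Date'] = pd.to_datetime(merged_df['Date'])
    merged_df['Year'] = merged_df['Date'].dt.year
    merged_df['Month'] = merged_df['Date'].dt.month
    merged_df['Week'] = merged_df['Date'].dt.isocalendar().week.astype(int)

    # 3. drop the low-correlation variables and the raw Date column
    low_corr_cols = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
    merged_df = merged_df.drop(columns=low_corr_cols + ['Date'])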

We can move the target variable to the last column of the dataframe to ease the manipulation of the data.

Here, we identify inputs and target columns.
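Roughly:

    # move the target to the last column, then separate inputs and target
    merged_df = merged_df[[c for c in merged_df.columns if c != 'Weekly_Sales'] + ['Weekly_Sales']]
    input_cols = list(merged_df.columns)[:-1]
    target_col = 'Weekly_Sales'

    inputs_df = merged_df[input_cols].copy()
    targets = merged_df[target_col].copy()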

Now, we identify numeric and categorical columns.
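Using the column dtypes:

    numeric_cols = inputs_df.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = inputs_df.select_dtypes(include=['object', 'bool']).columns.tolist()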

Here, we impute (fill) and scale numeric columns.
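With scikit-learn this could be done as follows (mean imputation and min-max scaling are assumptions):

    imputer = SimpleImputer(strategy='mean').fit(inputs_df[numeric_cols])
    inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

    scaler = MinMaxScaler().fit(inputs_df[numeric_cols])
    inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])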

We can only use numeric data to train our models, which is why we have to use a technique called “one-hot encoding” for our categorical columns.

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
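A sketch using scikit-learn's OneHotEncoder (on older versions use sparse=False instead of sparse_output=False):

    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(inputs_df[categorical_cols])
    encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
    inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])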

Finally, let's split the dataset into a training and validation set. We'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.
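For example:

    train_inputs, val_inputs, train_targets, val_targets = train_test_split(
        inputs_df[numeric_cols + encoded_cols], targets, test_size=0.25, random_state=42)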

Models

Now that we have our training and validation set, we will review two models of machine learning:

  • Decision Tree
  • Random Forest

Based on the results, we will pick one of them.

Decision Tree

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

To create our decision tree model, we can use the DecisionTreeRegressor class.

Now, we fit our model to the training data.
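A minimal sketch of both steps:

    tree = DecisionTreeRegressor(random_state=42)
    tree.fit(train_inputs, train_targets)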

We generate predictions on the training and validation sets using the trained decision tree, and compute the Root Mean Squared Error (RMSE) loss.
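Something along these lines:

    def rmse(targets, preds):
        # root mean squared error
        return np.sqrt(mean_squared_error(targets, preds))

    tree_train_rmse = rmse(train_targets, tree.predict(train_inputs))
    tree_val_rmse = rmse(val_targets, tree.predict(val_inputs))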

Here, we can see that the RMSE loss for our training data is 6.045479193458183e-20, while the RMSE loss for our validation data is 5441.340994336662. A training error this close to zero means the tree has essentially memorized the training data, so the much larger validation error is a sign of overfitting.

Decision tree visualization

Let’s visualize the tree graphically using plot_tree.
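For example, limiting the plot to the first couple of levels to keep it readable:

    plt.figure(figsize=(20, 10))
    plot_tree(tree, feature_names=list(train_inputs.columns), max_depth=2, filled=True)
    plt.show()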

Now, let's visualize the tree textually using export_text.

Here we can display the first few lines.
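Roughly:

    tree_text = export_text(tree, feature_names=list(train_inputs.columns))
    print('\n'.join(tree_text.splitlines()[:15]))  # first few lines only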

Decision Tree feature importance

Let’s look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.
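The fitted tree exposes this through feature_importances_:

    importance_df = pd.DataFrame({
        'feature': train_inputs.columns,
        'importance': tree.feature_importances_
    }).sort_values('importance', ascending=False)
    importance_df.head(10)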

The variables Dept, Size, Store and Week are the most important for this model.

Random Forests

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.

For classification tasks, the output of the random forest is the class selected by most trees.

For regression tasks, the mean or average prediction of the individual trees is returned.

To create our random forest model, we can use the RandomForestRegressor class.

When I created the random forest with the default number of estimators (100), the Jupyter notebook crashed due to a lack of memory, so let's start with 10 estimators.

Now, we fit our model to the training data.
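A sketch of the model creation and training:

    rf = RandomForestRegressor(n_estimators=10, random_state=42, n_jobs=-1)
    rf.fit(train_inputs, train_targets)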

Now we generate predictions on the training and validation sets using the trained random forest, and compute the Root Mean Squared Error (RMSE) loss.
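Reusing the rmse helper from before:

    rf_train_rmse = rmse(train_targets, rf.predict(train_inputs))
    rf_val_rmse = rmse(val_targets, rf.predict(val_inputs))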

Here, we can see that the RMSE loss for our training data is 1620.993367981347, and the RMSE loss for our validation data is 3997.6712441772224.

The random forest model shows better results for the validation RMSE, so we will use that model.

Hyperparameter Tuning

Let's define a helper function test_params which can test given values of one or more hyperparameters.
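A possible version of that helper, which trains a fresh random forest with the supplied keyword arguments and returns the training and validation RMSE:

    def test_params(**params):
        model = RandomForestRegressor(random_state=42, n_jobs=-1, **params)
        model.fit(train_inputs, train_targets)
        return (rmse(train_targets, model.predict(train_inputs)),
                rmse(val_targets, model.predict(val_inputs)))

It can then be called like test_params(n_estimators=16) or test_params(max_depth=20, max_leaf_nodes=1024).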

For this new random forest model, I will use 16 estimators.

Let’s also define a helper function to test and plot different values of a single parameter.
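For instance (the value ranges tried below are illustrative):

    def test_param_and_plot(param_name, param_values):
        # train one model per value and collect the errors
        train_errors, val_errors = [], []
        for value in param_values:
            train_rmse, val_rmse = test_params(**{param_name: value})
            train_errors.append(train_rmse)
            val_errors.append(val_rmse)
        plt.figure(figsize=(10, 6))
        plt.title('Overfitting curve: ' + param_name)
        plt.plot(param_values, train_errors, 'b-o', label='Training')
        plt.plot(param_values, val_errors, 'r-o', label='Validation')
        plt.xlabel(param_name)
        plt.ylabel('RMSE')
        plt.legend()

    test_param_and_plot('n_estimators', [10, 16, 25, 40])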

We can see better results with a higher number of estimators.

Here, we can see how the RMSE increases with the min_samples_leaf parameter, so we will use the default value (1).

The RMSE decreases with the max_leaf_nodes parameter, so we will use the default value (None).

The RMSE decreases with the max_depth parameter, so we will use the default value (None).

Training the Best Model

We create a new Random Forest model with custom hyperparameters.

Now, we train the model.
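Using the number of estimators picked above (the remaining hyperparameters are assumed to stay at their defaults, as discussed in the tuning section):

    final_model = RandomForestRegressor(n_estimators=16, random_state=42, n_jobs=-1)
    final_model.fit(train_inputs, train_targets)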

Now, we generate predictions for the final model.

Here, we can see a decrease in the RMSE loss.

Random Forest Feature Importance

Let’s look at the weights assigned to different columns, to figure out which columns in the dataset are the most important for this model.


The variables Dept, Size, Store and Week are the most important for this model.

Making Predictions on the Test Set

Let’s make predictions on the test set provided with the data.

First, we need to reapply all the preprocessing steps.
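A sketch that mirrors the earlier pipeline, reusing the fitted imputer, scaler and encoder rather than refitting them:

    test_merged = test_df.merge(stores_df, how='left', on='Store') \
                         .merge(features_df, how='left', on=['Store', 'Date', 'IsHoliday'])
    test_merged['Date'] = pd.to_datetime(test_merged['Date'])
    test_merged['Year'] = test_merged['Date'].dt.year
    test_merged['Month'] = test_merged['Date'].dt.month
    test_merged['Week'] = test_merged['Date'].dt.isocalendar().week.astype(int)

    test_inputs = test_merged[input_cols].copy()
    test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
    test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])
    test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])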

We can now make predictions using our final model.

Let’s replace the values of the Weekly_Sales column with our predictions.

Let’s save it as a CSV file and download it.
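Assuming the competition's sample submission file is used as the template:

    test_preds = final_model.predict(test_inputs[numeric_cols + encoded_cols])

    submission_df = pd.read_csv(os.path.join(data_dir, 'sampleSubmission.csv.zip'))
    submission_df['Weekly_Sales'] = test_preds
    submission_df.to_csv('submission.csv', index=False)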

Making Predictions on Single Inputs
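A small helper for this step could look like the following; the helper name and the example values are illustrative:

    def predict_input(model, single_input):
        # single_input is a dict with the same raw columns as the training inputs
        input_df = pd.DataFrame([single_input])
        input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
        input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
        input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
        return model.predict(input_df[numeric_cols + encoded_cols])[0]

    sample_input = {'Store': 1, 'Dept': 1, 'IsHoliday': False, 'Type': 'A', 'Size': 151315,
                    'MarkDown1': 0, 'MarkDown2': 0, 'MarkDown3': 0, 'MarkDown4': 0, 'MarkDown5': 0,
                    'Year': 2012, 'Month': 11, 'Week': 44}
    predict_input(final_model, sample_input)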

Saving the Model
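One option is joblib, saving the model together with the preprocessing objects so that new data can be transformed the same way:

    import joblib

    walmart_model = {
        'model': final_model, 'imputer': imputer, 'scaler': scaler, 'encoder': encoder,
        'input_cols': input_cols, 'target_col': target_col,
        'numeric_cols': numeric_cols, 'categorical_cols': categorical_cols,
        'encoded_cols': encoded_cols
    }
    joblib.dump(walmart_model, 'walmart_rf.joblib')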

References

Check out the following resources to learn more:
