Forecasting Walmart Sales Using Machine Learning Models

Auggie Heschmeyer
Jun 4, 2019


The purpose of this case study is to show how simple machine learning can make the sales forecasting process. Many models are powerful and flexible enough to be implemented in any industry, but in this study we are going to be forecasting sales for a retail company: Walmart, to be specific. As part of a recent recruiting effort, Walmart shared anonymized weekly sales data for 45 of its stores and asked candidates to forecast future sales. As a retail juggernaut, Walmart seems like a great company to use in a demonstration of machine learning in forecasting.

To be as explicit as we possibly can, let’s state the question that we are seeking to answer: what will be the total monthly sales for the next 12 months with a range to account for error?

Now to get started!

1.0 Import the Data

Before we dive into the data, we need to get it into the system first. Additionally, we want to make sure that we have all of the necessary packages that we’ll use to clean, transform and model the data.

packages <-
  c(
    "caret",
    "forecast",
    "h2o",
    "janitor",
    "lubridate",
    "prophet",
    "readxl",
    "tidyverse",
    "timetk"
  )
lapply(packages, library, character.only = TRUE)

h2o.init()
h2o.no_progress()

stores <- read_csv("/Users/auggieheschmeyer/Downloads/stores data-set.csv")
sales <- read_csv("/Users/auggieheschmeyer/Downloads/sales data-set.csv")

Simple enough!

2.0 Explore the Data

Now that everything is loaded, we need to explore the data to get a sense of its structure and the values contained within.

stores %>% glimpse()

## Observations: 45
## Variables: 3
## $ Store <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Type  <chr> "A", "A", "B", "A", "B", "A", "B", "A", "B", "B", "A", "B"…
## $ Size  <dbl> 151315, 202307, 37392, 205863, 34875, 202505, 70713, 15507…

sales %>% glimpse()

## Observations: 421,570
## Variables: 5
## $ Store        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Dept         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Date         <chr> "05/02/2010", "12/02/2010", "19/02/2010", "26/02/20…
## $ Weekly_Sales <dbl> 24924.50, 46039.49, 41595.55, 19403.54, 21827.90, 2…
## $ IsHoliday    <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
summary(stores)

##      Store         Type                Size
##  Min.   : 1    Length:45          Min.   : 34875
##  1st Qu.:12    Class :character   1st Qu.: 70713
##  Median :23    Mode  :character   Median :126512
##  Mean   :23                       Mean   :130288
##  3rd Qu.:34                       3rd Qu.:202307
##  Max.   :45                       Max.   :219622

summary(sales)

##      Store           Dept            Date            Weekly_Sales
##  Min.   : 1.0   Min.   : 1.00   Length:421570      Min.   : -4989
##  1st Qu.:11.0   1st Qu.:18.00   Class :character   1st Qu.:  2080
##  Median :22.0   Median :37.00   Mode  :character   Median :  7612
##  Mean   :22.2   Mean   :44.26                      Mean   : 15981
##  3rd Qu.:33.0   3rd Qu.:74.00                      3rd Qu.: 20206
##  Max.   :45.0   Max.   :99.00                      Max.   :693099
##   IsHoliday
##  Mode :logical
##  FALSE:391909
##  TRUE :29661

What are we looking at? The “stores” dataset has 45 rows and three columns: one row for each of the 45 stores, with variables describing the store number (1–45), the store type (A or B) and the store size (in square feet).

The “sales” dataset is much larger: 421,570 rows and five columns. That’s because there are about two and a half years’ worth of weekly sales data for each of the 45 stores’ 99 departments. That’s a lot of sales! Also, pay attention to the holiday variable, as it will come into play later; a quick look at how often it fires is below.
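If you’re curious just how many of those weekly observations land on a holiday week, a quick, optional count like this one will tell you. This is only a sanity check on the raw “sales” data, not part of the original workflow.

# Optional sanity check: how many weekly observations fall on holiday weeks?
sales %>%
  count(IsHoliday) %>%
  mutate(share = n / sum(n))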

3.0 Prepare the Data for Modelling

# Join the stores and sales data sets and convert variables to the proper format
all_data <- stores %>%
  inner_join(sales, by = "Store") %>%
  mutate_if(grepl("[0-9]{2}\\/[0-9]{2}\\/[0-9]{4}", .), list( ~ as.Date(., format = "%d/%m/%Y"))) %>%
  mutate_at(c("Type", "Dept"), factor) %>%
  clean_names()

# Create a list of holidays for use in one of the models
holiday_dates <- all_data %>%
  filter(is_holiday == TRUE & date < as.Date("2012-01-01")) %>%
  select(date) %>%
  unique() %>%
  c()

holidays <- tibble(
  holiday = "holiday",
  ds = holiday_dates$date,
  lower_window = 0,
  upper_window = 1
)

# Subset the data to focus just on Store 1, Department 1
store_1_dept_1 <- all_data %>%
  filter(store == 1 & dept == 1) %>%
  select(date, is_holiday, weekly_sales)

# Split the data into training and test sets
train_tbl <- store_1_dept_1 %>% filter(date < as.Date("2012-01-01"))
test_tbl <- store_1_dept_1 %>% filter(date >= as.Date("2012-01-01"))

# Convert the data into a time series object
ts_1_1 <-
  ts(train_tbl$weekly_sales,
     start = c(2010, 5),
     frequency = 52)

In short, the above code combines the “stores” and “sales” datasets, filters the combined dataset down to just Store 1, Department 1 (S1D1) and then converts that data into a time series format (basically just dates and numbers). We’ve filtered the data to just S1D1 because this case study is meant as a demonstration and we want to save ourselves the headache of dealing with that much data.

Additionally, we took the crucial step of dividing the S1D1 dataset into smaller training and test sets. Given that we’re dealing with time-based data, we used a date to split it. Our data starts in February of 2010 and ends in October 2012, so we used January 1st, 2012 as our split point. Every week before that date will be used to train our models and every week afterward will be used to test their effectiveness. A quick sanity check of the split is below.
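To confirm the split did what we intended, a check along these lines (again, just a sanity check rather than part of the modelling) is enough:

# Confirm the training and test sets cover the expected date ranges
range(train_tbl$date)  # should end in late 2011
range(test_tbl$date)   # should start in early 2012
nrow(train_tbl)        # number of training weeks
nrow(test_tbl)         # number of test weeks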

4.0 Fit the Forecasting Models to the Training Data

We’re going to be using four different models to forecast this data: a neural network, Facebook’s Prophet, H2O’s AutoML and a basic linear regression model. Think of this process as trying out different pizza recipes to find the one closest to your little Italian grandmother’s recipe. There are lots of recipes for pizza, but certain recipes are closer to the one we want than others. The same goes for forecasting models. We are going to be “baking” four different “pizzas” to see which one comes closest to the pizza we’re looking for.

Also, be sure to catch our old friend, the holiday variable and dataset, at play in fitting the Prophet and H2O models!

set.seed(111)

# Neural Net
nnet_model <- nnetar(ts_1_1, size = 3, repeats = 1000)

# Prophet
prophet_1_1 <- train_tbl %>%
  select(date, weekly_sales) %>%
  rename("ds" = date,
         "y" = weekly_sales)

prophet_model <-
  prophet(
    prophet_1_1,
    yearly.seasonality = TRUE,
    weekly.seasonality = TRUE,
    holidays = holidays
  )

future <-
  make_future_dataframe(prophet_model, periods = 43, freq = "week")

prophet_forecast <- predict(prophet_model, future)

# H2O
store_1_1_clean <- store_1_dept_1 %>%
  select(date, weekly_sales, is_holiday) %>%
  tk_augment_timeseries_signature() %>%
  select_if(~ !is.Date(.)) %>%
  select_if(~ !any(is.na(.))) %>%
  mutate_if(is.ordered, ~ as.character(.) %>% as.factor)

train_clean <- store_1_1_clean %>% filter(year < 2012)
test_clean <- store_1_1_clean %>% filter(year == 2012)

train_h2o <- as.h2o(train_clean)
test_h2o <- as.h2o(test_clean)

y <- "weekly_sales"
x <- setdiff(names(train_h2o), y)

h2o_model <- h2o.automl(
  x = x,
  y = y,
  training_frame = train_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs = 120,
  stopping_metric = "RMSE"
)

h2o_leader <- h2o_model@leader

# Linear Regression
lm_model <-
  lm(weekly_sales ~ ., data = train_clean[, names(store_1_1_clean) != "wday.lbl"])

5.0 Compare the Models on the Test Data

Now that our pizzas/models are ready, it’s time to start comparing. The first chunk of the code below has each of the four models make its best guess at what the test data will look like without actually seeing that data. Then we line those predictions up next to the actual data and use the second chunk of code to generate a comparison plot.

# Make the predictions
pred_data <- store_1_dept_1 %>%
  filter(year(date) == 2012) %>%
  select(-is_holiday) %>%
  add_column(nnet_pred = forecast(nnet_model, h = 43) %>% as_tibble() %>% pull(`Point Forecast`)) %>%
  add_column(
    prophet_pred = predict(prophet_model, future) %>% as_tibble() %>% filter(year(ds) == 2012) %>% pull(yhat)
  ) %>%
  add_column(h2o_pred = h2o.predict(h2o_leader, test_h2o) %>% as_tibble() %>% pull(predict)) %>%
  add_column(lm_pred = predict(lm_model, test_clean) %>% tibble::enframe(name = NULL) %>% pull(value))

# Plot the actual and predicted values
pred_data %>%
  gather("prediction", "value", -date) %>%
  ggplot(aes(x = date, y = value, color = prediction)) +
  geom_line() +
  scale_x_date(date_breaks = "4 week", date_labels = "%B %d") +
  scale_color_manual(
    values = c(
      "weekly_sales" = "black",
      "h2o_pred" = "#a5d8f3",
      "lm_pred" = "#fdc7d7",
      "nnet_pred" = "#ff9de6",
      "prophet_pred" = "#e8e500"
    )
  ) +
  theme_classic() +
  labs(title = "comparing model fit",
       subtitle = "test data: 2012",
       x = "date",
       y = "weekly sales (in USD)",
       caption = "data: walmart") +
  theme(text = element_text(family = "Futura Medium"),
        plot.title = element_text(hjust = .5),
        plot.subtitle = element_text(hjust = .5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Let’s break down what we’re looking at. The black line with the two big peaks represents the actual weekly sales data from 2012 and each of the colored lines shows the predictions made by each of the four models. As you can see, none of the models seemed to quite capture the spike in sales corresponding with the holidays (thanks a lot, holiday variable!). The neural net model came closest, but was off by a few days and features an ugly dip in between. The other models, while underestimating the peaks, did a nice job of predicting when the peaks would come and go.

In the second half of the chart things smooth out and all of the lines seem to run in tandem. What this tells us is that most of the models do a great job at capturing the trends and variance in weekly sales throughout the rest of the year. But with such close lines, how do we measure which model captures that information best? That’s where this next chunk of code comes in.

# Calculate performance metrics for the four models
forecast_measures <- function(df) {
  matrix <- matrix(nrow = 5, ncol = (ncol(df) - 2))
  for (i in 3:ncol(df)) {
    error <- df[2] - df[i]
    error_pct <- error / df[2]
    matrix[1, i - 2] <- colMeans(error)            # Mean Error
    matrix[2, i - 2] <- colMeans(error ^ 2) ^ 0.5  # Root Mean Square Error
    matrix[3, i - 2] <- colMeans(abs(error))       # Mean Absolute Error
    matrix[4, i - 2] <- colMeans(abs(error_pct))   # Mean Absolute Percent Error
    matrix[5, i - 2] <- colMeans(error_pct)        # Mean Percent Error
  }
  matrix <- round(matrix, 2)
  colnames(matrix) <- colnames(df[, 3:ncol(df)])
  rownames(matrix) <- c(
    "Mean Error",
    "Root Mean Square Error",
    "Mean Absolute Error",
    "Mean Absolute Percent Error",
    "Mean Percent Error"
  )
  print(matrix)
}

forecast_measures(pred_data)
##                             nnet_pred prophet_pred h2o_pred lm_pred
## Mean Error                    1912.17      1606.50   435.81  645.90
## Root Mean Square Error       10361.35      6117.22  5903.16 6650.85
## Mean Absolute Error           6154.12      4069.55  3065.13 3925.26
## Mean Absolute Percent Error      0.26         0.18     0.12    0.17
## Mean Percent Error               0.02         0.05    -0.02   -0.01

What this chunk does is subtract the predictions from the actual values to get the errors for each week, then use those errors to build this little table of performance metrics. Let’s run through them (a compact code version follows the list):

· Mean Error (ME): the average of all of the errors

· Mean Absolute Error (MAE): the average of the absolute values of the errors

· Mean Percent Error (MPE): the average of the errors expressed as a percentage of the actual values

· Mean Absolute Percent Error (MAPE): the average of the absolute values of those percentage errors

· Root Mean Squared Error (RMSE): square every error, take the mean of that and then take the square root of that and you’ve got yourself the RMSE
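For reference, here is a minimal, vectorized sketch of those same five metrics for a single set of predictions. Here actual and pred are placeholder numeric vectors of equal length, not objects from the analysis above.

# Minimal sketch of the five metrics (actual and pred are placeholder vectors)
error <- actual - pred
error_pct <- error / actual

me   <- mean(error)            # Mean Error
mae  <- mean(abs(error))       # Mean Absolute Error
mpe  <- mean(error_pct)        # Mean Percent Error
mape <- mean(abs(error_pct))   # Mean Absolute Percent Error
rmse <- sqrt(mean(error ^ 2))  # Root Mean Square Error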

RMSE is often treated as the go-to performance metric because squaring the errors punishes large misses heavily while keeping the result in the same units as the data. However, because our goal is to put a range of error around our predictions, we’re also going to use MAPE to figure out which model is best.

So which is best?

Based on the table, all of our models did a respectable job: the mean percent errors show that none of them is systematically biased by more than five percent in either direction. But the decision comes down to RMSE and MAPE, and on both counts the H2O model is clearly the best choice. Now, 12% error isn’t incredible, but it’s not bad for a first pass. Plus, as we saw on the graph above, most of that error comes from the model’s underestimation of the holiday peaks. If we were to factor those dates out, the model would be an almost perfect match (a quick check of that claim is sketched below). So, much as we need the real H2O to have a future ourselves, we will need the H2O model to determine Walmart’s sales future.
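To put a rough number on that claim, one way to recompute the H2O model’s MAPE with the holiday weeks dropped is sketched here. This is an extra check, not part of the original analysis; it reuses pred_data and the is_holiday flag from store_1_dept_1.

# Rough check: H2O MAPE on the 2012 test weeks, excluding holiday weeks
pred_data %>%
  left_join(store_1_dept_1 %>% select(date, is_holiday), by = "date") %>%
  filter(!is_holiday) %>%
  summarize(mape_h2o = mean(abs((weekly_sales - h2o_pred) / weekly_sales)))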

6.0 Use the Best Model to Predict the Next Year’s Sales

# Create future dates and prepare them for prediction
future_dates <- data.frame(date = seq.Date(from = as.Date("2012-10-26"), by = "week", length.out = 52))
future_dates <- future_dates %>%
  mutate(is_holiday = ifelse(date %in% as.Date(c("2012-11-23", "2012-12-28", "2013-02-15", "2013-09-13")), TRUE, FALSE))

future_dates_clean <- future_dates %>%
  tk_augment_timeseries_signature() %>%
  select_if(~ !is.Date(.)) %>%
  select_if(~ !any(is.na(.))) %>%
  mutate_if(is.ordered, ~ as.character(.) %>% as.factor)

future_dates_h2o <- as.h2o(future_dates_clean)

# Make the predictions
future_dates <- future_dates %>%
  add_column(weekly_sales = h2o.predict(h2o_leader, future_dates_h2o) %>% as_tibble() %>% pull(predict))

predicted_data <- store_1_dept_1 %>%
  rbind(future_dates) %>%
  mutate(prediction = factor(ifelse(date >= as.Date("2012-10-26"), "predicted", "actual")))

It’s finally time to get an answer to our original question: what will be the total monthly sales for the next 12 months with a range to account for error?

The code above gets us most of the way there: it creates a dataset of future dates (the next 52 weeks, to be exact), turns it into a format H2O can recognize and then has the model make its predictions. The following code gives us a nice visualization of that prediction.

# Plot the predictions
predicted_data %>%
  ggplot(aes(x = date, y = weekly_sales, color = prediction)) +
  geom_line() +
  scale_x_date(date_breaks = "4 months", date_labels = "%B %d, %Y") +
  scale_color_manual(values = c("actual" = "#ff9de6",
                                "predicted" = "#a5d8f3")) +
  theme_classic() +
  labs(
    title = "predicting future sales",
    subtitle = "using the h2o model",
    x = "date",
    y = "weekly sales (in USD)",
    caption = "data: walmart"
  ) +
  theme(
    text = element_text(family = "Futura Medium"),
    plot.title = element_text(hjust = .5),
    plot.subtitle = element_text(hjust = .5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

We can see that the model still isn’t doing a great job predicting the magnitude of those peaks, but it has done a great job at knowing when they will come, when they will go and what happens during the rest of the year. It’s slight, but you can also make out a downward trend relative to previous years in those summer months (a quick check of that appears below). Uh oh!
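If you want to put a number on that summer dip, one rough way (not part of the original workflow) is to total the June-through-August weeks by year, mixing actuals for 2010–2012 with the model’s predictions for 2013:

# Rough check: total summer (June-August) sales by year
predicted_data %>%
  filter(month(date) %in% 6:8) %>%
  group_by(year = year(date)) %>%
  summarize(summer_sales = sum(weekly_sales))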

So we have a nice visual: great. That still doesn’t answer our question about total sales each month. Without further ado, let’s use the following chunk of code to make that happen.

# Calculate range of monthly sales
predicted_data %>%
  filter(prediction == "predicted") %>%
  group_by(month(date)) %>%
  summarize(lower_range = sum(weekly_sales) - (sum(weekly_sales) * .13),
            forecast = sum(weekly_sales),
            upper_range = sum(weekly_sales) + (sum(weekly_sales) * .13)) %>%
  head(12)
## # A tibble: 12 x 4
## `month(date)` lower_range forecast upper_range
## <dbl> <dbl> <dbl> <dbl>
## 1 1 65576. 75375. 85173.
## 2 2 112467. 129272. 146078.
## 3 3 88323. 101521. 114719.
## 4 4 108597. 124824. 141051.
## 5 5 73234. 84177. 95120.
## 6 6 57893. 66543. 75194.
## 7 7 58268. 66974. 75681.
## 8 8 68003. 78164. 88325.
## 9 9 63507. 72996. 82486.
## 10 10 112214. 128982. 145749.
## 11 11 104516. 120134. 135751.
## 12 12 120187. 138146. 156105.

And there you have it. A nice little table showing the month number (January = 1, February = 2, etc.), the lower range of the estimate, the estimate itself and, finally, an upper range. We can clearly see the results of the holiday spikes with sales estimates almost doubling those of other months. What’s more interesting is the almost “holiday season” effect: despite there only being four holidays in the dataset, the months surrounding those dates see a residual boost in sales. This can probably be attributed to promotions before and after the holiday itself.

In Summary

In a matter of a few minutes and a few more lines of code, we were able to turn thousands of data points into a concise and fairly safe estimate of the next 12 months of sales for Walmart’s S1D1. This could easily be transferred to a PowerPoint, a report or even another calculation for use at any level of your organization, retail or otherwise. Even better, we could reuse this code next year by simply updating our dates, making the forecasting process simpler still.

Things we could do to further improve this model and make it even more useful:

· We could calculate the error in the holiday peaks and see if we can create a variable or two that would help the model account for those outliers

· The original Walmart data came with one more dataset called “features,” and we could use those features (store size, discounts, gas price, the Consumer Price Index, etc.) to further refine the model (a rough sketch of how that join might look follows this list)

· We could put this model into production and turn it into an interactive web app that business users within the organization could use to forecast sales for any store or department for however many weeks they wanted
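As an illustration of that second idea, a join against the extra file might look something like the sketch below. The file name and the column names (temperature, fuel_price, cpi, unemployment) are assumptions about how that dataset is laid out, so check them against the actual download before running anything.

# Hypothetical sketch: fold the "features" data into all_data
# (file name and column names are assumptions)
features <- read_csv("/Users/auggieheschmeyer/Downloads/features data-set.csv") %>%
  clean_names() %>%
  mutate(date = as.Date(date, format = "%d/%m/%Y"))

all_data_plus <- all_data %>%
  left_join(features %>% select(store, date, temperature, fuel_price, cpi, unemployment),
            by = c("store", "date"))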

Thank you for reading along and I hope you now better understand the simplicity that machine learning can bring to sales forecasting. I hope to see you in the next case study.
