Walmart Sales Forecasting

Aditya Bhosle · Analytics Vidhya · Sep 25, 2019

Simple model averaging can improve the performance and accuracy of a forecasting problem (here, weekly sales) without deep feature engineering.

Introduction

Predicting future sales is one of the most important aspects of a company's strategic planning. Walmart is a great example to work with as a beginner, since it offers one of the largest public retail datasets, and Walmart has also used this sales prediction problem for recruitment purposes.

The data collected ranges from 2010 to 2012 and covers 45 Walmart stores across the country. Each store contains several departments, and we are tasked with predicting the department-wide sales for each store. It is important to note that we also have external data available, such as CPI, the unemployment rate and fuel prices in the region of each store, which will hopefully help us make a more detailed analysis.

Dataset Overview

This dataset is available on the Kaggle website. It contains information about the stores, departments, temperature, unemployment, CPI, IsHoliday, and MarkDowns.

Stores :
Store: The store number, ranging from 1 to 45.
Type: Three types of stores: ‘A’, ‘B’ or ‘C’.
Size: The size of a store, measured by the number of products available in that particular store, ranging from 34,000 to 210,000.

Features:
Temperature: Temperature of the region during that week.
Fuel_Price: Fuel Price in that region during that week.
MarkDown1–5: The type of markdown and the quantity available during that week.
CPI: Consumer Price Index during that week.
Unemployment: The unemployment rate during that week in the region of the store.

Sales:
Date: The date of the week where this observation was taken.
Weekly_Sales: The sales recorded during that Week.
Dept: One of 1–99 that shows the department.
IsHoliday: A Boolean value indicating whether the week is a holiday week.

Preview of the features dataframe

In total we have 421,570 rows for training and 115,064 for testing as part of the competition. However, we will work only with the 421,570 training rows, since they come with labels that let us measure the performance and accuracy of our models.

Data manipulation

  1. Checking for null values
feat.isnull().sum()

As CPI and Unemployment have only a few NaNs, we fill their missing values with the respective column means. Since the MarkDown columns have many more missing values, we impute zeros in their place.

# CPI and Unemployment: fill missing values with the column mean
feat['CPI'] = feat['CPI'].fillna(feat['CPI'].mean())
feat['Unemployment'] = feat['Unemployment'].fillna(feat['Unemployment'].mean())
# MarkDown1-5: impute zeros for missing values
for col in ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']:
    feat[col] = feat[col].fillna(0)

Merging (joining) all features with the training data

# merging all features with the training data
new_data = pd.merge(feat, data, on=['Store','Date','IsHoliday'], how='inner')
# merging all store info into the new training data
final_data = pd.merge(new_data, stores, how='inner', on=['Store'])

As the data is a time series, we sort it by date in ascending order so that models work with the observations in historical order.

Any metric that is measured over regular time intervals forms a time series. Analysis of time series is commercially important because of industrial need and relevance, especially with respect to forecasting.

# sorting data with respect to date
final_data = final_data.sort_values(by='Date')

The dimensions of this merged dataset are (421570, 16).

Exploratory Data Analysis

There are a total of 3 types of stores: Type A, Type B and Type C.
There are 45 stores in total.

# number of stores of each type
grouped = stores.groupby('Type')
print(grouped.count()['Size'])
labels = 'A store', 'B store', 'C store'
sizes = [(22/45)*100, (17/45)*100, (6/45)*100]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # equal aspect ratio ensures the pie is drawn as a circle
plt.show()
pie-chart for the visual representation of store types
# boxplot for sizes of the three store types
store_type = pd.concat([stores['Type'], stores['Size']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='Type', y='Size', data=store_type)
box plot for store type vs Size
  • From the boxplot and pie chart, we can say that type A stores are the largest and type C the smallest.
  • There is no overlap in size ranges among A, B, and C.

Boxplot of weekly sales for the different store types:

store_sale = pd.concat([stores['Type'], data['Weekly_Sales']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='Type', y='Weekly_Sales', data=store_sale, showfliers=False)
box plot for store type vs weekly_sales
  • The median weekly sales of type A stores is the highest and that of type C the lowest, i.e. larger stores tend to have higher sales.

Sales during holiday weeks are a little higher than during non-holiday weeks.

# number of weekly sales records during non-holiday and holiday weeks
print('sales records on non-holidays: ', data[data['IsHoliday']==False]['Weekly_Sales'].count())
print('sales records on holidays: ', data[data['IsHoliday']==True]['Weekly_Sales'].count())
box plot for holiday/non-holiday vs weekly sales

Correlations among features :

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of relationship, the value of the correlation coefficient varies between +1 and -1.

A value of ±1 indicates a perfect degree of association between the two variables. As the correlation coefficient moves towards 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a plus sign indicates a positive relationship and a minus sign a negative relationship. The most commonly used measures are the Pearson correlation, the Kendall rank correlation, and the Spearman correlation. The heatmap below gives an idea of the correlations in our data.

# Plotting correlation between all important features
corr = final_data.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True)
plt.show()

Splitting Date into features

# Add column for year
final_data["Year"] = pd.to_datetime(final_data["Date"], format="%Y-%m-%d").dt.year
final_test_data["Year"] = pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d").dt.year
# Add column for day
final_data["Day"] = pd.to_datetime(final_data["Date"], format="%Y-%m-%d").dt.day
final_test_data["Day"] = pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d").dt.day
# Add column for days to next Christmas
final_data["Days to Next Christmas"] = (pd.to_datetime(final_data["Year"].astype(str)+"-12-31", format="%Y-%m-%d") -
pd.to_datetime(final_data["Date"], format="%Y-%m-%d")).dt.days.astype(int)
final_test_data["Days to Next Christmas"] = (pd.to_datetime(final_test_data["Year"].astype(str) + "-12-31", format="%Y-%m-%d") -
pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d")).dt.days.astype(int)

Splitting Store type into categorical features.

We have 3 types of stores (A, B and C), which are categorical, so we one-hot encode the store Type into separate binary features.

tp = pd.get_dummies(X.Type)
X = pd.concat([X, tp], axis=1)
X = X.drop(columns='Type')

Therefore we have a total of 15 features:
- Store
- Temperature
- Fuel_Price
- CPI
- Unemployment
- Dept
- Size
- IsHoliday
- MarkDown3
- Year
- Day
- Days to Next Christmas
- A, B, C (one-hot encoded store type)

Building train-test set

We split the final data into train and test sets, keeping 80% of the data for training and 20% for testing.

# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Out of 421,570 rows, the training set consists of 337,256 and the test set of 84,314, each with 15 features.

Machine Learning Models

We will use several different models, compare their accuracy, and finally train on the whole dataset to check the score against the Kaggle competition.

Standardizing train and test data :

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

1) KNN Regressor

Out of all the machine learning algorithms I have come across, KNN has easily been the simplest to pick up. KNN can be used for both classification and regression problems. The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set.

from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=10,n_jobs=4)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
scatter plot for predicted values

Accuracy KNNRegressor: 56.78497373157646 %
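
The post reports MAE, RMSE and an accuracy percentage for each model but never shows how they are computed. Below is a minimal sketch, assuming "accuracy" here means the R² score expressed as a percentage; report_metrics is a helper defined only for illustration.

# A minimal sketch of the evaluation metrics, assuming "accuracy" = R^2 * 100
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_metrics(name, y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    acc = r2_score(y_true, y_pred) * 100   # assumed definition of "accuracy"
    print(f'{name}: MAE = {mae:.0f}, RMSE = {rmse:.0f}, Accuracy = {acc:.2f} %')

report_metrics('KNNRegressor', y_test, y_pred)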

2) Decision Tree Regressor

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)

accuracy DTR: 96.20101070234142 %

3) Random Forest Regressor

Random forest is a bagging technique, not a boosting technique; the trees in a random forest are built in parallel. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The number of features considered for splitting at each node is limited to a subset of the total (controlled by the max_features hyperparameter in scikit-learn).

# after hyper-parameter tuning
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=400, max_depth=15, n_jobs=5)
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)

accuracy RandomForestRegressor: 96.56933672047487 %
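
The comment "after hyper-parameter tuning" implies the n_estimators and max_depth values above were found by some search. A sketch of how such a search could be done with GridSearchCV; the parameter grid and scoring choice are assumptions, not the author's exact procedure.

# Illustrative hyperparameter search (grid and scoring are assumptions)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200, 400], 'max_depth': [10, 15, 20]}
search = GridSearchCV(RandomForestRegressor(n_jobs=5), param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)   # e.g. could yield {'max_depth': 15, 'n_estimators': 400}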

4) XGBRegressor

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. XGBRegressor handles sparse data well, and XGBoost uses a distributed weighted quantile sketch algorithm to handle weighted data effectively. For faster computation, XGBoost can make use of multiple CPU cores; this is possible because of the block structure in its system design, where data is sorted and stored in in-memory units called blocks. The main hyperparameters are objective, n_estimators, max_depth, and learning_rate.

from xgboost import XGBRegressor
xgb_clf = XGBRegressor(objective='reg:linear', nthread=4, n_estimators=500, max_depth=6, learning_rate=0.5)
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)

accuracy XGBRegressor: 97.21754267971075 %

5) ExtraTreesRegressor

The Extra-Trees method (standing for extremely randomized trees) was proposed with the main objective of further randomizing tree building in the context of numerical input features, where the choice of the optimal cut-point is responsible for a large proportion of the variance of the induced tree. Compared with random forests, the method drops the idea of using bootstrap copies of the learning sample and, instead of trying to find an optimal cut-point for each of the K randomly chosen features at each node, it selects a cut-point at random.

from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor(n_estimators=30,n_jobs=4)
etr.fit(X_train,y_train)
y_pred=etr.predict(X_test)

Accuracy ExtraTreesRegressor: 96.40934076228986 %

Comparison of all models:

from prettytable import PrettyTable

x = PrettyTable()
x.field_names = ["Model", "MAE", "RMSE", "Accuracy"]
x.add_row(["Linear Regression (Baseline)", 14566, 21767, 8.89])
x.add_row(["KNNRegressor", 8769, 14991, 56.87])
x.add_row(["DecisionTreeRegressor", 2375, 7490, 96.02])
x.add_row(["RandomForestRegressor", 1854, 5785, 96.56])
x.add_row(["ExtraTreeRegressor", 1887, 5684, 96.42])
x.add_row(["XGBRegressor", 2291, 5205,97.23 ])
print(x)
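
The baseline linear regression in the table is not shown elsewhere in the post; a minimal sketch of how it could be fit on the same standardized features (an assumption, not the author's exact code), reusing the illustrative report_metrics helper from earlier:

# Baseline: ordinary least squares on the same standardized features
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
report_metrics('Linear Regression (Baseline)', y_test, lr.predict(X_test))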

Getting averages of best models

The trick is to take the average of the top n best models, where the top n models are chosen by their accuracy and RMSE. Here we take four models, since their accuracies are above 95%: DecisionTreeRegressor, RandomForestRegressor, XGBRegressor and ExtraTreesRegressor.

Note that simply taking the top models does not guarantee they are not overfitting. This can be verified by checking RMSE or MAE (in the case of a classification problem, we can use the confusion matrix), and there should not be a big difference between train and test accuracy (see the sketch after the training code below).

# training top n models
dt = DecisionTreeRegressor(random_state=0)
etr = ExtraTreesRegressor(n_estimators=30,n_jobs=4)
xgb_clf = XGBRegressor(objective='reg:linear', nthread= 4, n_estimators= 500, max_depth= 6, learning_rate= 0.5)
rfr = RandomForestRegressor(n_estimators = 400,max_depth=15,n_jobs=4)
dt.fit(X_train,y_train)
etr.fit(X_train,y_train)
xgb_clf.fit(X_train,y_train)
rfr.fit(X_train,y_train)
# predicting on test data
etr_pred=etr.predict(X_test)
xgb_clf_pred=xgb_clf.predict(X_test)
rfr_pred=rfr.predict(X_test)
dt_pred = dt.predict(X_test)
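
As a quick sanity check on the overfitting concern mentioned above, one could compare train and test scores for each of the top models; a minimal sketch for the random forest (the same check applies to the other three):

# Illustrative overfitting check: a large train/test R^2 gap suggests overfitting
print('RandomForest train R^2:', rfr.score(X_train, y_train))
print('RandomForest test  R^2:', rfr.score(X_test, y_test))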

Getting averages of models :

final = (etr_pred + xgb_clf_pred + rfr_pred + dt_pred)/4.0
prediction of the final model
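
To see whether the averaging actually helps, the averaged predictions can be scored the same way as the individual models, reusing the illustrative report_metrics helper sketched earlier:

# Score the simple average of the four models against the best single model
report_metrics('Average of top 4 models', y_test, final)
report_metrics('XGBRegressor alone', y_test, xgb_clf_pred)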

Conclusion

Here we can see that our RMSE, around 3804, is reduced in comparison to our best-performing single model, the XGBRegressor. Hence we can conclude that taking the average of the top n models helps in reducing the loss.

Since the amount of data available here is small, the difference in loss is not dramatic. But on large datasets (gigabytes or terabytes in size), this simple averaging trick may reduce the loss considerably.

Kaggle Score

Training on the whole dataset (without a train-test split) and testing on the future data provided by Kaggle gives a score in the range of 3000, without much deep feature engineering or rigorous hyperparameter tuning.

Future Work

  • Modifying the date feature into day, month and week features (see the sketch below).
  • The dataset includes special occasions, i.e. Christmas, pre-Christmas, Black Friday, Labour Day, etc. On these days people tend to shop more than on usual days, so adding these as features should also improve accuracy considerably.
  • There is also a missing-value gap between training and test data for two features, CPI and Unemployment. If that gap is reduced, performance can be improved further.
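
As an example of the first point, month and week-of-year columns could be derived from Date in the same way as Year and Day; the column names below are illustrative, not part of the original pipeline.

# Sketch of additional date features (month and ISO week number; pandas >= 1.1)
dates = pd.to_datetime(final_data["Date"], format="%Y-%m-%d")
final_data["Month"] = dates.dt.month
final_data["Week"] = dates.dt.isocalendar().week.astype(int)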


Thank you for your attention and for reading my work.

If you liked this story, share it with your friends and colleagues!
