Walmart Store Sales Forecasting

Using Facebook Prophet and various machine learning (regression) algorithms for sales forecasting…

Sourav kumar
14 min read · Aug 23, 2021

Walmart is an American multinational retail corporation, headquartered in Bentonville, Arkansas, that operates a chain of hypermarkets (also called supercenters), discount department stores, and grocery stores across the United States. Walmart is the world’s largest company by revenue, with US$548.743 billion, according to the 2020 Fortune Global 500 list.

Companies today are moving towards data-driven strategic decisions, which help them plan ahead of specific occasions and ultimately increase sales.

Table of Contents

  1. Business problem
  2. Mapping Business Problem to Machine Learning Problem
  3. Performance metric to be used
  4. Why only this Performance metric?
  5. Exploratory Data Analysis(EDA)
  6. Data Preprocessing
  7. Baseline model without any feature engineering
  8. Feature Engineering
  9. Feature Importance using Random Forest
  10. Baseline model after Feature Engineering
  11. Time series model for forecasting i.e Facebook Prophet
  12. Various Machine Learning Models
  13. Final Model
  14. Deployment
  15. References
  16. Future Work
  17. EndNote

1.Business problem

1.1 Description:

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line. We are provided with historical sales data for 45 Walmart stores located in different regions.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

Example of markdown events organised by Walmart

Credit : Kaggle

Problem Statement

Walmart has 45 stores, each with up to 99 departments. We have weekly sales data for each Store-Department pair from 2010–02–05 to 2012–11–01, and we have to predict weekly sales of each store-department pair from 2012–11–02 to 2013–07–26.

For more details of the problem, see the Kaggle competition page.

2. Mapping Business Problem to Machine Learning Problem

2.1 Dataset overview:

The data comes in four files.

a) train.csv — contains Store, Dept, Date, Weekly_Sales and IsHoliday, covering 2010–02–05 to 2012–11–01.

Store — the store number

Dept — the department number

Date — the week

Weekly Sales — sales for the given department in the given store

IsHoliday — whether the week is a special holiday week

Example of train.csv

b)features.csv — This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

Store — the store number

Date — the week

Temperature — average temperature in the region

Fuel Price — cost of fuel in the region

MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. Markdown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.

CPI — the consumer price index

Unemployment — the unemployment rate

IsHoliday — whether the week is a special holiday week

Example of features.csv

c)stores.csv — This file contains anonymized information about the 45 stores, indicating the type and size of store.

Example of stores.csv

d) test.csv — This file is identical to train.csv, except the weekly sales are withheld. You must predict the sales for each triplet of store, department, and date in this file.

Example of test.csv

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

2.2 Type of Machine Learning Problem

a. It is a time series problem: we have date-wise weekly sales for roughly two and three-quarter years (2010–02–05 to 2012–11–01) and have to forecast the following nine months (2012–11–02 to 2013–07–26).

b. It is also a regression problem: for a given store-dept-date triplet we have to predict weekly sales, which is a real value.

3.Performance metric to be used

Walmart has provided Weighted Mean Absolute Error (WMAE) as the evaluation metric:

WMAE = sum(wᵢ * |yᵢ − ŷᵢ|) / sum(wᵢ)

where yᵢ is the actual weekly sales, ŷᵢ is the predicted weekly sales, and the weight wᵢ is 5 if the week contains a holiday and 1 otherwise.

4.Why only this Performance metric?

First, let us see why RMSE, the most common regression metric, was not used here and why MAE was preferred instead:
• The dataset given is not large, so outlier points have more impact than usual, and RMSE, by squaring the errors, amplifies that impact even further.
• Let’s take an example of two models:

M1 with points : 11,11,11,110

M2 with points : 14, 14, 14, 106

Actual points are : 10,10,10,100

RMSE of M1 = sqrt((1 + 1 + 1 + 100)/4) = sqrt(25.75) ≈ 5.07

RMSE of M2 = sqrt((16 + 16 + 16 + 36)/4) = sqrt(21) ≈ 4.58

MAE of M1 = (1 + 1 + 1 + 10)/4 = 3.25

MAE of M2 = (4 + 4 + 4 + 6)/4 = 4.5

So we can clearly see that RMSE of M1 > RMSE of M2 while MAE of M1 < MAE of M2. Model 1 is actually the better model, since all of its outputs are close to the actual values except one, yet RMSE ranks it worse; MAE ranks the two models correctly here (the snippet below verifies these numbers).
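A quick check of the arithmetic above, using the same hypothetical predictions:

import numpy as np

actual = np.array([10, 10, 10, 100])
m1 = np.array([11, 11, 11, 110])   # model 1: small errors plus one outlier
m2 = np.array([14, 14, 14, 106])   # model 2: uniformly larger errors

rmse = lambda y, p: np.sqrt(np.mean((y - p) ** 2))
mae = lambda y, p: np.mean(np.abs(y - p))

print(rmse(actual, m1), rmse(actual, m2))  # ~5.07 vs ~4.58 -> RMSE prefers M2
print(mae(actual, m1), mae(actual, m2))    # 3.25 vs 4.50 -> MAE prefers M1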

Now, we use WMAE instead of plain MAE because we need to increase the impact of holiday-week sales, since Walmart plans events around the upcoming holidays. WMAE does exactly that by giving 5 times more weight to the weeks containing a holiday.

In short, RMSE squares the errors, so much larger weight is assigned to larger errors: an error of 10 counts 100 times as much as an error of 1. With MAE the error scales linearly, so an error of 10 counts only 10 times as much as an error of 1. With RMSE, a single outlier point can therefore produce a huge error term that dominates the overall score and makes the model look worse than it is.

5.Exploratory Data Analysis (EDA)

We have merged features.csv, stores.csv and train.csv into one DataFrame (final_data).
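A minimal sketch of the merge (file paths are assumptions; the join keys follow the columns the files share, and final_data is the name used throughout the rest of the analysis):

import pandas as pd

train = pd.read_csv('train.csv')
features = pd.read_csv('features.csv')
stores = pd.read_csv('stores.csv')

# train and features share Store, Date and IsHoliday; stores joins on Store only
final_data = (train
              .merge(features, on=['Store', 'Date', 'IsHoliday'], how='left')
              .merge(stores, on='Store', how='left'))
print(final_data.shape)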

There are three types of features available after merging the datasets:

features corresponding to each type of feature

a) Univariate analysis on the Store column: finding unique values of the Store column

There are 45 unique Walmart stores in total.

b) Univariate analysis on the Dept column: finding unique values of the Dept column

There are 99 unique departments across the Walmart stores.

c) Univariate analysis on the Weekly_Sales column: plotting it over Date
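A sketch of how this plot can be produced from the merged data:

import pandas as pd
import matplotlib.pyplot as plt

plot_data = final_data.copy()
plot_data['Date'] = pd.to_datetime(plot_data['Date'])
weekly = plot_data.groupby('Date')['Weekly_Sales'].sum()
weekly.plot(figsize=(15, 5), title='Total weekly sales over time')
plt.ylabel('Weekly_Sales')
plt.show()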

Weekly sales grouped by Date

Inference

1. We can see that weekly sales go very high in November and December, which suggests the holidays in those months create a huge impact on weekly sales.

2. There is a slight dip in sales at the end of December and the start of January.

3. There is clear seasonality in the plot of weekly sales over time.

4. Average weekly sales fluctuate around 4.5*10⁷.

d) Univariate analysis on the MarkDown1–MarkDown5 columns

Inference

1. We can see that no markdown values are available before late 2011.

2. All markdown values spike at the start (around January 2012); initially MarkDown3 is the highest, but after that MarkDown1 stays the highest throughout the dataset.

3. MarkDown3 has larger values than the other markdowns when viewed across the ~200,000 datapoints, so it may have created the biggest impact on weekly sales.

4. We will consider it an important feature in determining weekly sales, and we will also use a correlation matrix to check whether this inference holds.

e) Univariate analysis on the Type column

Inference

Type A stores have a higher median than the other store types, so weekly sales for store type A are higher than for the other types.

There are a total of 3 types of stores: Type A, Type B and Type C.

There are 45 stores in total. From the pie chart and bar plot, we can say that Type A stores are the largest and Type C the smallest.

f) Analysis of the Type column against weekly sales

# Distribution of total weekly sales for each store type
raw = final_data.groupby(['Type', 'Date', 'IsHoliday'])['Weekly_Sales'].sum().reset_index()
type_A = raw[raw['Type'] == 'A']
type_B = raw[raw['Type'] == 'B']
type_C = raw[raw['Type'] == 'C']
sns.distplot(type_A['Weekly_Sales'], label='type_A')
sns.distplot(type_B['Weekly_Sales'], label='type_B')
sns.distplot(type_C['Weekly_Sales'], label='type_C')
sns.set(rc={'figure.figsize': (12.7, 6.27)})
plt.legend()
plt.title('Distribution of weekly sales of type of store')
plt.show()
weekly sales distribution of each type

Inference

1. The weekly sales of the three store types are easy to separate, as their values lie in completely different ranges.

2. The mean for Type A is around 3*10⁷ and the distribution is roughly Gaussian.

3. The mean for Type B is around 1.5*10⁷ and the distribution is roughly Gaussian.

4. The mean for Type C is around 0.2*10⁷ and the distribution is roughly Gaussian; its variance is also very small.

g) Analysis of weekly sales vs. store, with holiday as hue

data4 = pd.concat([final_data['Store'], final_data['Weekly_Sales'], final_data['IsHoliday']], axis=1)
plt.figure(figsize=(20,6))
plt.title('Box Plot of Weekly Sales by Store Number and Holiday')
fig = sns.boxplot(x='Store', y='Weekly_Sales', data=data4, showfliers=False, hue='IsHoliday')
weekly sales of holiday vs non-holiday week (store-wise)

1. From the plot we can see that, for all stores, weekly sales in holiday weeks are higher than in non-holiday weeks.

2. So holiday weeks do create a huge impact and can be treated as an important feature in determining weekly sales.

6.Data Preprocessing

  1. Checking missing values and imputing them with the right imputation technique

We can see that there are no null values in train.csv itself. After merging, null values appear in MarkDown1–MarkDown5, CPI and Unemployment.
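A quick way to see this, assuming the merged DataFrame is named final_data:

print(final_data.isnull().sum().sort_values(ascending=False))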

Inference :

We can see that all the null values in the CPI and Unemployment columns occur after 2013–05–03, while our merged train dataset contains dates only up to 2012–10–26.

So we can conclude that these rows belong to the test period; since the test set does not contain these features itself, we have to merge these columns in and impute the missing CPI and Unemployment values to get a better score.

Imputation of the MarkDown columns in the final train data

The MarkDown columns also have null values, but these likely mean that no markdown events had been organised up to that date, so we impute 0 there.

final_data.loc[final_data.MarkDown1.isnull(), 'MarkDown1'] = 0
final_data.loc[final_data.MarkDown2.isnull(), 'MarkDown2'] = 0
final_data.loc[final_data.MarkDown3.isnull(), 'MarkDown3'] = 0
final_data.loc[final_data.MarkDown4.isnull(), 'MarkDown4'] = 0
final_data.loc[final_data.MarkDown5.isnull(), 'MarkDown5'] = 0

2. Imputation of the CPI and Unemployment columns in the features data

Problem: We have NaN values in the CPI and Unemployment columns after 2012–11–01, i.e., in the test period we don’t have CPI and Unemployment values.

Solution: We will use a time series model, Facebook Prophet, to predict CPI and Unemployment for the test period based on the train data available to us (a sketch follows).
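A sketch of how this could be done for one store’s CPI series (the same idea applies to Unemployment); the helper name, weekly frequency and per-store looping are assumptions, not the exact original code:

from prophet import Prophet  # on older installs: from fbprophet import Prophet

def forecast_cpi(store_features, n_missing_weeks):
    # Fit on the weeks where CPI is known and forecast the missing weekly values
    history = store_features.dropna(subset=['CPI'])[['Date', 'CPI']]
    history = history.rename(columns={'Date': 'ds', 'CPI': 'y'})
    m = Prophet()
    m.fit(history)
    future = m.make_future_dataframe(periods=n_missing_weeks, freq='W-FRI')
    forecast = m.predict(future)
    return forecast[['ds', 'yhat']].tail(n_missing_weeks)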

3. Checking for negative values in non-negative features and removing those rows

final_data[final_data.Weekly_Sales<0]
we can clearly see the rows where weekly sales are negative

Inference

1. There were 1,285 rows with negative weekly sales, so I will remove those rows.

2. The remaining non-negative features like CPI, Unemployment and Fuel_Price don’t have values < 0, so there is no need to remove any rows there.

7. Baseline model without any feature engineering

First, I will split the data into train and test sets (70:30).

final_data['Date']=pd.to_numeric(pd.to_datetime(final_data['Date']))
y = final_data['Weekly_Sales']
X = final_data.drop(['Weekly_Sales'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # Train:Test = 70:30 splitting.
# Final shapes.
print('Train shape:', X_train.shape, y_train.shape)
print('Test shape: ', X_test.shape, y_test.shape)

Defining the WMAE function for calculating the WMAE score.
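Below is a minimal sketch of such a function, using weights of 5 for holiday weeks and 1 otherwise (the exact implementation in the notebook may differ):

import numpy as np

def wmae(y_true, y_pred, is_holiday):
    # Holiday weeks get weight 5, all other weeks weight 1
    weights = np.where(is_holiday, 5, 1)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)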

Now apply a simple regression model over it (sketched below).
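As an illustration, a minimal linear-regression baseline might look like this sketch, fitting only the numeric columns since Type is still a string at this stage (the original baseline may differ):

from sklearn.linear_model import LinearRegression

num_cols = X_train.select_dtypes(include='number').columns  # skip non-numeric columns
lr = LinearRegression()
lr.fit(X_train[num_cols], y_train)
y_pred = lr.predict(X_test[num_cols])
print('Baseline test WMAE:', wmae(y_test.values, y_pred, X_test['IsHoliday'].values))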

Key takeaway:

We will keep this WMAE score as a baseline for further comparison. Now I have to do feature engineering and then apply some better models to reduce this WMAE score.

8. Feature Engineering

I will build some new features from the existing ones and encode the categorical features…

1. First come temperature bins, a categorical feature derived from the continuous numerical feature Temperature.

temp_90_100_f=final_data[((final_data.Temperature>90) & (final_data.Temperature< 100))].Weekly_Sales.sum()
temp_80_90_f=final_data[((final_data.Temperature>80) & (final_data.Temperature< 90))].Weekly_Sales.sum()
temp_70_80_f=final_data[((final_data.Temperature>70) & (final_data.Temperature< 80))].Weekly_Sales.sum()
temp_60_70_f=final_data[((final_data.Temperature>60) & (final_data.Temperature< 70))].Weekly_Sales.sum()
temp_50_60_f=final_data[((final_data.Temperature>50) & (final_data.Temperature< 60))].Weekly_Sales.sum()
temp_40_50_f=final_data[((final_data.Temperature>40) & (final_data.Temperature< 50))].Weekly_Sales.sum()
temp_30_40_f=final_data[((final_data.Temperature>30) & (final_data.Temperature< 40))].Weekly_Sales.sum()
temp=[temp_90_100_f,temp_80_90_f,temp_70_80_f,temp_60_70_f,temp_50_60_f,temp_40_50_f,temp_30_40_f]
temp_series=pd.Series(temp,index=['90-100','80-90','70-80','60-70','50-60','40-50','30-40'])

Now I will plot this to see what conclusions can be drawn for this feature (a plotting sketch follows).
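One way such a plot can be produced, reusing the temp_series computed above (a sketch):

import matplotlib.pyplot as plt

temp_series.sort_values(ascending=False).plot(kind='bar', figsize=(10, 4))
plt.title('Total weekly sales per temperature bin')
plt.ylabel('Total Weekly_Sales')
plt.show()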

Inference

1. Basically I will encode the bins on the basis of weekly sales: the 60–70 slot has the highest total weekly sales, so it gets the top rank.

2. The 90–100 slot has the lowest weekly sales, so it gets the last rank. I have sorted all the slots by their total weekly sales and encoded them accordingly; in the implementation below, each bin is replaced by its total weekly sales, which preserves this ordering. So, we will create this new feature.

def temp(data):
    data["Temp_bins"] = np.nan
    data.loc[((data.Temperature > 30) & (data.Temperature < 40)), 'Temp_bins'] = 649176675.9499999
    data.loc[((data.Temperature > 40) & (data.Temperature < 50)), 'Temp_bins'] = 964084701.44
    data.loc[((data.Temperature > 50) & (data.Temperature < 60)), 'Temp_bins'] = 1122389636.8200002
    data.loc[((data.Temperature > 60) & (data.Temperature < 70)), 'Temp_bins'] = 1292714835.6399999
    data.loc[((data.Temperature > 70) & (data.Temperature < 80)), 'Temp_bins'] = 1289690535.0500002
    data.loc[((data.Temperature > 80) & (data.Temperature < 90)), 'Temp_bins'] = 851502899.1800001
    data.loc[((data.Temperature > 90) & (data.Temperature < 100)), 'Temp_bins'] = 143672161.36
    # Temperatures outside the bins above get 0
    data.loc[data.Temp_bins.isnull(), 'Temp_bins'] = 0
    return data

final_data = temp(final_data)

2. Splitting Date into Year, Month, Week and Day

# Form Date, Year, Month, Week, Day
def split(data):
    data['Date'] = pd.to_datetime(data['Date'])
    data['Year'] = data['Date'].dt.year
    data['Month'] = data['Date'].dt.month
    data['Week'] = data['Date'].dt.week
    data['Day'] = data['Date'].dt.day
    return data

final_data = split(final_data)

3. Number of days left until Christmas

import datetime

def days_from_Christmas(x):
    if x['Year'] == 2010:
        diff = datetime.datetime(2010, 12, 31) - x['Date']
        return diff.days
    if (x['Year'] == 2011) and (x['Date'] < datetime.datetime(2011, 12, 30)):
        diff = datetime.datetime(2011, 12, 30) - x['Date']
        return diff.days
    else:
        return 0

final_data['diff_from_Christmas'] = final_data.apply(days_from_Christmas, axis=1)

4. Number of days left until Thanksgiving

def days_from_thanksgiving(x):
    if (x['Year'] == 2010) and (x['Date'] < datetime.datetime(2010, 11, 26)):
        diff = datetime.datetime(2010, 11, 26) - x['Date']
        return diff.days
    if (x['Year'] == 2011) and (x['Date'] < datetime.datetime(2011, 11, 25)):
        diff = datetime.datetime(2011, 11, 25) - x['Date']
        return diff.days
    else:
        return 0

final_data['days_from_thanksgiving'] = final_data.apply(days_from_thanksgiving, axis=1)

5. Label encoding based on the 4 different holiday weeks

We can see that only 4 weeks of the year contain a special holiday, so we will create a new feature that encodes which of those holiday weeks (if any) a row falls in.

def holiday_type(x):
    if (x['IsHoliday'] == 1) & (x['Week'] == 6):
        return 1  # Super Bowl
    elif (x['IsHoliday'] == 1) & (x['Week'] == 36):
        return 2  # Labor Day
    elif (x['IsHoliday'] == 1) & (x['Week'] == 47):
        return 3  # Thanksgiving
    elif (x['IsHoliday'] == 1) & (x['Week'] == 52):
        return 4  # Christmas
    else:
        return 0

# Encoding the type of holiday week into IsHoliday_bins
final_data['IsHoliday_bins'] = final_data.apply(holiday_type, axis=1)

6. Counting the number of US holidays in each week

import holidays

# Collect all US holiday dates for 2010-2012
dates = []
for ptr in holidays.US(years=2010).items():
    dates.append(ptr[0])
for ptr in holidays.US(years=2011).items():
    dates.append(ptr[0])
for ptr in holidays.US(years=2012).items():
    dates.append(ptr[0])

# For each row, count the holidays falling in the week ending on that date
holiday_count = []
for index, row in final_data.iterrows():
    dat = final_data['Date'][index]
    dt = []
    for i in range(0, 5):
        dt.append(dat - datetime.timedelta(days=i))
    for i in range(1, 3):
        dt.append(dat + datetime.timedelta(days=i))
    count = 0
    for date in dates:
        if date in dt:
            count += 1
    holiday_count.append(count)

final_data['Holidays'] = holiday_count

7. Label encoding of the Type feature

final_data.loc[(final_data.Type=='A') ,'Type']= 1
final_data.loc[(final_data.Type=='B') ,'Type']= 2
final_data.loc[(final_data.Type=='C') ,'Type']= 3

8. Label encoding of the IsHoliday feature

final_data.loc[(final_data.IsHoliday==True) ,'IsHoliday']= 1
final_data.loc[(final_data.IsHoliday==False) ,'IsHoliday']= 0

9. Feature Importance using Random Forest

Feature importance in decreasing order

The top 10 features according to a RandomForestRegressor, in decreasing order of importance, are:

Dept>Size>Store>Week>CPI>Type>Unemployment>Temperature>Day>IsHoliday_bins
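A sketch of how such a ranking can be obtained from a fitted RandomForestRegressor, assuming X_train and y_train hold the feature-engineered training split (the hyperparameters here are assumptions):

from sklearn.ensemble import RandomForestRegressor
import pandas as pd

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))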

10.Baseline model after Feature Engineering

We can see that there is some decrease in the WMAE score after feature engineering.

11.Time series model for forecasting, i.e. Facebook Prophet

What is Prophet?

“Prophet” is an open-source library, released in 2017 and available in R and Python, that helps users analyze and forecast time-series values. Thanks to the developers’ efforts to make time-series analysis possible without expert work, it is highly user-friendly yet still highly customizable, even for non-expert users. How lovely!!

Prophet can handle:

  • the trend with its changepoints,
  • seasonality (yearly, weekly, daily, and other user-defined seasonality),
  • holiday effect, and
  • input regressors

In our case, I have used input regressors; basically, we can add extra features to increase the accuracy of the prediction.

Source code:
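What follows is a simplified sketch of forecasting a single Store-Dept series with extra regressors, not the exact original code; the regressor choice and seasonality settings are assumptions:

from prophet import Prophet  # on older installs: from fbprophet import Prophet

def fit_prophet(series_train, series_test, regressors=('Size', 'CPI', 'Unemployment')):
    # series_train needs Date, Weekly_Sales and the regressor columns;
    # series_test needs Date and the same regressor columns.
    train_df = series_train.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'})
    m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
    for reg in regressors:
        m.add_regressor(reg)
    m.fit(train_df)
    future = series_test.rename(columns={'Date': 'ds'})
    forecast = m.predict(future)
    return forecast[['ds', 'yhat']]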

From Facebook Prophet, I got a test WMAE score of 3731.

12.Various Machine Learning Models

We have used the following machine learning algorithms, after hyperparameter tuning, to compute the WMAE score (a tuning sketch follows the list):

  1. Linear Regression
  2. KNeighborsRegressor
  3. RandomForestRegressor
  4. LGBMRegressor
  5. CatBoostRegressor
  6. XGBRegressor
  7. StackingRegressor
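For illustration, the kind of tuning used for the tree-based models might look like the sketch below; the parameter grid, CV setup and the plain-MAE proxy scorer are assumptions, not the exact original configuration:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [80, 100, 120, 140, 160],
    'max_depth': [15, 20, 25, 27, 30],
    'min_samples_split': [2, 5, 10],
}
# WMAE is not a built-in scorer, so plain MAE is used here as a stand-in
search = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1), param_dist,
                            n_iter=10, cv=3, scoring='neg_mean_absolute_error',
                            random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)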

Important key analysis: I got a better WMAE score when using the features chosen based on domain knowledge.

The top 10 features selected based on model-based feature importance are:

['Store','Dept','Size','Week','Type','Day','IsHoliday_bins','Temperature','CPI','Unemployment']

The top features selected based on domain knowledge are:

['Store','Dept','IsHoliday','Size','Week','Type','Day','Year','Holidays']

13.Final Model

Our final model is RandomForestRegressor, as it gives the lowest WMAE score on our train data. So I will use this model to predict weekly sales on the test (unseen) data.

from sklearn.ensemble import RandomForestRegressor

final_data_train = final_data_train[['Store','Dept','IsHoliday','Size','Week','Type','Year','Weekly_Sales','Holidays','Day']]
final_data_test = final_data_test[['Store','Dept','IsHoliday','Size','Week','Type','Year','Holidays','Day']]
final_data_train['IsHoliday'] = final_data_train['IsHoliday'].astype('bool')
final_data_test['IsHoliday'] = final_data_test['IsHoliday'].astype('bool')
final_data_train['Type'] = final_data_train['Type'].astype('int')
final_data_test['Type'] = final_data_test['Type'].astype('int')

y = final_data_train['Weekly_Sales']
X = final_data_train.drop(['Weekly_Sales'], axis=1)

rf_Model = RandomForestRegressor(n_estimators=140, max_depth=27, n_jobs=-1)
rf_Model.fit(X, y)
y_hat = rf_Model.predict(final_data_test)

Kaggle submission

I scored a WMAE of 2638.69 on the Kaggle submission, ranking within the top 5% on the leaderboard.

14.Deployment

I have deployed this ML model as a Flask API on localhost. It takes the store number, the department number, whether the week is a holiday or not, and the date; it then derives all the remaining features from this input and uses the model .pkl file to predict on the given raw data.
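A minimal sketch of what such a Flask API could look like; build_features is a hypothetical helper that derives the remaining features (Size, Type, Week, etc.) from the raw input, and model.pkl is the saved RandomForestRegressor:

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load('model.pkl')  # the trained RandomForestRegressor

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON such as {"Store": 1, "Dept": 1, "Date": "2012-11-02", "IsHoliday": false}
    raw = request.get_json()
    features = build_features(raw)  # hypothetical helper deriving all remaining features
    prediction = model.predict(pd.DataFrame([features]))[0]
    return jsonify({'Weekly_Sales': float(prediction)})

if __name__ == '__main__':
    app.run(debug=True)  # served on localhost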

15.References

16.Future Work

  1. We can try various deep learning techniques like RNN, LSTM, GRU.
  2. There is a lot more to explore in Facebook Prophet. I have mostly used the defaults plus one add-on (extra regressors), but there is more to add, such as the holiday effect and seasonality (yearly, weekly, daily, and other user-defined seasonalities).

17.EndNote

I would like to thank the Applied AI team for their tremendous support and guidance, which enabled me to complete this end-to-end case study on such an interesting problem statement.

You can find complete code on my Github here

Happy to connect with you on LinkedIn
