Walmart Store Sales Forecasting
Using Facebook Prophet and various machine learning algorithms(regression) for sales forecasting…
Walmart is an American multinational retail corporation that operates a chain of hypermarkets (also called supercenters), discount department stores, and grocery items from the United States, headquartered in Bentonville, Arkansas. Walmart is the world’s largest company by revenue, with US$ 548.743 billion, according to the Fortune Global 500 list in 2020.
In today’s era companies are moving towards data driven strategic decision, it helps them to plan various strategies prior to any specific occasion that ultimately increases the Sales.
Table of content
- Business problem
- Mapping Business Problem to Machine Learning Problem
- Performance metric to be used
- Why only this Performance metric?
- Exploratory Data Analysis(EDA)
- Data Preprocessing
- Baseline model without any feature engineering
- Feature Engineering
- Feature Importance using Random Forest
- Baseline model after Feature Engineering
- Time series model for forecasting i.e Facebook Prophet
- Various Machine Learning Models
- Final Model
- Deployment
- References
- Future Work
- EndNote
1.Business problem
1.1 Description:
One challenge of Walmart in retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line. We are provided with historical sales data for 45 Walmart stores located in different regions.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
Credit : Kaggle
Problem Statement
Walmart has 45 stores and each store has 99 departments. We have data of weekly sales of each Store-Department pair from 2010–02–05 to 2012–11–01 and we have to predict weekly sales of each store-department pair from 2012–11–02 to 2013–07-26.
2. Mapping Business Problem to Machine Learning Problem
2.1 Dataset overview:
It has 4 datasets.
a) train.csv- it has store,dept,date,weekly_sales,isHoliday which covers 2010–02–05 to 2012- 11–01.
Store — the store number
Dept — the department number
Date — the week
Weekly Sales — sales for the given department in the given store
IsHoliday — whether the week is a special holiday week
b)features.csv — This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store — the store number
Date — the week
Temperature — average temperature in the region
Fuel Price — cost of fuel in the region
MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. Markdown data is only available after Nov 2011,and is not available for all stores all the time. Any missing value is marked with an NA.
CPI — the consumer price index
Unemployment — the unemployment rate
IsHoliday — whether the week is a special holiday week
c)stores.csv — This file contains anonymized information about the 45 stores, indicating the type and size of store.
d)test.csv — This file is identical to train.csv, except we have withheld the weekly sales.You must predict the sales for each triplet of store, department, and date in this file
For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
2.2 Type of Machine Learning Problem
a.It is a time series problem as it has date-wise weekly sales of around 2 years and we have to predict for next 1 year
b.It is also a regression problem as for a given triplet of store-dept-date we have to predict weekly sales which is a real value.
3.Performance metric to be used
Walmart has provided Weighted Mean Absolute Error (WMAE) metric, the mathematical function for which is shown below.
4.Why only this Performance metric ?
First of all we have see why we have not used RMSE as a performance metric here as it is most famous one then also we preferred MAE over it because of following reasons :
• The Dataset given is not large dataset so the impact of outlier points will impact more than usual. So, RMSE can increase the impact of outliers double almost.
• Lets take an example of two model:
M1 with points : 11,11,11,110
M2 with points : 14,14,,14,106
Actual points are : 10,10,10,100
RMSE of M1= 1/4*(sqrt(1+1+1+100)) =2.54
RMSE of M2= 1/4*(sqrt(16+16+16+36))=2.29
MAE of M1 = 1/4*(1+1+1+10) =3.25
MAE of M2 = 1/4*(4+4+4+6) =4.5
So, we can clearly see that RMSE of M1> RMSE of M2 and
MAE of M1< MAE of M2 and actually my model 1 is better as all outputs are close to actual one except one but RMSE is giving wrong results here
Now here we are using WMAE instead of normal MAE as we need to increase the impact of Holidays week sales as Walmart plans events according to the holidays coming. And clearly WMAE does that as it gives 5 times more weightage to the weeks where there is a holiday.
In short in RMSE the errors are squared means that a much larger weight is assigned to larger errors. So, an error of 10, is 100 times worse than an error of 1. When using the MAE, the error scales linearly. Therefore, an error of 10, is 10 times worse than an error of 1.So, just because of single outlier point I can get a huge error term which ultimately can increase the overall error and make model worse.
5.Exploratory Data Analysis(EDA)
There are three type of features available after merging the dataset:
a).Univariate analysis on Store column: — -> finding unique values of Store column
b). Univariate analysis on Department column : — →finding unique values of Department column
c). Univariate analysis on Weekly sales column : — →plotting it over date
Inference
1.We can see that in month of November and December weekly sales are going very high which means that maybe holidays that are in November and December are creating a huge impact on weekly sales
2.There is slight dip in sales in last december and start of january also
3.There is a seasonality in plot of weekly sales over the period of time which can be easily seen by plot
4.Average weekly sales are fluctuating around 4.5*10⁷
d.)Univariate analysis on markdown1, markdown2, markdown3, markdown4, markdown5 column
Inference
1.We can see that till December 2011 there were no markdown values available
2.All markdown values are giving a spike at start like in month of january 2012 and in start markdown 3 is highest but after start markdown 1 is high throughout dataset
3.Markdown 3 has more values as compared to other markdown when we are seeing in 200000 datapoints .So,we can conclude that it has created a biggest impact as compared to other markdown on weekly sales.
4.We will consider it as a important feature in determining weekly sales.We will use correlation matrix also see whether this inference is true or not
e.)Univariate analysis on type column
Inference
We came to know that Type A stores have their medians higher than any other medians in other store types, so the weekly sales for store type A is more than other store types.
There are a total of 3 types of stores: Type A, Type Band Type
There are 45 stores in total.By piechart and barplot, we can say that type A store is the largest store and C is the smallest.
f).Analysis of ‘Type’ column with weekly sales
raw=final_data.groupby([‘Type’,’Date’,’IsHoliday’])[‘Weekly_Sales’].sum().reset_index()
type_A=raw[raw[‘Type’]==’A’]
type_B=raw[raw[‘Type’]==’B’]
type_C=raw[raw[‘Type’]==’C’]
sns.distplot(type_A[‘Weekly_Sales’],label=’type_A’)
sns.distplot(type_B[‘Weekly_Sales’],label=’type_B’)
sns.distplot(type_C[‘Weekly_Sales’],label=’type_C’)
sns.set(rc={‘figure.figsize’:(12.7,6.27)})
plt.legend()
plt.title(‘Distribution of weekly sales of type of store’)
plt.show()
Inference
1.It is very easy to separate the values of weekly sales values of all type of store as the values are completely in different zones
2.Mean value of type A is around 3*10⁷ and is Gaussian distributed
3.Mean value of type B is around 1.5*10⁷ and is Gaussian distributed
4.Mean value of type C is around 0.2*10⁷ and is Gaussian distributed .Its variance is also very less
d.)Analysis of weekly sales vs store with hue as holiday
data4 = pd.concat([final_data['Store'], final_data['Weekly_Sales'], final_data['IsHoliday']], axis=1)
plt.figure(figsize=(20,6))
plt.title('Box Plot of Weekly Sales by Store Number and Holiday')
fig = sns.boxplot(x='Store', y='Weekly_Sales', data=data4, showfliers=False, hue='IsHoliday')
1.From plot we can see that weekly sales of all store on holiday is higher as compared to non holiday week.
2.So,yes holiday weeks are creating huge impact and can be treated as important feature also in determining the weekly sales
6.Data Preprocessing
- Checking missing values and imputing with right imputation technique
Inference :
We can see that all the null values in CPI and unemployment columns are after 2013–05–03 and in our train dataset or merged dataset contains date up to 2012–10–26.
So, from here we can conclude that these features are for test dataset as for test dataset we don’t have features in it so we have to merge these column and impute this CPI and unemployment value to get better score .
Imputation of markdown column in final train data
Null values are in markdown also but it may mean that till that date no markdown events have organised so we will impute 0 there.
final_data.loc[final_data.MarkDown1.isnull() ,'MarkDown1']= 0
final_data.loc[final_data.MarkDown2.isnull() ,'Markdown2']= 0
final_data.loc[final_data.MarkDown3.isnull() ,'Markdown3']= 0
final_data.loc[final_data.MarkDown4.isnull() ,'Markdown4']= 0
final_data.loc[final_data.MarkDown5.isnull() ,'Markdown5']= 0
2.Imputation of CPI and Unemployment column in features data
Problem : We are having NaN values for CPI and Unemployment column after 2012–11–01 which is basically in the test data we don’t have CPI values
Solution : We will use time series model i.e Facebook Prophet to predict CPI and Unemployment for the test data based on train data i.e available to us
4.checking negative values of non-negative feature and removing those rows
final_data[final_data.Weekly_Sales<0]
Inference
1.We have seen that 1285 rows were there where weekly sales were negative so I will replace those rows.
2.Rest all non-negative features like CPI, Unemployment, Fuel price doesn't have value<0.So no need to remove any rows there.
7. Baseline model without any feature engineering
First, I will split the data into train and test into 70–30%
final_data['Date']=pd.to_numeric(pd.to_datetime(final_data['Date']))
y = final_data['Weekly_Sales']
X = final_data.drop(['Weekly_Sales'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # Train:Test = 70:30 splitting.
# Final shapes.
print('Train shape:', X_train.shape, y_train.shape)
print('Test shape: ', X_test.shape, y_test.shape)
Defining the WMAE function for calculation WMAE score.
Now apply a simple regression model over it.
Key takeaway:
We will keep this WMAE score as a baseline for our further calculation, Now I have to do feature engineering and then apply some better models to reduce thus wmae score.
8. Feature Engineering
I will build some new features from the existing ones and doing encoding of categorical features…
1.First will be temperature bins which will be the categorical feature from the numerical continuous feature i.e Temperature
temp_90_100_f=final_data[((final_data.Temperature>90) & (final_data.Temperature< 100))].Weekly_Sales.sum()
temp_80_90_f=final_data[((final_data.Temperature>80) & (final_data.Temperature< 90))].Weekly_Sales.sum()
temp_70_80_f=final_data[((final_data.Temperature>70) & (final_data.Temperature< 80))].Weekly_Sales.sum()
temp_60_70_f=final_data[((final_data.Temperature>60) & (final_data.Temperature< 70))].Weekly_Sales.sum()
temp_50_60_f=final_data[((final_data.Temperature>50) & (final_data.Temperature< 60))].Weekly_Sales.sum()
temp_40_50_f=final_data[((final_data.Temperature>40) & (final_data.Temperature< 50))].Weekly_Sales.sum()
temp_30_40_f=final_data[((final_data.Temperature>30) & (final_data.Temperature< 40))].Weekly_Sales.sum()temp=[temp_90_100_f,temp_80_90_f,temp_70_80_f,temp_60_70_f,temp_50_60_f,temp_40_50_f,temp_30_40_f]
temp_series=pd.Series(temp,index=['90-100','80-90','70-80','60-70','50-60','40-50','30-40'])
Now I will plot this to see what conclusion can be made for this feature.
Inference
1.Basically I will do label encoding but on basis of weekly sales like here we can see that the 60_70 slot is having the highest weekly sales,so I am giving it label 1
2.90_100 slot is having lowest weekly sales ,so I am giving it last label i.e 7.So, I have sorted each and every slots and given them label accordingly. So, we will create this new feature.
def temp(data):
data["Temp_bins"] = np.nan
data.loc[((final_data.Temperature>30) & (data.Temperature< 40)) ,'Temp_bins']= 649176675.9499999
data.loc[((final_data.Temperature>40) & (data.Temperature< 50)) ,'Temp_bins']= 964084701.44
data.loc[((final_data.Temperature>50) & (data.Temperature< 60)) ,'Temp_bins']= 1122389636.8200002
data.loc[((final_data.Temperature>60) & (data.Temperature< 70)) ,'Temp_bins']= 1292714835.6399999
data.loc[((final_data.Temperature>70) & (data.Temperature< 80)) ,'Temp_bins']= 1289690535.0500002
data.loc[((final_data.Temperature>80) & (data.Temperature< 90)) ,'Temp_bins']= 851502899.1800001
data.loc[((final_data.Temperature>90) & (data.Temperature< 100)),'Temp_bins']= 143672161.36
data.loc[data.Temp_bins.isnull() ,'Temp_bins']= 0
return data
final_data=temp(final_data)
2.Splitting Date into month, year,day, week
# Form Date, Year, Month, Week, Day
def split(data):
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month']= data['Date'].dt.month
data['Week'] = data['Date'].dt.week
data['Day'] = data['Date'].dt.day
return data
final_data=split(final_data)
3..Number of days left from Christmas
import datetime
def days_from_Christmas(x):
if x['Year']== 2010 :
diff=datetime.datetime(2010, 12, 31)-x['Date']
return diff.days
if ((x['Year']== 2011) and (x['Date']< datetime.datetime(2011, 12, 30))):
diff=datetime.datetime(2011, 12, 30)-x['Date']
return diff.days
else:
return 0
final_data['diff_from_Christmas'] = final_data.apply(days_from_Christmas_for_train, axis=1)
4.Number of days left from Thanksgiving
def days_from_thanksgiving(x):
if ((x['Year']== 2010) and (x['Date']< datetime.datetime(2010, 11, 26))):
diff=datetime.datetime(2010, 11, 26)-x['Date']
return diff.days
if ((x['Year']== 2011) and (x['Date']< datetime.datetime(2011, 11, 25))):
diff=datetime.datetime(2011, 11, 25)-x['Date']
return diff.days
else:
return 0
final_data['days_from_thanksgiving'] = final_data.apply(days_from_thanksgiving_for_train, axis=1)
5.Doing label encoding based on 4 different holiday week
We can see that there is only 4 weeks which has holiday means for every year this week is the only week which contains special holiday so we will create a new feature in which we will keep this week and if it has holiday.
def holiday_type(x):
if (x['IsHoliday']== 1) & (x['Week']==6):
return 1 #SuperBowl
elif (x['IsHoliday']== 1) & (x['Week']==36):
return 2 #LaborDay
elif (x['IsHoliday']== 1) & (x['Week']==47):
return 3 #Thanksgiving
elif (x['IsHoliday']== 1) & (x['Week']==52):
return 4 #Christmas
else:
return 0
#converting IsHoliday into 0 and 1
final_data['IsHoliday_bins'] = final_data.apply(holiday_type, axis=1)
6. Counting number of holidays in a week for US.
dates =[]
for ptr in holidays.US(years = 2010).items():
dates.append(ptr[0])
for ptr in holidays.US(years = 2011).items():
dates.append(ptr[0])
for ptr in holidays.US(years = 2012).items():
dates.append(ptr[0])
holiday_count=[]
for index, row in final_data.iterrows():
dat = final_data['Date'][index]
dt=[]
for i in range(0,5):
dt.append(dat - datetime.timedelta(days = i))
for i in range(1,3):
dt.append(dat + datetime.timedelta(days = i))
count = 0
for date in dates:
if date in dt:
count +=1
holiday_count.append(count)
final_data['IsHoliday_bins'] = final_data.apply(holiday_type, axis=1)
7.Doing label encoding of Type feature
final_data.loc[(final_data.Type=='A') ,'Type']= 1
final_data.loc[(final_data.Type=='B') ,'Type']= 2
final_data.loc[(final_data.Type=='C') ,'Type']= 3
8.Doing label encoding of Type feature
final_data.loc[(final_data.IsHoliday==True) ,'IsHoliday']= 1
final_data.loc[(final_data.IsHoliday==False) ,'IsHoliday']= 0
9. Feature Importance using Random Forest
The top 10 features according to the model RandomForestRegressor in decreasing order are :
Dept>Size>Store>Week>CPI>Type>Unemployment>Temperature>Day>IsHoliday_bins
10.Baseline model after Feature Engineering
11.Time series model for forecasting i.e Facebook Prophet
What is Prophet?
“Prophet” is an open-sourced library available on R or Python which helps users analyze and forecast time-series values released in 2017. With developers’ great efforts to make the time-series data analysis be available without expert works, it is highly user-friendly but still highly customizable, even to non-expert users. How lovely!!
Prophet can handle;
- the trend with its changepoints,
- seasonality (yearly, weekly, daily, and other user-defined seasonality),
- holiday effect, and
- input regressors
In our case, I have taken input regressor basically we can add features to increase the accuracy of prediction.
Source code:
From Facebook Prophet, I got test WMAE Score of 3731
12.Various Machine Learning Models
We have used these machine learning algorithms after doing hyperparameter tunning to calculate the WMAE score.
- Linear Regression
- KNeighborsRegressor
- RandomForestRegressor
- LGBMRegressor
- CatBoostRegressor
- XGBRegressor
- StackingRegressor
Important key-analysis: I got better wmae score when we used feature based on our domain knowledge
The top 10 features selected based on model-based feature importance are:
['Store','Dept','Size','Week','Type','Day''IsHoliday_bins','Temperature','CPI','Unemployment']
The top features selected based on domain knowledge are:
['Store','Dept','IsHoliday','Size','Week','Type','Day','Year','Holidays']
13.Final Model
Our final model will be RandomForestRegressor as it is giving the lowest WMAE score on our Train data.So, I will use this model for the prediction of weekly sales on test data i.e unseen data.
from sklearn.ensemble import RandomForestRegressorfinal_data_train=final_data_train[['Store','Dept','IsHoliday','Size','Week','Type','Year','Weekly_Sales','Holidays','Day']]final_data_test=final_data_test[['Store','Dept','IsHoliday','Size','Week','Type','Year','Holidays','Day']]final_data_train['IsHoliday']=final_data_train['IsHoliday'].astype('bool')final_data_test['IsHoliday']=final_data_test['IsHoliday'].astype('bool')final_data_train['Type']=final_data_train['Type'].astype('int')final_data_test['Type']=final_data_test['Type'].astype('int')y = final_data_train['Weekly_Sales']X = final_data_train.drop(['Weekly_Sales'], axis=1)rf_Model = RandomForestRegressor(n_estimators= 140,max_depth=27,n_jobs = -1)rf_Model.fit(X, y)y_hat= rf_Model.predict(final_data_test)
Kaggle submission
I have scored WMAE score of 2638.69 on Kaggle submission and ranked under the top 5% on the leaderboard.
14.Deployment
I have deployed this ML model using flask API on localhost. I have taking store number, department number is the week is a holiday or not, and what is the date and then finding all rest features with help of this data and using the model. pkl file to predict on given raw data.
15.References
16.Future Work
- We can try various deep learning techniques like RNN, LSTM, GRU.
- There is a lot to explore in facebook prophet,I have just used the default one along with one extra addon i.e adding Regressor but there is more to add like holiday effect, seasonality (yearly, weekly, daily, and other user-defined seasonality).
17.EndNote
I would like to thank team Applied AI for their tremendous support and guideline because of which I reached up to this position to do a complete end to end case study on such a interesting problem statement.
You can find complete code on my Github here
Happy to connect with you on LinkedIn