Walmart Store Sales Forecasting - Kaggle Challenge

Ayswarya
24 min read · Aug 18, 2021



Let’s understand the basics…..

Consider yourself the CFO (Chief Financial Officer) of Walmart, whose responsibility is to increase the ROI (Return on Investment). As the CFO, the retail company relies on you to carefully plan and invest in its inventory.

There are various things that influence sales, like product promotions, weather, competitor actions, festive demand, etc. For example, people prefer swimming at the beach when it's warm rather than in winter, so sales of swimming goods are likely to shoot up in summer, not in winter.

There are key points we need to understand while predicting the sales.

  • It is easier to make forecasts for the near future than for 200 days from now; the farther into the future we look, the less accurate we are.
  • It is better to use multiple techniques for forecasting than to rely on just one.
  • It is challenging to predict sales when we have little historical data; the more sales history at hand, the deeper the patterns in customer demand we can analyze.

“No amount of sophistication is going to allay the fact that all of your knowledge is about the past and all your decisions are about the future.”

— former GE exec Ian Wilson

Business Problem

The sales prediction problem was launched by Walmart via Kaggle as a part of its recruitment process, to predict weekly sales for every department of all its stores. We were given historical sales data of 45 stores and 81 departments per store.

How does Machine Learning come into the picture?

Why not other methods like moving averages, where we can just use Excel to forecast?

How is machine learning useful in this scenario? There is no clear-cut pattern humans have set for buying and selling. However, backed by datasets spanning years into the past, it is highly possible to draw patterns in sales and consumption. Machine learning becomes prominent here because of its ability to mine through years of data to detect patterns and repetitive behavior, which can then be leveraged to forecast sales and demand.

With conventional methods like moving averages and exponential moving averages, we assume that future sales will be close to recent past values, so we do not capture the external driving factors (weather conditions, holiday weeks, seasons, promotional events, etc.) that cause abrupt spikes in sales in some weeks. Using machine learning we are not just able to capture abnormalities in the data; these algorithms are also self-correcting, i.e., as newer data gets added in the future, the model adapts to it and learns new uncertainties in the data.

Business Objectives and Constraints

1. Robustness: ensured using cross-validation.

2. Interpretability: as we forecast sales and hand them over to businesses, it is important to answer the whys and hows.

3. Accurate predictions matter, as they impact business decisions.

4. No latency requirements, i.e., there is no demand to predict sales in milliseconds.

5. Models should be evolutionary, as consumer demand and supply can change over time.

Data Overview

Data was collected from 2010–2012 for 45 Walmart stores. We are tasked with predicting department-wise sales for each store. We are provided with the datasets: stores.csv, train.csv, test.csv, features.csv.

Dataset Size

Trainset: 421,570 rows (12.2 MB)

Testset: 115,064 rows (2.47 MB)

Stores : 4KB

Features: 580 KB

Column Descriptions

Stores :

Type: the type of the store, namely "A", "B" or "C"

Size: the size of the store; size refers to the number of products inside the store. Range: 34,000–210,000.

Sales (Trainset/Testset):

Store — The store number. Range 1- 45.

Dept — the department number, one of 1–99.

Date — the week

Weekly_Sales — sales for the given department in the given store

IsHoliday — whether the week is a special holiday week.

Features :

Temperature — average temperature in the region during that week.

Fuel_Price — cost of fuel in the region during that week.

MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA. Each value represents the markdown quantity available during that week; there are five such columns (MarkDown1–5).

CPI — the consumer price index during that week.

Unemployment — the unemployment rate during that week

Understanding the type of Machine Learning Problem that we are dealing with ….

It's a regression problem where we are given time-series data (from 05-02-2010 to 26-10-2012) with multiple categorical and numerical features. We are tasked with predicting department-wise sales for each store from 02-11-2012 to 26-07-2013. In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

Performance Metrics

"If you can't measure it, you can't improve it." — Peter Drucker

Why not MSE/RMSE/MAPE ??

WMAE and MAE seem to be good choices for the performance metric, as they are more robust to outlier points than RMSE or MSE. We are going to train multiple models and want to check how well each model explains the variance of the outcome, hence we also use R² (R-squared) to compare relative model performance.

Why don't we choose MAPE for this study?
MAPE produces an undefined outcome when actual values are 0 (the term |0 − prediction| / 0 is undefined). We can also get MAPE values above 100%, which are hard to interpret, and MAPE treats under-forecasts and over-forecasts differently.

MAPE > 100% means that the errors are "much larger" than the actual values (e.g., actual is 2, you predict 6, hence that error term is 200%). Under-forecasts are capped at 100%, while over-forecasts can grow without bound. Hence MAPE doesn't seem to be the best choice.
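To make this asymmetry concrete, here is a small illustration with toy numbers (not the competition data); the WMAE weighting of 5 for holiday weeks follows the competition's definition:

import numpy as np

def mape(actual, pred):
    # mean absolute percentage error: undefined whenever an actual value is 0
    actual, pred = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(actual - pred) / actual) * 100

def wmae(actual, pred, is_holiday):
    # weighted MAE: holiday weeks get weight 5, non-holiday weeks weight 1
    w = np.where(is_holiday, 5, 1)
    return np.sum(w * np.abs(np.asarray(actual) - np.asarray(pred))) / np.sum(w)

print(mape([2, 2], [0, 0]))   # 100.0 -> an under-forecast is capped at 100%
print(mape([2, 2], [6, 6]))   # 200.0 -> an over-forecast of the same size is penalised twice as much
print(wmae([0, 100], [50, 110], is_holiday=[False, True]))   # 16.67 -> stays finite even with a zero actual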

EXPLORATORY DATA ANALYSIS

Data Loading

### loading our trainset and test set
import pandas as pd

trainset = pd.read_csv("train.csv")
testset = pd.read_csv("test.csv")
featureset = pd.read_csv("features.csv")
storeset = pd.read_csv("stores.csv")

## Check number of rows and columns
print("Shape of trainset: ", trainset.shape)
print("Shape of testset: ", testset.shape)
Shape of trainset: (421570, 5)
Shape of testset: (115064, 4)

Identifying Variables and Datatype

We loaded our CSV files features, store, train, and test in a pandas data frame, and took a look at the columns we are going to deal with. In the dataset, we have numerical variables Weekly_Sales, MarkDown1:5, Temperature, Fuel_price, CPI, Unemployment,Size. Categorical variables are IsHoliday, Type, Store, and Department.

Data Cleaning and Preprocessing

How can null values affect our data?
Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle missing values, so this problem must be addressed prior to modeling. Missing/null values in a dataset also raise "ValueError" problems while modeling.

## let's check missing values in our feature set
print('*' * 50)
print('Unique values in feature:', featureset.nunique())
print('*' * 50)
print('Null values check in Feature :')
print(featureset.apply(lambda x: sum(x.isnull()), axis=0))

We can see that CPI, Unemployment and the markdown features have null values; we can replace them with a rolling mean with period = 20. In our trainset the number of unique stores is 45, unique departments 81, and unique dates 143, so the trainset should contain a total of 45 x 81 x 143 = 521,235 data points. However, our dataset has only 421,570 data points.

Theory : There must be missing store,department and date combinations!!!

Upon finding that there are a few missing combinations, we decided to treat them by adding extra rows to our trainset for the missing store, department, date combinations, marking their weekly sales as 0 and the 'IsHoliday' column as False; a minimal sketch of this step follows, and then we'll dig into treating the markdown columns…
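A minimal sketch of how these missing rows can be generated, assuming the raw trainset is loaded as trainset with columns Store, Dept, Date, Weekly_Sales, IsHoliday (the exact implementation in the project notebook may differ):

import itertools
import pandas as pd

# full grid of every Store x Dept x Date combination (45 x 81 x 143 = 521,235 rows)
all_combos = pd.DataFrame(list(itertools.product(trainset.Store.unique(),
                                                 trainset.Dept.unique(),
                                                 trainset.Date.unique())),
                          columns=['Store', 'Dept', 'Date'])

# left-join the actual sales onto the grid; missing combinations get NaN sales
full_train = all_combos.merge(trainset, on=['Store', 'Dept', 'Date'], how='left')

# fill the newly created rows: zero sales, non-holiday
full_train['Weekly_Sales'] = full_train['Weekly_Sales'].fillna(0)
full_train['IsHoliday'] = full_train['IsHoliday'].fillna(False)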

print("% of missing value in Markdown1: ",len(featureset[featureset.MarkDown1.isnull() == True])/len(featureset) * 100,"%")
print("% of missing value in Markdown2: ",len(featureset[featureset.MarkDown2.isnull() == True])/len(featureset) * 100,"%")
print("% of missing value in Markdown3: ",len(featureset[featureset.MarkDown3.isnull() == True])/len(featureset) * 100,"%")
print("% of missing value in Markdown4: ",len(featureset[featureset.MarkDown4.isnull() == True])/len(featureset) * 100,"%")
print("% of missing value in Markdown5: ",len(featureset[featureset.MarkDown5.isnull() == True])/len(featureset) * 100,"%")
We are going to treat missing values in two ways (a sketch follows below):
  • If a column has more than 75% null values, we will drop it, as less than 25% of the information remains.
  • If a column has less than 75% null values, we will perform imputation, such as replacing with the mean value per store.
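A sketch of this rule applied to the markdown columns, along with the rolling-mean imputation mentioned earlier for CPI and Unemployment (the helper is illustrative; the threshold and window follow the description above):

markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']

for col in markdown_cols:
    null_frac = featureset[col].isnull().mean()
    if null_frac > 0.75:
        # too little information left in the column: drop it
        featureset = featureset.drop(columns=[col])
    else:
        # impute with the mean markdown value of the same store
        featureset[col] = featureset.groupby('Store')[col].transform(lambda s: s.fillna(s.mean()))

# CPI and Unemployment: fill gaps with a rolling mean of period 20 within each store
for col in ['CPI', 'Unemployment']:
    featureset[col] = featureset.groupby('Store')[col].transform(
        lambda s: s.fillna(s.rolling(window=20, min_periods=1).mean()))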

Data Preparation : Let’s pipeline our data cleaning and preprocessing !!

We are going to clean our featureset, train set and test set; impute missing values in the CPI and Unemployment columns with rolling averages and compute per-store means for the markdown columns; convert columns to appropriate datatypes; generate new rows for missing store, dept combinations with weekly sales as 0; merge all the datasets; and prepare the final processed datasets (a condensed sketch of the merge step follows).
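A condensed sketch of the merge step (the join keys are my assumption from the column descriptions; the project's pipeline also handles dtypes and the missing-row generation shown earlier):

# tag train and test, stack them, then attach store metadata and regional features
trainset['set_type'] = 'Train'
testset['set_type'] = 'Test'
trte_set = pd.concat([trainset, testset], ignore_index=True, sort=False)

trte_set = trte_set.merge(storeset, on='Store', how='left')
trte_set = trte_set.merge(featureset, on=['Store', 'Date', 'IsHoliday'], how='left')
trte_set['Date'] = pd.to_datetime(trte_set['Date'])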

o/p:

There are a total of 45 x 81 = 3,645 different store+department combinations, each with its own time-series pattern.

Note: we have 143 unique dates in the trainset and 39 unique dates in the test set. The trte_set has both the train and test points merged, hence there will be 182 total dates per store and department.

Duplicates?

duplicates = len(trte_set[trte_set.duplicated(subset=['Store', 'Dept', 'Date'])])

o/p: Does the dataset contain duplicate values: False

We performed sanity checks now and then to see if we can find all the store+department+date combos.

Analysis on Target Column Weekly Sales

### Let's check the distribution of the weekly sales
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = trte_set[trte_set['Weekly_Sales'] > 0]['Weekly_Sales']
sns.set_style('whitegrid')
sns.kdeplot(np.array(x))
plt.title('KDE plot')
plt.xlabel('Sales')
plt.show()

The distribution clearly follows a power law.

Let's perform a Box-Cox transformation and check if we can fit it to a Gaussian distribution.

### performing box-cox transformation to see if we can transform to gaussian
from scipy import stats
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(211)
## plot before box-cox
x = trte_set[trte_set['Weekly_Sales'] > 0]['Weekly_Sales']
prob = stats.probplot(x, dist=stats.norm, plot=ax1)
ax1.set_xlabel('')
## plot after box-cox
ax2 = fig.add_subplot(212)
xt, _ = stats.boxcox(x)
prob = stats.probplot(xt, dist=stats.norm, plot=ax2)
ax2.set_title('Probplot after Box-Cox transformation')
fig.tight_layout()
plt.show()

As we can clearly observe, the Weekly_Sales data follows a power-law distribution. After performing the Box-Cox transformation, the distribution is pushed towards a normal distribution to a good extent; however, we are not able to make it completely Gaussian.

Are there sales < 0, and store and department combinations with all-zero sales?

print('Number of rows which have negative sales reported == ', len(trte_set.loc[trte_set['Weekly_Sales'] < 0]))

o/p: Number of rows which have negative sales reported == 1285

Transforming negative sales to 0…

trte_set.loc[trte_set['Weekly_Sales'] < 0, 'Weekly_Sales'] = 0

### Check time series with all 0 weekly sales values!!
grp = trte_set.groupby(['Store', 'Dept'])['Weekly_Sales'].sum()
print('Number of time series with all zero sales == ', len(grp[grp == 0]))
store_dept_with_all_zeros_sales = list(grp[grp == 0].index)
o/p:
Number of time series with all zero sales == 322

We see there are 322 time series for which we do not have to make any effort to predict weekly sales, as the training data has all 0 values, so we can directly predict the future as 0.

Dropping the 322 store+ department combinations …..

## reset the index
trte_set = trte_set.reset_index(drop=True)
for i, j in store_dept_with_all_zeros_sales:
    trte_set.drop(trte_set[(trte_set.Store == i) & (trte_set.Dept == j)].index, inplace=True)
print('Total number of rows deleted == ', 182 * len(store_dept_with_all_zeros_sales))
print('Length of train and test after deletion ==', len(trte_set))

o/p:
Total number of rows deleted == 58604
Length of train and test after deletion == 604786

Univariate Analysis

Mean Sales over the years

## let's create two date features, week and year, for analysis
trte_set['week'] = trte_set['Date'].dt.week
trte_set['year'] = trte_set['Date'].dt.year
sales_per_week_year = trte_set.groupby(['Date', 'week', 'year'], as_index=False)['Weekly_Sales'].mean()

Observations: We can observe soaring peaks in sales during Thanksgiving week (Nov 26) and Christmas week (December 24). Holidays like the Super Bowl occur on the first Sunday of February, and Labour Day on the first Monday of September!!

Theory: It will be useful to include features like week number, day, and the week number within the month. Over the years 2010–2012 the time-series data followed a similar pattern.

Median Sales per Type

Observations: As observed from the pie chart, Type "A" > Type "B" > Type "C" in producing sales.

WeeklySales per Store

Observations: A few stores show higher mean sales and a higher interquartile range, indicating that some stores are bigger than others. Some stores have their 25th and 50th percentile values very close while the 75th percentile is far away, showing that there is not much difference in sales between the points below the 25th and 50th percentiles, with a sudden increase in sales above the median (possibly flier points showing a sudden increase in sales due to some external factor).

Theory: We can understand that the stores with higher weekly sales (1, 2, 4, 10, …) can be categorized as Type 'A' stores, those with medium sales as Type 'B' (11, 2, 19, 20, …), and the rest as Type 'C' (3, 5, 29, 30, …).

Weekly_Sales per Department

  • Observations: We calculated the mean sales per department and plotted them separately for readability. From this bar plot we can see that departments 2, 38 and 40 have greater mean sales compared to other departments, which shows that these departments have more useful products for the customer; people are more likely to buy products from these departments. For departments 72, 7 and 5 we observe high sales during a few weeks, far above their mean sales; there is a possibility that these departments supply customers with products needed during holiday weeks.
  • Theory: As a few departments account for very high relative mean sales, we can assume that department is a good predictor.

Weekly_Sales on Holidays

Observations: As we can see, outlier points on holidays are greater than on non-holidays. The mean sales/interquartile-range values for holiday and non-holiday weeks overlap, which doesn't make 'IsHoliday' a good predictor on its own; it indicates that only a few holiday weeks clearly increase sales values over non-holiday weeks.

Analysing Numerical features

Kde and scatter plot for numerical features

Observations: We can see that due to the higher number of missing values in the markdown columns, the distribution is right-skewed, i.e., there are relatively more data points at one value.

Note: We performed outlier detection for the numerical features by calculating percentile values at 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 and found that there are no outliers.

Multivariate Analysis

Type X Size

## Size * Type
fig = plt.figure(figsize=(8, 5))
sns.boxplot(data=trte_basic_feat, x='Type', y='Size')
plt.title('Type X Size', fontsize=18)
plt.show()

Observations :

Here we can see how A, B and C stores are clearly differentiated by their size. We already explored how Type A > B > C in sales; size here confirms that the higher-selling types are also the bigger stores.

Feature Engineering (Basic)

Let's create some useful date features like month, dayofweek, and Major_Holidays.
We observed that in our trainset IsHoliday is marked True for week 52 (supposedly our Christmas week), but there is no spike in sales there; week 51, the week before Christmas, is where the spike occurs. So let's create a categorical feature Major_Holidays which marks the holiday week of Christmas, Thanksgiving, Labour Day and the Super Bowl for every year. This will allow us to spot the sales spikes effectively. We will encode it as '0' for non-major holiday, '1' for Thanksgiving, '2' for Labour Day, '3' for Christmas and '4' for the Super Bowl.

Creating basic date features….
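A minimal sketch of these basic date features; the holiday week numbers below are illustrative assumptions based on the description above, not the project's exact mapping:

trte_set['month'] = trte_set['Date'].dt.month
trte_set['dayofweek'] = trte_set['Date'].dt.dayofweek
trte_set['day'] = trte_set['Date'].dt.day

def major_holiday(week):
    # 0: non-major holiday, 1: Thanksgiving, 2: Labour Day, 3: Christmas, 4: Super Bowl
    if week == 47:      # Thanksgiving week (assumed week number)
        return 1
    if week == 36:      # Labour Day week (assumed week number)
        return 2
    if week == 51:      # week before Christmas, where the sales spike actually occurs
        return 3
    if week == 6:       # Super Bowl week (assumed week number)
        return 4
    return 0

trte_set['major_holiday'] = trte_set['week'].apply(major_holiday)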

Data after adding Basic date features ……

Multivariate Analysis on Newly added features

Effect of Department on major_holiday

Observations: We can say that major_holiday is a good predictor of sales. There is a clear-cut effect of certain departments on holidays. major_hols = 3 indicates Christmas: we can see peaks in sales in the first 10–20 departments which were not observed on other holidays. Sales of a few departments also seem to have reduced during the holiday week compared to non-holiday weeks. We can also see that although Type 'B' stores are smaller than Type 'A' stores, during certain holidays Type B stores produced higher sales than Type A.

Effect of Stores on Major_holidays

Observations:

Similar to departments, we can see that stores also have a similar effect on sales during holidays. We can say that Christmas and Thanksgiving are the major holidays which lift the sales scale.

Effect of Month, Type and IsHoliday on Sales

Observations: Month & Holiday is also a good predictor of sales, as we can see that during months 9–12, on holiday weeks, there is a sales peak in Type A and B stores.

Sales During Quarters

Observations:

We see that mean sales are almost the same across all the quarters. Quarter 4 has flier points depicting a sudden peak in sales in some weeks.

Correlations

Correlation between numerical Features

Fuel_Price and year are highly correlated, hence we decide to drop Fuel_Price. MarkDown1 and MarkDown4 are correlated, so let's drop MarkDown4. Size has the strongest correlation (0.2) with Weekly_Sales. Fuel_Price and Temperature have very weak correlations with sales. Week is strongly correlated with month, dayofyear and quarter, so let's drop the dayofyear and month columns.

Correlation between Categorical and Numerical

ANOVA — Test

ANOVA is a statistical test which can be used to determine the association between a categorical and a continuous variable. The test performs an analysis of variance by comparing the mean of the continuous variable across the categories. Here the null hypothesis (H0) states that the variables are not associated (all category means are equal). If p_value < alpha (0.05), or equivalently F > F-critical, we reject the null hypothesis; otherwise we fail to reject it.

import statsmodels.api as sm
from statsmodels.formula.api import ols

def anova_table(aov):
    aov['mean_sq'] = aov[:]['sum_sq'] / aov[:]['df']            ## calculating mean_sq
    aov['eta_sq'] = aov[:-1]['sum_sq'] / sum(aov['sum_sq'])     ## variance explained: ss_effect / ss_total
    cols = ['sum_sq', 'df', 'mean_sq', 'F', 'PR(>F)', 'eta_sq']
    aov = aov[cols]
    return aov

def ols_(column):
    model = ols('Sales ~ C({})'.format(column), data=data).fit()
    aov_table = sm.stats.anova_lm(model, typ=2)
    return aov_table

for i in categorical_features:
    x = trte_basic_feat.loc[(trte_basic_feat.set_type == 'Train')][i]
    y = trte_basic_feat.loc[(trte_basic_feat.set_type == 'Train')]['Weekly_Sales']
    data = pd.DataFrame({i: x, 'Sales': y})
    aov_table = ols_(i)
    aov = anova_table(aov_table)
    print('Stats Summary for feature ', i, '*' * 20)
    print(aov)
    print('Variance explained (eta-squared) by the feature ', i, ' :', aov_table['eta_sq'][-2])
    print('\n\n')

P.S. Degree of association using eta-squared:

  • 0.001–0.009 — weak correlation
  • 0.01–0.2 — moderate correlation
  • 0.2–0.5 — good correlation

Observations :

Using the ANOVA test we checked for association between the categorical variables (Store, Dept, IsHoliday, etc.) and the numerical variable (sales). We see that the Department feature is well correlated with weekly sales, with an eta-squared of 0.54. The Store feature is moderately correlated (0.09) with weekly sales, while Type, major_holiday and IsHoliday show weak correlation.

Time-Series Analysis

Why do we perform Time-Series Analysis ?

Using time-series analysis we can get insight into how time adds weight in determining the value of an asset, commodity or variable of concern. In our usual machine learning analysis we check how the independent variables affect our dependent variable; order doesn't matter. In time-series analysis we check how past observations of our target variable vary with respect to time: the data is ordered in time, which creates a dependency between past and future observations.

Time-Series Components !!!

There are 4 major components of a time series:
Level: the average value of the series.
Trend: can be interpreted as a slope, the rate of change between two adjacent values. This is an optional component.
Seasonality: shows whether there is any cyclic behaviour in the data, i.e., repetition after a period of time.
Noise: uncertain variation which cannot be interpreted by the model.

Observations: We can clearly see that the data contains both upward and downward trends. Additive seasonality: there is roughly a 90% spike in sales at year-end every year, added on top of the level. The data is also quite noisy, which reflects uncertainty in sales in some periods. This is where smoothing techniques come in: we average out some past values, which eventually reduces the noise in the data.

Feature Engineering : (Advanced)

We are going to implement some averaging techniques for each of our store and department combinations as features (a sketch of these follows below).
Lag features: we use the shift() function from the pandas library. Here we consider lags based on the trend; the trend is yearly, so let's take 52 lags.
Rolling window: we use the rolling() function from pandas. Here we consider a constant window size (3) and take the average of the past values within that window.
Expanding window: we use the expanding() function from pandas. Here we consider an expanding window (minimum size 2) and take the average of all past values so far.
Exponentially weighted moving average: we use the ewm() function from pandas. Here recent values are given more weight, and the weights reduce exponentially as we move to older values.
Holt-Winters triple exponential smoothing: smoothing that takes the level, trend and seasonality into account.
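A minimal sketch of these rolling features, computed per Store and Dept group; the feature names match the list used later in the post, and the extra shift(1) (so that the current week's sales never leak into their own feature) is my assumption. The Holt-Winters feature is covered in the next section.

grp = trte_set.groupby(['Store', 'Dept'])['Weekly_Sales']

# 52-week lag: the same week one year earlier
trte_set['lag_52'] = grp.shift(52)

# rolling mean over a window of 3 past weeks
trte_set['rolling_mean'] = grp.transform(lambda s: s.shift(1).rolling(window=3).mean())

# expanding mean over the whole history so far
trte_set['expanding_mean'] = grp.transform(lambda s: s.shift(1).expanding(min_periods=2).mean())

# exponentially weighted moving averages with different decay rates
for a in [0.1, 0.4, 0.5, 0.7]:
    trte_set['EWM_' + str(a)] = grp.transform(lambda s: s.shift(1).ewm(alpha=a).mean())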

HOLT-WINTERS TRIPLE EXPONENTIAL SMOOTHING:

As our data contains additive seasonality, Holt-Winters triple exponential smoothing is apt for capturing the forecast. The values are smoothed at 3 levels, namely level, trend, and seasonality.

Info….

The additive Holt-Winters equations (level ℓ, trend b, seasonal component s, season length m, k = ⌊(h−1)/m⌋):

ŷ(t+h|t) = ℓ(t) + h·b(t) + s(t + h − m(k+1))
ℓ(t) = α·(y(t) − s(t−m)) + (1−α)·(ℓ(t−1) + b(t−1))
b(t) = β*·(ℓ(t) − ℓ(t−1)) + (1−β*)·b(t−1)
s(t) = γ·(y(t) − ℓ(t−1) − b(t−1)) + (1−γ)·s(t−m)

Ref info: https://grisha.org/blog/2016/02/17/triple-exponential-smoothing-forecasting-part-iii/

Hyperparameter tuning Alpha, Beta and Gamma Values

For triple exponential smoothening the minimum mape value was found to be  4.187637418634715
The best alpha,beta and gamma parameters for triple exponential smoothening having minimum MAPE value are (0.25, 0.1, 0.2)

The model was refit after hyperparameter tuning, and Holt's prediction was added as a feature to our training set.
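A hedged sketch of how one Store+Dept series can be smoothed with the tuned parameters using statsmodels (the actual feature-building loop iterates over every series and stores the fitted values in the holt_avg column):

from statsmodels.tsa.holtwinters import ExponentialSmoothing

def holt_winters_fitted(series, alpha=0.25, beta=0.1, gamma=0.2, season_length=52):
    # additive trend + additive yearly seasonality, with the tuned smoothing parameters
    model = ExponentialSmoothing(series, trend='add', seasonal='add',
                                 seasonal_periods=season_length)
    fit = model.fit(smoothing_level=alpha, smoothing_trend=beta, smoothing_seasonal=gamma)
    return fit.fittedvalues

# example: one time series (a single Store+Dept combination)
one_ts = (trte_set.loc[(trte_set.Store == 1) & (trte_set.Dept == 1)]
                  .set_index('Date')['Weekly_Sales'])
holt_avg = holt_winters_fitted(one_ts)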

Data set after adding Advanced features

Label encoding Categorical Features

trte_featured['IsHoliday'] = trte_featured['IsHoliday'].map({True:1,False:0})
trte_featured['Type'] = trte_featured['Type'].map({'A':1,"B":2,"C":3})
trte_basic_feat['IsHoliday'] = trte_basic_feat['IsHoliday'].map({True:1,False:0})
trte_basic_feat['Type'] = trte_featured['Type'].map({'A':1,"B":2,"C":3})
trainset_merged['IsHoliday'] = trainset_merged['IsHoliday'].map({True:1,False:0})
trainset_merged['Type'] = trainset_merged['Type'].map({'A':1,"B":2,"C":3})
testset_merged['IsHoliday'] = testset_merged['IsHoliday'].map({True:1,False:0})
testset_merged['Type'] = testset_merged['Type'].map({'A':1,"B":2,"C":3})

Feature Selection using Forward Feature Selection

We are going to run the random forest algorithm and see which combination of our basic features gives the least MAE.

## creating our input and target set
from datetime import datetime
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestRegressor

input_ = trte_basic_feat.loc[trte_basic_feat.set_type == 'Train']
input_['Date'] = input_.Date.map(datetime.toordinal)
input_ = input_.fillna(-1).reset_index(drop=True)
output = input_.Weekly_Sales
input_ = input_.drop(columns=['Weekly_Sales', 'set_type'], axis=1)

## fitting the model
seqfs = SFS(RandomForestRegressor(n_estimators=200, random_state=2),
            k_features='best', forward=True, floating=False, verbose=2,
            scoring='neg_mean_absolute_error', cv=0)
seqfs.fit(input_, output, custom_feature_names=input_.columns)

## storing the output in a dataframe
df = pd.DataFrame.from_dict(seqfs.get_metric_dict()).T
sorted_df = df[['feature_names', 'avg_score']].sort_values(by='avg_score', ascending=False).reset_index(drop=True)

Output of FFS:
The feature combination below gives the lowest mean absolute error:
('Store', 'Dept', 'Date', 'IsHoliday', 'Type', 'Size', 'week', 'month', 'day', 'major_holiday')

Observations: As we can see, certain combinations of features reduce model performance whereas others boost it. We just want to select the features that give the least MAE. We see that every feature combination that includes the MarkDown columns increases the error, so we want to remove them; remember, the MarkDown columns had more than 50% missing data.

Modelling

After performing EDA, data cleaning, data preprocessing, feature engineering and feature selection, we have prepared a dataset which is ready to be fed into different models; we will hyperparameter-tune them and compare their performances.

What are the different Regressor models we are going to build and compare??

1. Baseline model, 2. RandomForest, 3. ExtraTrees, 4. XGBoost, 5. RandomForest (trained with rolling features), 6. ExtraTrees (trained with rolling features), 7. XGBoost (trained with rolling features), 8. SARIMA, 9. FBProphet, 10. Stacking model

Temporal data Split — Ratio (60(train):20(cv):20(test))

### Let's do an 80:20 split of the trainset into (train and cv)
x_train = trte_featured.loc[(trte_featured.set_type == 'Train')]
x_train = x_train.drop(['set_type'], axis=1)
x_test = trte_featured.loc[(trte_featured.set_type == 'Test')]
x_test = x_test.drop(['set_type'], axis=1)
y_train = x_train['Weekly_Sales']

## creating our validation set
udates = x_train.Date.unique()
len_ = int(80 * len(udates) / 100)
tr_dates = sorted(udates)[:len_]
te_dates = sorted(udates)[len_:]
X_train = x_train.loc[(x_train.Date.isin(tr_dates))]
Y_train = X_train['Weekly_Sales']
X_cv = x_train.loc[(x_train.Date.isin(te_dates))]
Y_cv = X_cv['Weekly_Sales']
## convert all the n/a's to -1 and reset the indexes

Calculating the error metrics

from sklearn.metrics import mean_absolute_error, r2_score

def cal_wmae(holiday, true, predict):
    '''Calculate weighted mean absolute error with weight 5 for holiday weeks and 1 for non-holiday weeks'''
    err = []
    sum_ = 0
    for i in range(len(true)):
        if holiday[i]:
            err.append(abs(true[i] - predict[i]) * 5)
            sum_ += 5
        else:
            err.append(abs(true[i] - predict[i]))
            sum_ += 1
    return sum(err) / sum_

def calculate_r2_mae(true, predict):
    mae = mean_absolute_error(true, predict)
    r2 = r2_score(true, predict)
    return r2, mae

Baseline model

Using the baseline model we predict a future week's sales as the average of the sales in the same week in past years.

Example: If we have data from 2010–2012 and want to predict sales for week 35 of 2013, we take the average of sales in week 35 from 2010 to 2012.
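A minimal sketch of this baseline, assuming the week column created earlier (each future week is predicted as the mean of the same week number in the training years; unseen combinations fall back to 0):

# average sales per (Store, Dept, week-of-year) over all training years
baseline_lookup = (X_train.groupby(['Store', 'Dept', 'week'])['Weekly_Sales']
                          .mean()
                          .rename('baseline_pred')
                          .reset_index())

# look up the same keys for the validation weeks
baseline_cv_pred = (X_cv.merge(baseline_lookup, on=['Store', 'Dept', 'week'], how='left')
                        ['baseline_pred'].fillna(0).values)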

Model is fit with training data

Random Forest ,Extratrees and XGBoost

Let’s determine hyperparameters for our Tree models

## Hyperparameter tuning random forest
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=100, stop=500, num=6)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [10, 20, 27, 30]
# Minimum number of samples required to split a node
min_samples_split = [5, 12]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 6]
# Method of selecting samples for training each tree
bootstrap = [True, False]

randomforest_params = {'n_estimators': n_estimators,
                       'max_features': max_features,
                       'max_depth': max_depth,
                       'min_samples_split': min_samples_split,
                       'min_samples_leaf': min_samples_leaf,
                       'bootstrap': bootstrap}

# create xgboost params
xg_boost_params = {'n_estimators': n_estimators,
                   'max_depth': max_depth,
                   'subsample': [0.4, 0.5, 0.7],
                   'colsample_bytree': [0.4, 0.5, 0.7],
                   'learning_rate': [0.01, 0.1],
                   'min_child_weight': [5, 10, 20],
                   'objective': ['reg:squarederror']}

## create extratrees params
extra_tree_params = {'n_estimators': n_estimators,
                     'max_features': max_features,
                     'min_samples_leaf': min_samples_leaf,
                     'min_samples_split': min_samples_split}

Let’s consider just the top columns from the result of our Feature selection

top_cols_basic_feats = ['Store','Dept','IsHoliday','major_holiday','Size','week','Type','day','year']

Hyperparameter tuning our models using RandomizedSearchCV with cv=2

After tuning, we return the predictions after refitting the model with the best estimator (a sketch of this helper follows).
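run_tree_models is the helper used below; a hedged sketch of what it could look like, given the description (RandomizedSearchCV with cv=2, MAE scoring, refit on the best estimator, predictions returned for train and test):

from sklearn.model_selection import RandomizedSearchCV

def run_tree_models(x_train, x_test, y_train, estimator, param_grid, feature_cols, n_iter=10):
    '''Tune the estimator with RandomizedSearchCV (cv=2), refit the best model and return predictions.'''
    X_tr = x_train[feature_cols].fillna(-1)
    X_te = x_test[feature_cols].fillna(-1)

    search = RandomizedSearchCV(estimator, param_distributions=param_grid,
                                n_iter=n_iter, cv=2, scoring='neg_mean_absolute_error',
                                verbose=2, random_state=2, n_jobs=-1)
    search.fit(X_tr, y_train)

    best_model = search.best_estimator_   # refit on the full training data by default
    return best_model.predict(X_tr), best_model.predict(X_te)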

## Function calls 
rf_train_predict,rf_test_predict = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),RandomForestRegressor(),randomforest_params,top_cols_basic_feats)
et_train_predict,et_test_predict = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),
ExtraTreesRegressor(),extra_tree_params,top_cols_basic_feats)
xg_train_predict,xg_test_predict = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),xgb.XGBRegressor(), xg_boost_params ,top_cols_basic_feats)
Tuning logs of RF,XGB

RandomForest, ExtraTrees and XGBoost trained with rolling features…

Now we add the rolling feature columns to our top_cols and train the models.

rolling_features = ['rolling_mean', 'expanding_mean', 'EWM_0.1', 'EWM_0.4','EWM_0.5', 'EWM_0.7', 'lag_52', 'holt_avg']
top_cols_with_rolling_features = top_cols_basic_feats + rolling_features

We hyperparameter-tune our models in the same way using RandomizedSearchCV with cv=2 and save the predictions in the *_predict2 variables.

## Function calls 
rf_train_predict2,rf_test_predict2 = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),RandomForestRegressor(),randomforest_params,top_cols_with_rolling_features)
et_train_predict2,et_test_predict2 = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),
ExtraTreesRegressor(),extra_tree_params,top_cols_with_rolling_features)
xg_train_predict2,xg_test_predict2 = run_tree_models(x_train.copy(),x_test.copy(),y_train.copy(),xgb.XGBRegressor(), xg_boost_params ,top_cols_with_rolling_features)
Randomforest and Extratrees logs for prediction with rolling features

SARIMA

Introduction :

SARIMA is a powerful tool for predicting time series such as weekly sales. As the name suggests, it is useful for factoring in seasonality, unlike ARIMA. We saw in our EDA that our sales data has additive yearly seasonality. SARIMA is determined by a combination of the non-seasonal ARIMA terms (p, d, q) and the seasonal terms (P, D, Q, S).

Quick walk through of terms :

S(AR){P} — P is the number of seasonal autoregressive lags used to predict future values.
S(I){D} — D is the seasonal differencing order; differencing helps make the series stationary. E.g., if d = 1 then ŷ_t = y_t − y_(t−1); if d = 2 then ŷ_t = (y_t − y_(t−1)) − (y_(t−1) − y_(t−2)), i.e., the change of the change.
S(MA){Q} — Q is the number of lagged forecast errors.
S{S} — S is the season's length (e.g., 12 for monthly data, 4 for quarterly data; 52 for weekly data like ours).

D should not be more than 1 and d + D should not be more than 2; if d + D = 2, the constant term should be suppressed.

ADFuller test to check stationarity

from statsmodels.tsa.stattools import adfuller

print('Dickey-Fuller Test:')
result = adfuller(data)
adftest = pd.Series(result[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'Observations'])
adftest = np.round(adftest, 10)
for key, value in result[4].items():
    adftest["Critical Value (%s)" % key] = value.round(4)
print(adftest)

if adftest[1] < 0.05:
    print('\np-value lower than 0.05.\nWe reject H0 (unit root); the series seems to be stationary')
else:
    print("\np-value higher than 0.05.\nWe fail to reject H0; the series doesn't seem to be stationary")

We check for stationarity in the time series because ARIMA-family models rely on it: they assume that the statistical properties of the (suitably differenced) series do not change over time.

PACF (Partial Autocorrelation) and ACF (Autocorrelation) plots to determine p, d, q

ACF and PACF

Observations:
* ACF plot: lag 52 shows a high correlation with the predictor.
* PACF plot: lags 48 and 51 individually show high correlations; lag 1 is also useful.
* PACF plot: lag 47 is highly negatively correlated.
Theory:
* The ACF plot shows that the past 52 weeks of data cumulatively help predict future sales.
* The PACF plot shows that to predict a future value we can lag back 51 weeks.
* The seasonal component (S) should be 52.

Let's try training our model with different combinations of (p, d, q) values from 0 to 2, with S = 52, and determine the best parameters.

Tuning SARIMA to get the best (p, d, q) for each of the 3,323 time series.

Hyperparameter tuning SARIMA on CV(X_cv) data.

We tune the model for every store and department combination and store the best (p, d, q, S) for each in best_params as a tuple. The predictions keep getting appended to the final list as we loop through each store and department. Finally, we refit our model on x_train with the optimal parameters per store and department and predict on the test data.
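A hedged sketch of this per-series loop using statsmodels' SARIMAX; the small grid and the fixed seasonal order here are simplifying assumptions, and the real loop additionally stores the best parameters and forecasts for every Store+Dept:

import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

def tune_sarima(train_series, cv_series, s=52):
    '''Grid-search (p, d, q) in 0..1 for one Store+Dept series and return the best fit.'''
    best_mae, best_fit, best_order = float('inf'), None, None
    for p, d, q in itertools.product(range(2), repeat=3):
        try:
            fit = SARIMAX(train_series, order=(p, d, q),
                          seasonal_order=(0, 1, 0, s)).fit(disp=False)
            pred = fit.forecast(steps=len(cv_series))
            mae = np.mean(np.abs(np.asarray(cv_series) - np.asarray(pred)))
            if mae < best_mae:
                best_mae, best_fit, best_order = mae, fit, (p, d, q, s)
        except Exception:
            continue   # some (p, d, q) combinations fail to converge
    return best_fit, best_order, best_mae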

Sample prediction on a random(Store+Dept)

We can see that using SARIMA with seasonal component 52 we get a WMAE of 7853.77 on the CV data. On the test set the WMAE is 6551.97.

FBProphet

##### https://facebook.github.io/prophet/
Prophet is a procedure for forecasting time-series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Using Prophet we can also include holiday effects to influence our predictions.

Let's create our holiday dataframe

# References : 0: Not Holiday, 1: Thanksgiving, 2: Labour Day, 3: Christmas, 4: Superbowl
Thanksgiving = pd.DataFrame({'holiday': 'Thanksgiving',
                             'ds': pd.to_datetime(trte_featured.loc[(trte_featured.major_holiday == 1)].Date.unique())})
LabourDay = pd.DataFrame({'holiday': 'LabourDay',
                          'ds': pd.to_datetime(trte_featured.loc[(trte_featured.major_holiday == 2)].Date.unique())})
Christmas = pd.DataFrame({'holiday': 'Christmas',
                          'ds': pd.to_datetime(trte_featured.loc[(trte_featured.major_holiday == 3)].Date.unique())})
SuperBowl = pd.DataFrame({'holiday': 'SuperBowl',
                          'ds': pd.to_datetime(trte_featured.loc[(trte_featured.major_holiday == 4)].Date.unique())})
holidays = pd.concat([Thanksgiving, LabourDay, Christmas, SuperBowl])

df = pd.DataFrame({'holiday': ['Christmas', 'Thanksgiving', 'LabourDay'],
                   'ds': [pd.to_datetime('2013-12-27'), pd.to_datetime('2013-11-29'), pd.to_datetime('2013-11-29')]})
holidays = holidays.append(df, ignore_index=True)
holidays = holidays.sort_values(by='holiday').reset_index(drop=True)
holidays
holiday df

For each store and department run prophet and predict for test and train set
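A hedged sketch of the per-series Prophet call (Prophet expects columns named ds and y; the holiday dataframe built above is passed in, and the real loop repeats this for every Store+Dept and collects both train and test predictions):

from prophet import Prophet   # older installs: from fbprophet import Prophet

def prophet_forecast(train_df, future_dates, holidays):
    '''Fit Prophet on one Store+Dept series and forecast the given future dates.'''
    ts = train_df.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'})[['ds', 'y']]
    m = Prophet(holidays=holidays, yearly_seasonality=True,
                weekly_seasonality=False, daily_seasonality=False)
    m.fit(ts)
    future = pd.DataFrame({'ds': pd.to_datetime(future_dates)})
    return m.predict(future)['yhat'].values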

Prophet prediction for store 10 and department 1

Observations: We see that the predicted values overlap well with our actual values. As we included the holiday dates of Christmas, the Super Bowl, Labour Day and Thanksgiving in our Prophet model, we can see good spikes in predicted sales during these weeks.

ERROR METRICS

Let's check out the performance of our models. As part of our modelling process, we have conventional time-series models (SARIMA, Holt-Winters, Prophet) which use past sales values for prediction. On the other hand, we have ML tree-based algorithms that predict sales using various external factors like week, IsHoliday, year, day, major_holiday, Size and Type.

Note: Prophet also takes into account if the date is a holiday date

Observations: We see that the model which performs best on test data is RandomForest with a WMAE of 2734.01, followed by the Prophet model with a WMAE of 2851 on test data.

Theory: Given the top two models(RandomForest&Prophet) with relatively least WMAE’s, let’s try stacking the models and see if the WMAE reduces further.

Stacking Models

When model ambiguity is high, the quality of prediction diminishes. One effective way to make a robust predictive model is to form an agreement between many models.

By averaging across models we can even out overestimation and underestimation. Let's consider our top two models, RandomForest and Prophet. As we already have our Prophet and RandomForest predictions trained with the best hyperparameters, we don't need to train them again.

rf_prophet_pred_test = np.column_stack([prophet_test_pred,rf_test_predict])
stacking_predictions_test = np.mean(rf_prophet_pred_test,axis=1)
rf_prophet_pred_train = np.column_stack([prophet_train_pred,rf_train_predict])
stacking_predictions_train = np.mean(rf_prophet_pred_train,axis=1)

As we can see, our stacked model further reduced the weighted mean absolute error to 2627.861, with a Kaggle rank of 20.

Best model determined: Stacked model (RandomForest & Prophet)

Plot all model Predictions of Store1 and Dept1

Observations:

We can see that all model predictions roughly overlap and capture the spike in sales during the holiday weeks well. Models like Holt-Winters and the RF, ET and XGB variants trained with rolling features show a downward drift in their predictions as we go further into the future weeks, and do not follow the steady trend of the past weeks. However, the other good predictor models (RF, Prophet, ExtraTrees, XGBoost) follow a consistent path.

Model Deployment in AWS

After determining the best model to predict weekly sales, we store the pre-trained model weights; in our case, we store the RandomForest and Prophet models in a pickle file. Then we create an API that takes a single input vector of store, department, date and IsHoliday from the HTML front end, internally predicts using the pre-trained models, and displays the output on the front end. We use Flask to serve the webpage. If we want other users around the world to access our webpage, we deploy it on an AWS EC2 cloud server.
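A minimal Flask sketch of such an endpoint; the file names, form fields and the build_feature_vector helper are illustrative assumptions, not the exact deployed code:

import pickle
from flask import Flask, request, render_template

app = Flask(__name__)

# hypothetical pickle file holding the pre-trained model
with open('random_forest.pkl', 'rb') as f:
    rf_model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # assumed form fields coming from the HTML front end
    store = int(request.form['store'])
    dept = int(request.form['dept'])
    date = request.form['date']
    is_holiday = request.form.get('is_holiday') == 'true'

    features = build_feature_vector(store, dept, date, is_holiday)   # hypothetical helper
    prediction = rf_model.predict([features])[0]
    return render_template('result.html', weekly_sales=round(prediction, 2))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)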

Check below video…

FutureWork :

  1. We dealt with multivariate time-series prediction (with ~3,300 time series) by looping over them with Prophet. In the future there may be cases where we want to predict sales of thousands or millions of products, and there might be that many time series; we cannot generalize with a separate fit for each. In that situation we can use Amazon SageMaker's DeepAR model, which fits a single model across multiple time series.
  2. Using deep-learning RNN methods like LSTM we can capture more of the internal nuances of the data.

References :

My Github : https://github.com/ayswarya-sundararaman

This is my first Machine learning blog !! Thank you for your time :):):)….
