Walmart Sales Forecasting
Simple model averaging can improve the performance and accuracy of a prediction problem (here, sales) without deep feature engineering.
Introduction
Predicting future sales is one of the most important aspects of a company's strategic planning. Walmart is a good example for a beginner to work with because it provides a large retail data set. Walmart also used this sales prediction problem for recruitment purposes.
The data collected ranges from 2010 to 2012, where 45 Walmart stores across the country were included in this analysis. Each store contains several departments, and we are tasked with predicting the department-wide sales for each store. It is important to note that we also have external data available like CPI, Unemployment Rate and Fuel Prices in the region of each store which, hopefully, helps us to make a more detailed analysis.
Dataset Overview
This data set is available on the Kaggle website. It contains information about the stores, departments, temperature, unemployment, CPI, IsHoliday, and MarkDowns.
Stores :
Store: The store number. Ranges from 1 to 45.
Type: Three types of stores ‘A’, ‘B’ or ‘C’.
Size: The size of a store, measured by the number of products available in that store, ranging from 34,000 to 210,000.
Features:
Temperature: Temperature of the region during that week.
Fuel_Price: Fuel Price in that region during that week.
MarkDown1–5: The type of markdown and the quantity that was available during that week.
CPI: Consumer Price Index during that week.
Unemployment: The unemployment rate during that week in the region of the store.
Sales:
Date: The date of the week where this observation was taken.
Weekly_Sales: The sales recorded during that Week.
Dept: One of 1–99 that shows the department.
IsHoliday: a Boolean value representing a holiday week or not.
In total we have 421570 rows for training and 115064 for testing as part of the competition. We will work only with the 421570 training rows, since they have labels against which we can test the performance and accuracy of the models.
Data manipulation
- Checking for null values
feat.isnull().sum()
As there are only a few NaN values for CPI and Unemployment, we fill them with their respective column means. As the MarkDown columns have many more missing values, we impute zeros in the missing places.
# Series.mean() skips NaN values, so the fill value is the mean of the observed entries
# (statistics.mean would return NaN when the column contains NaN)
feat['CPI'] = feat['CPI'].fillna(feat['CPI'].mean())
feat['Unemployment'] = feat['Unemployment'].fillna(feat['Unemployment'].mean())
feat['MarkDown1'] = feat['MarkDown1'].fillna(0)
feat['MarkDown2'] = feat['MarkDown2'].fillna(0)
feat['MarkDown3'] = feat['MarkDown3'].fillna(0)
feat['MarkDown4'] = feat['MarkDown4'].fillna(0)
feat['MarkDown5'] = feat['MarkDown5'].fillna(0)
Merging (adding) all features with the training data
new_data = pd.merge(feat, data, on=['Store','Date','IsHoliday'], how='inner')
# merging (adding) all stores info with new training data
final_data = pd.merge(new_data,stores,how='inner',on=['Store'])
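A quick way to sanity-check a merge like this is to compare row counts before and after. The sketch below uses small made-up frames (the values are illustrative only, not taken from the Walmart data) that share the same key columns; the `_toy` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the features and sales tables (values are made up)
feat_toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "IsHoliday": [False, True, False],
    "CPI": [211.1, 211.2, 210.8],
})
sales_toy = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Dept": [1, 2, 1, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-05"],
    "IsHoliday": [False, False, True, False],
    "Weekly_Sales": [24924.5, 50605.3, 46039.5, 35034.1],
})

# Inner merge on the shared keys: each sales row picks up its store-week features
merged_toy = pd.merge(feat_toy, sales_toy, on=["Store", "Date", "IsHoliday"], how="inner")

# If every sales row matched exactly one feature row, the row count is unchanged
rows_preserved = len(merged_toy) == len(sales_toy)
print(merged_toy.shape, rows_preserved)
```

If the row count drops after an inner merge, some keys had no match in the other table, which is worth investigating before training.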
As the data is a time series, we sort it in ascending date order so that the model can work from the historical data.
Any metric that is measured over regular time intervals forms a time series. Time series analysis is commercially important because of industrial need and relevance, especially with respect to forecasting.
# sorting data with respect to date
final_data = final_data.sort_values(by='Date')
The dimensions of this manipulated dataset are (421570, 16).
Exploratory Data Analysis
There are a total of 3 types of stores: Type A, Type B and Type C.
There are 45 stores in total.
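# number of stores of each type
# (assumption: `grouped` is the stores table grouped by Type; its definition is not shown in the original)
grouped = stores.groupby('Type')
sizes = grouped.count()['Size'].round(1)
print(sizes)
labels = 'A store','B store','C store'
# share of each store type as a percentage of the 45 stores
sizes = [(22/(45))*100,(17/(45))*100,(6/(45))*100]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()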
# boxplot for sizes of types of stores
store_type = pd.concat([stores['Type'], stores['Size']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='Type', y='Size', data=store_type)
- From the boxplot and the pie chart, we can say that type A stores are the largest and type C stores are the smallest.
- There is no overlap in size among A, B, and C.
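Boxplot of weekly sales for different types of stores:
# use the merged table so that each weekly sales row is paired with its own store's Type
# (pd.concat of stores['Type'] and data['Weekly_Sales'] would align the two by row index instead)
store_sale = final_data[['Type', 'Weekly_Sales']]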
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='Type', y='Weekly_Sales', data=store_sale, showfliers=False)
- The median of A is the highest and that of C is the lowest, i.e. larger stores have higher sales.
Sales in holiday weeks are slightly higher than sales in non-holiday weeks.
# number of weekly sales records in non-holiday and holiday weeks
print('sales on non-holiday : ',data[data['IsHoliday']==False]['Weekly_Sales'].count().round(1))
print('sales on holiday : ',data[data['IsHoliday']==True]['Weekly_Sales'].count().round(1))
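The counts above show how many records fall into each group. To compare the level of sales, one option is to look at the average weekly sales per group. A minimal sketch on made-up numbers (the `toy` frame is hypothetical, not the Walmart data):

```python
import pandas as pd

# Made-up records: three non-holiday weeks and two holiday weeks
toy = pd.DataFrame({
    "IsHoliday": [False, False, False, True, True],
    "Weekly_Sales": [100.0, 200.0, 300.0, 250.0, 350.0],
})

# Boolean masks select the holiday and non-holiday rows
holiday_mean = toy.loc[toy["IsHoliday"], "Weekly_Sales"].mean()
non_holiday_mean = toy.loc[~toy["IsHoliday"], "Weekly_Sales"].mean()

print("mean weekly sales, holiday weeks     :", holiday_mean)
print("mean weekly sales, non-holiday weeks :", non_holiday_mean)
```

The same two lines applied to the real sales table would give the comparison that the statement above refers to.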
Correlations among features :
Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of relationship, the value of the correlation coefficient varies between +1 and -1.
A value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a + sign indicates a positive relationship and a - sign indicates a negative relationship. Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation, and Spearman correlation. The graph below will give you an idea about correlation.
# Plotting correlation between all important features
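# numeric_only=True restricts the matrix to numeric columns (Date and Type are not numeric)
corr = final_data.corr(numeric_only=True)
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True)
plt.show()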
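To make the difference between two of these coefficients concrete, here is a small hand-rolled sketch of Pearson and Spearman correlation using NumPy. The helper names (`pearson`, `ranks`, `spearman`) are illustrative, and the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # increases with x, but not linearly

p = pearson(x, y)
s = spearman(x, y)
print(round(p, 3), round(s, 3))
```

Because `y` always increases with `x`, the rank-based Spearman coefficient is 1, while Pearson is slightly below 1 since the relationship is not a straight line.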
Splitting Date into features
# Add column for year
final_data["Year"] = pd.to_datetime(final_data["Date"], format="%Y-%m-%d").dt.year
final_test_data["Year"] = pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d").dt.year
# Add column for day
final_data["Day"] = pd.to_datetime(final_data["Date"], format="%Y-%m-%d").dt.day
final_test_data["Day"] = pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d").dt.day
# Add column for days to next Christmas
# (note: as written, this counts the days from each date to December 31 of the same year)
final_data["Days to Next Christmas"] = (pd.to_datetime(final_data["Year"].astype(str)+"-12-31", format="%Y-%m-%d") -
                                        pd.to_datetime(final_data["Date"], format="%Y-%m-%d")).dt.days.astype(int)
final_test_data["Days to Next Christmas"] = (pd.to_datetime(final_test_data["Year"].astype(str) + "-12-31", format="%Y-%m-%d") -
                                             pd.to_datetime(final_test_data["Date"], format="%Y-%m-%d")).dt.days.astype(int)
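The code above measures the distance to December 31 of the same year. If the distance to December 25 itself is wanted, one possible variant is sketched below. The helper `days_to_next_christmas` is hypothetical and not part of the original notebook; it rolls over to the following year for dates after Christmas.

```python
from datetime import date

def days_to_next_christmas(d):
    """Days from d to the next December 25 (0 on Christmas Day itself)."""
    christmas = date(d.year, 12, 25)
    if d > christmas:
        # Christmas has already passed this year, so count to next year's
        christmas = date(d.year + 1, 12, 25)
    return (christmas - d).days

print(days_to_next_christmas(date(2010, 2, 5)))
print(days_to_next_christmas(date(2010, 12, 31)))
```

Applied to a pandas column, the same function could be used through `.apply` on the parsed dates; whether this variant helps the models is something to test rather than assume.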
Splitting Store type into categorical features
We have 3 types of stores (A, B and C), which are categorical. We therefore split the type into one feature per category using one-hot encoding.
tp = pd.get_dummies(X.Type)
X = pd.concat([X, tp], axis=1)
X = X.drop(columns='Type')
Therefore we have a total of 15 features:
- Store
- Temperature
- Fuel_Price
- CPI
- Unemployment
- Dept
- Size
- IsHoliday
- MarkDown3
- Year
- Day
- Days to Next Christmas
- A, B, C
Building train-test set
Splitting the final data into train and test sets. We kept 80% of the data for training and 20% for testing.
# train-test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split( X, y, test_size=0.20, random_state=0)
Out of 421570 rows, the training data consists of 337256 and the test data of 84314, with a total of 15 features.
Machine Learning Models
We are going to use different models and compare their accuracy, and will finally train on the whole data set to check the score against the Kaggle competition.
Standardizing train and test data :
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
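What the scaler does can be written out by hand: the mean and standard deviation are learned from the training set only, and the same values are then applied to the test set. A minimal NumPy sketch on made-up numbers (the arrays are illustrative, not the Walmart features):

```python
import numpy as np

# Made-up data: 3 training rows and 1 test row, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are computed on the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

# The same mu and sigma are applied to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training set alone keeps information about the test rows out of the preprocessing step, which is why `fit_transform` is called on `X_train` and only `transform` on `X_test`.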
1) KNN Regressor
Out of all the machine learning algorithms I have come across, KNN has easily been the simplest to pick up. KNN can be used for both classification and regression problems. The algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set.
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=10,n_jobs=4)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
Accuracy KNNRegressor: 56.78497373157646 %
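The "feature similarity" idea can be sketched in a few lines: find the k training points closest to the new point and average their target values. This is a simplified illustration with Euclidean distance and uniform weights, not scikit-learn's implementation; `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the target of x_new as the mean target of its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return float(y_train[nearest].mean())

# Made-up one-feature training set
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])

pred = knn_predict(X_toy, y_toy, np.array([1.2]), k=3)
print(pred)
```

The far-away point at 10.0 has no influence on the prediction here, because it is not among the three nearest neighbours of 1.2.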
2) Decision Tree Regressor
A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
accuracy DTR: 96.20101070234142 %
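How a regression tree chooses a decision node can be illustrated on a single numeric feature: try each candidate threshold and keep the one that leaves the smallest squared error around the mean of each side. This is a simplified sketch of the idea, not scikit-learn's implementation; `best_split` and the toy arrays are hypothetical.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split x <= threshold with the smallest summed squared error."""
    best_t, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = float(t), float(sse)
    return best_t, best_sse

# Made-up data with two clearly separated groups
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([5.0, 6.0, 7.0, 50.0, 51.0, 52.0])

threshold, sse = best_split(x_toy, y_toy)
print(threshold, sse)
```

A full tree repeats this search over all features and then recurses into each side until a stopping rule (such as a maximum depth) is reached; each leaf predicts the mean target of the rows that end up in it.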
3) Random Forest Regressor
Random forest is a bagging technique, not a boosting technique: the trees in a random forest are built independently of each other and can be trained in parallel. It operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The number of features considered for a split at each node is limited to some fraction of the total, and this fraction is a hyperparameter.
# After hyper-parameter tuning
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 400,max_depth=15,n_jobs=5)
rfr.fit(X_train,y_train)
y_pred=rfr.predict(X_test)
accuracy RandomForestRegressor: 96.56933672047487 %
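The bagging idea can be sketched with very small trees: fit each one on a bootstrap sample (rows drawn with replacement) and average the predictions. The sketch below uses depth-1 trees on one synthetic feature and leaves out the per-node feature subsampling that a real random forest adds; all names and data are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    best = (None, y.mean(), y.mean())
    best_sse = ((y - y.mean()) ** 2).sum()
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, left.mean(), right.mean()), sse
    return best

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    if t is None:                      # no useful split was found
        return np.full(len(x), left_mean)
    return np.where(x <= t, left_mean, right_mean)

# Synthetic data: a step from 1.0 to 3.0 at x = 5, plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, size=200)

stumps = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))   # bootstrap sample: rows drawn with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

# Regression forests average the individual tree predictions
x_new = np.array([2.0, 8.0])
bagged_pred = np.mean([predict_stump(s, x_new) for s in stumps], axis=0)
print(bagged_pred)
```

Since each tree is fitted independently of the others, the loop could run in parallel, which is what the `n_jobs` argument controls in scikit-learn.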
4) XGBRegressor
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It handles sparse data, and it uses a distributed weighted quantile sketch algorithm to handle weighted data effectively. For faster computing, XGBoost can make use of multiple cores on the CPU. This is possible because of a block structure in its system design: data is sorted and stored in in-memory units called blocks. The hyperparameters set here are objective, n_estimators, max_depth and learning_rate.
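from xgboost import XGBRegressor
# 'reg:squarederror' is the current name of the squared-error objective (formerly 'reg:linear')
xgb_clf = XGBRegressor(objective='reg:squarederror', nthread= 4, n_estimators= 500, max_depth= 6, learning_rate= 0.5)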
xb = xgb_clf.fit(X_train,y_train)
y_pred=xgb_clf.predict(X_test)
accuracy XGBRegressor: 97.21754267971075 %
5) ExtraTreesRegressor
The Extra-Tree method (standing for extremely randomized trees) was proposed with the main objective of further randomizing tree building in the context of numerical input features, where the choice of the optimal cut-point is responsible for a large proportion of the variance of the induced tree. With respect to random forests, the method drops the idea of using bootstrap copies of the learning sample, and instead of trying to find an optimal cut-point for each one of the K randomly chosen features at each node, it selects a cut-point at random.
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor(n_estimators=30,n_jobs=4)
etr.fit(X_train,y_train)
y_pred=etr.predict(X_test)
Accuracy ExtraTreesRegressor: 96.40934076228986 %
All model Comparison :
from prettytable import PrettyTable
x = PrettyTable()
x.field_names = ["Model", "MAE", "RMSE", "Accuracy"]
x.add_row(["Linear Regression (Baseline)", 14566, 21767, 8.89])
x.add_row(["KNNRegressor", 8769, 14991, 56.87])
x.add_row(["DecisionTreeRegressor", 2375, 7490, 96.02])
x.add_row(["RandomForestRegressor", 1854, 5785, 96.56])
x.add_row(["ExtraTreesRegressor", 1887, 5684, 96.42])
x.add_row(["XGBRegressor", 2291, 5205, 97.23])
print(x)
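The article does not show how the MAE, RMSE and accuracy columns were computed. The sketch below shows the standard definitions of MAE and RMSE, plus the R² score, which is what scikit-learn regressors return from `score`. Treating "accuracy" as R² times 100 is an assumption made here, not something the source confirms, and the arrays are made up.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R2 as a percentage (assumed meaning of 'accuracy'):", 100 * r2(y_true, y_pred))
```

RMSE penalises large errors more heavily than MAE, which is why a model can rank differently on the two columns, as XGBRegressor and RandomForestRegressor do in the table.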
Getting averages of best models
The trick is to average the predictions of the top n models. The top n models are chosen by their accuracy and RMSE. Here we have taken 4 models, as their accuracies are above 95%. The models are DecisionTreeRegressor, RandomForestRegressor, XGBRegressor and ExtraTreesRegressor.
Note that just taking the top models doesn't mean they are not overfitting. This can be checked using RMSE or MAE. In the case of a classification problem, we can use the confusion matrix. Also, there should not be much difference between test accuracy and train accuracy.
# training top n models
dt = DecisionTreeRegressor(random_state=0)
etr = ExtraTreesRegressor(n_estimators=30,n_jobs=4)
# 'reg:squarederror' is the current name of the squared-error objective (formerly 'reg:linear')
xgb_clf = XGBRegressor(objective='reg:squarederror', nthread= 4, n_estimators= 500, max_depth= 6, learning_rate= 0.5)
rfr = RandomForestRegressor(n_estimators = 400,max_depth=15,n_jobs=4)
dt.fit(X_train,y_train)
etr.fit(X_train,y_train)
xgb_clf.fit(X_train,y_train)
rfr.fit(X_train,y_train)
# predicting on test data
etr_pred=etr.predict(X_test)
xgb_clf_pred=xgb_clf.predict(X_test)
rfr_pred=rfr.predict(X_test)
dt_pred = dt.predict(X_test)
Getting averages of models :
final = (etr_pred + xgb_clf_pred + rfr_pred + dt_pred)/4.0
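Why averaging can help is easiest to see on synthetic data. The sketch below simulates four hypothetical models whose errors are independent of each other, which is an idealised assumption, and compares the RMSE of each model with the RMSE of their average.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, size=5000)

# Four hypothetical models: truth plus independent noise with standard deviation 10
preds = [y_true + rng.normal(0, 10, size=y_true.size) for _ in range(4)]

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

single_rmses = [rmse(y_true, p) for p in preds]

# Simple average of the four prediction vectors
avg_pred = sum(preds) / 4.0
avg_rmse = rmse(y_true, avg_pred)

print("single models:", [round(r, 2) for r in single_rmses])
print("average      :", round(avg_rmse, 2))
```

With fully independent errors, averaging four models roughly halves the RMSE. Models trained on the same data usually make correlated errors, so the gain on a real problem is typically smaller than in this simulation.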
Conclusion
Here we can see that our RMSE is reduced in comparison to our best-performing single model, i.e. XGBRegressor, with an RMSE of 3804. Hence we can conclude that averaging the top n models helps in reducing the loss.
As the data available here is small, the difference in loss is not extraordinary. On much larger datasets, with sizes in gigabytes or terabytes, this trick of simple averaging may reduce the loss to a greater extent.
Kaggle Score
Training on the whole training data without a train-test split, and then predicting on the future data provided by Kaggle, gives a score in the range of 3000 without deep feature engineering or rigorous hyperparameter tuning.
Future Work
- Splitting the date feature into days, months and weeks.
- The dataset includes special occasions, i.e. Christmas, pre-Christmas, Black Friday, Labor Day, etc. On these days people tend to shop more than on usual days, so adding them as features may also improve accuracy considerably.
- There is also a gap of missing values between the training data and the test data for 2 features, i.e. CPI and Unemployment. If that gap is reduced, performance can also be improved.