WALMART SALES FORECAST

ishika chatterjee · Published in Analytics Vidhya · 11 min read · Oct 13, 2019

I have worked through this Kaggle problem as a self case study. The full description and data of the case study can be found at this link: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

The entire dataset is open source and downloadable from the same link.

I will discuss and explain the whole solution from my ipynb, step by step. So, let's start with the data description part.

DATA DESCRIPTION:

The first step towards the solution of any problem is to understand the data thoroughly: what the data is about and what each field means.

If the data is not understood, then it is just impossible to solve the problem.

FORMAL DESCRIPTION OF DATA:

The data contains historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and I am tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  • Store — the store number
  • Dept — the department number
  • Date — the week
  • Weekly_Sales — sales for the given department in the given store
  • IsHoliday — whether the week is a special holiday week

test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store — the store number
  • Date — the week
  • Temperature — average temperature in the region
  • Fuel_Price — cost of fuel in the region
  • MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI — the consumer price index
  • Unemployment — the unemployment rate
  • IsHoliday — whether the week is a special holiday week

For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13


PERFORMANCE METRIC:
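
Submissions are scored with the weighted mean absolute error, WMAE = sum(w * |y - y_hat|) / sum(w), where the weight w is 5 for the holiday weeks described above and 1 for all other weeks. A minimal sketch of the metric (my own implementation, not the official scorer):

import numpy as np

def wmae(y_true, y_pred, is_holiday):
    # Holiday weeks are weighted five times higher, per the competition description
    weights = np.where(is_holiday, 5, 1)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)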

PROBLEM STATEMENT:

The main task I am assigned is to predict the weekly sales given the dataset. As the problem itself makes clear, this is a regression problem: we have to use regression models in order to predict the sales from the dataset.

SOLUTION:

My task here is to solve this assignment, and I will try every possible approach to solve the problem to the greatest extent.

Starting with the solution part: first I downloaded the entire data and then loaded it in my ipynb notebook.

I read the individual files from the data and printed their columns. I did this to understand the type of data that was supplied to me. As my core problem is to design an approach that predicts Walmart's sales given the above dataset, it is really very important to understand each feature I am supplied with.
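
A minimal sketch of this loading step, assuming the Kaggle CSVs sit in the working directory:

import pandas as pd

train = pd.read_csv('train.csv', parse_dates=['Date'])
test = pd.read_csv('test.csv', parse_dates=['Date'])
stores = pd.read_csv('stores.csv')
features = pd.read_csv('features.csv', parse_dates=['Date'])

# Print the columns of each file to see what we are working with
for name, df in [('train', train), ('test', test), ('stores', stores), ('features', features)]:
    print(name, list(df.columns))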

After understanding the columns, I merged the datasets, since the data was split into several files. To get proper insight into the data I had to bring it into one frame.
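
The join keys follow from the file descriptions above: stores.csv joins on Store, and features.csv joins on Store, Date and IsHoliday. A sketch of the merge (my column choices, not necessarily the notebook's exact code):

merged = train.merge(stores, on='Store', how='left')
merged = merged.merge(features, on=['Store', 'Date', 'IsHoliday'], how='left')
print(merged.shape)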

Here I can see that the final dataset is prepared. But in MarkDown1 and the other markdown columns there are a lot of NaN values, so it was not easy to understand which features could be used for designing a machine learning algorithm. So I had to opt for Exploratory Data Analysis, which is basically a technique used to understand the behaviour of the data statistically, and to see how each and every feature behaves, using certain plotting tools. I will now discuss the exploratory data analysis part thoroughly.

EXPLORATORY DATA ANALYSIS

After going through the data I decided to analyse it first. The main perspectives of the analysis are:

  • Sales can be responsive to a time factor and a space factor.
  • A store's sales record is the aggregation of its departments.
  • The Date variable can be split into year/month/week/day variables; the day variable can provide much information on sales.
  • Outside data such as US national holidays can be combined to add information.

As we already know, in machine learning problem solving we have to try all the possible ways to solve the problem, even though 90 percent of our experiments can be in vain. But we still have to try hard for an approximate solution.

The first thing that I did was to check the unique values of Store and Type.

I found that there are 45 unique store values, i.e. 45 different stores are present, and three unique store types in the supplied data.
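
A quick sketch of this check:

print(stores['Store'].nunique())   # 45
print(stores['Type'].unique())     # ['A' 'B' 'C']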

Next I decided to look at the behaviour of the data with respect to Type.

I used groupby to see the behaviour of the data, including count, mean, standard deviation, minimum sales value, maximum sales value and all the percentile values (quantiles). This gives an idea of how the sales behave in each type.
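
In pandas this is one call on the merged frame (a sketch under the column names above):

print(merged.groupby('Type')['Weekly_Sales'].describe())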

Next I found the proportion of the data in each type and showed it in a pie-chart view.
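
A sketch of the pie chart, assuming the shares are of total weekly sales per type:

import matplotlib.pyplot as plt

merged.groupby('Type')['Weekly_Sales'].sum().plot.pie(autopct='%1.1f%%')
plt.show()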

What we can see in this pie chart is that about 48.9% of the sales belong to type A stores, about 37.8% to type B stores, and 13.3% to type C stores.

In order to understand the sizes of the stores in each of the types A, B and C, I used a box plot.
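
A sketch with seaborn (one of several ways to draw this):

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='Type', y='Size', data=stores)
plt.show()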

What we can find here is that type A stores are the largest and type C the smallest. Even more, there is no overlapping area in size among A, B and C, so Type is the best predictor for Size.

Next I tried to visualize the weekly sales of each of the types A, B and C.
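
The same kind of box plot works here (hiding the outliers to keep the boxes readable is my choice):

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='Type', y='Weekly_Sales', data=merged, showfliers=False)
plt.show()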

What I can see is that the data overlaps too much. The median of A is the highest and that of C the lowest, which means that stores with larger sizes have higher sales records (the order of the medians of size and of sales is the same).

Next I wanted to get insight into the training data. I merged the train and stores data and then used a scatter plot of the size against the weekly sales.
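
A sketch of that scatter plot (the low alpha, my choice, keeps the dense cloud readable):

import matplotlib.pyplot as plt

merged.plot.scatter(x='Size', y='Weekly_Sales', alpha=0.1)
plt.show()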

This result is not convincing; I could not get any proper information from this plot.

Next I made a scatter plot between weekly sales and size for each of the types A, B and C.

As we can see, the weekly sales of types A, B and C overlap too much, so we cannot come to any proper conclusion from this.

So, let's continue with the EDA part, as we have not found anything decisive from the above plots. We have to dig deeper to get insights into the data.

Next I tried to understand the data through Store, Weekly_Sales and Type of the training data.

Next, I looked for any kind of relation among Store, Weekly_Sales and IsHoliday.

Next, I looked for the relation between Dept, Weekly_Sales and IsHoliday.
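
One way to eyeball this relation is holiday versus non-holiday mean sales per department (a sketch, not necessarily the notebook's exact plot):

import matplotlib.pyplot as plt

merged.groupby(['Dept', 'IsHoliday'])['Weekly_Sales'].mean().unstack().plot(figsize=(14, 4))
plt.show()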

Unlike the store-holiday relation, department and holiday do not show a clear pattern. Department 72 shows the highest surge in sales during holidays, but others don't, and in some departments the non-holiday sales are even higher. That means the character of the product (department) gives it a different relation with sales.

Next I checked whether the weekly sales data follows a Gaussian distribution, because if it does, we can predict some other properties about weekly sales as well.
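
A standard check is a normal probability plot: if the points fall on a straight line, the data is roughly Gaussian. A sketch with scipy:

import scipy.stats as stats
import matplotlib.pyplot as plt

stats.probplot(merged['Weekly_Sales'], dist='norm', plot=plt)
plt.show()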

The values in Weekly_Sales do not follow a straight line, so we can conclude that weekly sales do not have a Gaussian distribution.

Next I found the range of values in weekly sales, using the logarithm of the actual value to get an insight into the data.
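
A sketch of the log-scale look (I use log1p and clip at zero, since the dataset contains some negative sales records; that cleaning step is my assumption):

import numpy as np
import matplotlib.pyplot as plt

np.log1p(merged['Weekly_Sales'].clip(lower=0)).hist(bins=50)
plt.show()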

The weekly sales range from 0 to about 12 on the log scale, as we can see here.

Next, as part of the EDA, I took the unique departments, divided them into four roughly equal parts, and plotted the mean sales record in slots of 20 departments.
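
A sketch of those line plots, 20 departments per figure (my slicing; the notebook's details may differ):

import matplotlib.pyplot as plt

depts = sorted(merged['Dept'].unique())
for i in range(0, len(depts), 20):
    block = merged[merged['Dept'].isin(depts[i:i + 20])]
    block.groupby(['Date', 'Dept'])['Weekly_Sales'].mean().unstack().plot(figsize=(14, 4))
    plt.show()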

By the line plots, we can see the following:

  • The sales level differs by department, and the sales level of a single department is stable over time.
  • There are some peaks around January and May, so there may be events driving high sales.
  • Some departments are highly related to those events; their sales records go up steeply around January or May.

Conclusion 1: Department is a good feature to predict sales.

Conclusion 2: Date (especially events) is a good feature to predict sales. Like department, a time series of sales by store should show the trend of sales. Assumption: like department, store will give a sense of the sales level. Assumption 2: in addition, this will also give the day of the highest sales.

Next I grouped by store and date, took the mean values, broke the stores into slots of 5, and plotted the mean sales along the time axis.
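
A sketch of this, mirroring the department plots above (again, my slicing):

import matplotlib.pyplot as plt

store_ids = sorted(merged['Store'].unique())
for i in range(0, len(store_ids), 5):
    block = merged[merged['Store'].isin(store_ids[i:i + 5])]
    block.groupby(['Date', 'Store'])['Weekly_Sales'].mean().unstack().plot(figsize=(14, 4))
    plt.show()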

As I could not find any single dominant relation among these columns for sales prediction, I decided to make a dummy model just to check what the worst case can look like.

I tried this dummy model and checked the WMAE on the Kaggle website; it came to a score of around 25000 on the leaderboard.
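
The post does not show the dummy model itself; a plausible minimal baseline (an assumption on my part, not necessarily what was submitted) predicts the historical per-(Store, Dept) mean for every test row:

# Hypothetical baseline: mean Weekly_Sales per (Store, Dept) from the training data
baseline = train.groupby(['Store', 'Dept'])['Weekly_Sales'].mean().reset_index()
pred = test.merge(baseline, on=['Store', 'Dept'], how='left')
pred['Weekly_Sales'] = pred['Weekly_Sales'].fillna(0)   # unseen pairs get 0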

I decided to work in a greedy manner: start from a few features and add one feature at a time whenever the score improves.

STARTING WITH THE MAIN SOLUTION

The first thing I did was some common featurization on the date part, as date is directly connected with sales. The core part of featurization is closely tied to domain knowledge.

Featurization is one of the most important steps towards modelling, because the accuracy of the models depends a lot on the features we choose. I decided to take a subset of all the features and see how the model performs.
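
A typical date featurization splits Date into its calendar parts (a sketch; the notebook's holiday featurization function follows right after):

merged['Year'] = merged['Date'].dt.year
merged['Month'] = merged['Date'].dt.month
merged['Week'] = merged['Date'].dt.isocalendar().week
merged['Day'] = merged['Date'].dt.day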

import datetime
import pandas as pd

# Names of the nine holiday indicator columns returned by get_holiday.
# get_holiday is a helper defined earlier in the notebook: given a
# 'YYYY-MM-DD' string, it returns a 0/1 indicator vector for these holidays.
HOLIDAY_COLS = ['super_bowl', 'labor', 'thank', 'chris', 'columbus',
                'veterans', 'independence', 'memorial', 'washington']

def holiday_indicators(data, offset_days, suffix=''):
    # Shift each week's date by offset_days and look up the holiday flags
    dates = data['Date'].apply(
        lambda x: (x + datetime.timedelta(days=offset_days)).strftime('%Y-%m-%d'))
    flags = pd.DataFrame([get_holiday(d) for d in dates])
    flags.columns = [col + suffix for col in HOLIDAY_COLS]
    return flags

def multi_holiday(data):
    dataset = data.copy()
    # Holiday flags for the week itself
    dataset = pd.merge(dataset, holiday_indicators(data, 0),
                       left_index=True, right_index=True)
    normal_holiday = dataset.copy()
    # Flags for the weeks one week before and after
    dataset = pd.merge(dataset, holiday_indicators(data, -7, '_p1'),
                       left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_indicators(data, 7, '_n1'),
                       left_index=True, right_index=True)
    PreNext1week = dataset.copy()
    # Flags for the weeks two weeks before and after
    dataset = pd.merge(dataset, holiday_indicators(data, -14, '_p2'),
                       left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_indicators(data, 14, '_n2'),
                       left_index=True, right_index=True)
    PreNext2week = dataset.copy()
    return normal_holiday, PreNext1week, PreNext2week
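
The three returned frames carry progressively wider holiday context: the week itself, then ±1 week, then ±2 weeks. Usage on the merged training frame, for example:

normal, pre_next1, pre_next2 = multi_holiday(merged)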

I used Store, Dept, Date, week and weekly sales as the training dataset, just to test how this subset of the features works, since the prediction of sales is directly related to these factors. I checked the scores on the Kaggle page; this came to somewhere around 5020.

LAST IPYNB
