Get started with Kaggle Competition: M5 Forecasting

Ajay n Jain
Published in Analytics Vidhya · 5 min read · Apr 26, 2020

Photo by Álvaro Serrano on Unsplash

Competition Description:

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-, department-, product-category-, and store-level details.

Also, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

Data:

We have been provided with 4 CSV files

  1. calendar.csv: Contains information about the dates on which the products are sold.
  2. sales_train_validation.csv: Contains the historical daily unit sales data per product and store.
  3. sell_prices.csv: Contains information about the price of the products sold per store and date.
  4. sample_submission.csv: The correct format for submissions.

Exploratory Data Analysis:

In this article, we will only look into the EDA of the challenge, not the modeling.

Reading the data

import pandas as pd

cal_data = pd.read_csv('calendar.csv')
prices = pd.read_csv('sell_prices.csv')
sales = pd.read_csv('sales_train_validation.csv')
#Printing the shape
print(cal_data.shape)
print(prices.shape)
print(sales.shape)

Calendar data has 1,969 rows and 14 columns, prices has 6,841,121 rows and 4 columns, and sales has 30,490 rows and 1,919 columns.

Starting the EDA of Sales Data

Viewing the first five rows of the sales table

sales.head()

Each row in the sales data holds the daily unit sales of one item in one store across all days.
For example, the first row corresponds to item 1 of the HOBBIES_1 department in store 1 in California.
The sales data has 1,919 columns: 6 are different types of IDs, and the other 1,913 are days, from d_1 to d_1913.
The sales data starts on 29-Jan-2011 and ends on 24-Apr-2016.
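As a quick sanity check (an aside, not part of the original kernel), the inclusive date range lines up exactly with the 1,913 day columns:

```python
import pandas as pd

# The d_1 ... d_1913 day columns map one-to-one onto this inclusive date range
dates = pd.date_range(start='2011-01-29', end='2016-04-24')
print(len(dates))  # 1913 days, one per d_* column
```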

Finding unique stores, items, departments, and categories

print('There are {0} items '.format(len(sales['item_id'].unique())))
print('There are {0} depts'.format(len(sales['dept_id'].unique())))
print('There are {0} categories'.format(len(sales['cat_id'].unique())))
print('There are {0} stores'.format(len(sales['store_id'].unique())))
print('There are {0} states'.format(len(sales['state_id'].unique())))

There are 3049 items, 7 departments, 3 categories, 10 stores across 3 states.
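The same counts can be obtained in one call with DataFrame.nunique. Sketched here on a tiny synthetic stand-in for the sales table (the rows below are illustrative, not real data):

```python
import pandas as pd

# Tiny synthetic stand-in for the real sales table (illustrative values only)
sales = pd.DataFrame({
    'item_id':  ['HOBBIES_1_001', 'HOBBIES_1_002', 'FOODS_3_001'],
    'dept_id':  ['HOBBIES_1',     'HOBBIES_1',     'FOODS_3'],
    'cat_id':   ['HOBBIES',       'HOBBIES',       'FOODS'],
    'store_id': ['CA_1',          'CA_1',          'TX_1'],
    'state_id': ['CA',            'CA',            'TX'],
})

# One call reports the unique count for every ID column at once
counts = sales[['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].nunique()
print(counts)
```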

Converting days to dates to analyze the data in a better way

date_list = [d.strftime('%Y-%m-%d') for d in pd.date_range(start='2011-01-29', end='2016-04-24')]
# Renaming the day columns (d_1 ... d_1913) to dates
sales.rename(columns=dict(zip(sales.columns[6:], date_list)), inplace=True)
sales.head()
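With the day columns renamed, a useful follow-on step (not in the original post) is reshaping the wide table to long format, so each row is one (item, store, date) observation. A minimal sketch on a one-row synthetic frame:

```python
import pandas as pd

id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']

# One-row synthetic stand-in with two day columns (illustrative only)
sales = pd.DataFrame({
    'id': ['HOBBIES_1_001_CA_1_validation'], 'item_id': ['HOBBIES_1_001'],
    'dept_id': ['HOBBIES_1'], 'cat_id': ['HOBBIES'],
    'store_id': ['CA_1'], 'state_id': ['CA'],
    '2011-01-29': [3], '2011-01-30': [0],
})

# melt turns the wide day columns into (date, units) rows
long_df = sales.melt(id_vars=id_cols, var_name='date', value_name='units')
print(long_df.shape)  # (2, 8): one row per day, 6 ID columns + date + units
```

Long format is what most time-series libraries and groupby-based feature engineering expect.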

Aggregating data by different IDs to get a better picture of how these
sales vary.

# Mean daily sales by department
dept_mean = sales.groupby(['dept_id']).mean(numeric_only=True).T
dept_mean.index = pd.to_datetime(dept_mean.index)
# Mean daily sales by category
cat_mean = sales.groupby(['cat_id']).mean(numeric_only=True).T
cat_mean.index = pd.to_datetime(cat_mean.index)
# Mean daily sales by store
store_mean = sales.groupby(['store_id']).mean(numeric_only=True).T
store_mean.index = pd.to_datetime(store_mean.index)
# Mean daily sales by state
state_mean = sales.groupby(['state_id']).mean(numeric_only=True).T
state_mean.index = pd.to_datetime(state_mean.index)

Plotting the data

# The frequency parameter 'm' aggregates by month; it can also be
# set to day, week, year, and so on
create_plots(dept_mean, 'm')
create_plots(cat_mean, 'm')
create_plots(store_mean, 'm')
create_plots(state_mean, 'm')
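create_plots is a helper defined in the author's kernel and not shown in this article; a plausible minimal version (an assumption based on how it is called above, not the author's exact code) resamples the date-indexed frame to the given frequency and plots each column:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def create_plots(df, freq):
    """Resample a date-indexed frame to `freq` (e.g. 'M' for month) and plot it."""
    resampled = df.resample(freq).mean()
    resampled.plot(figsize=(12, 5), title=f'Mean sales resampled by {freq}')
    plt.show()
    return resampled

# 59 daily values starting 2011-01-01 span exactly two calendar months
demo = pd.DataFrame({'CA_1': range(59)},
                    index=pd.date_range('2011-01-01', periods=59))
monthly = create_plots(demo, 'M')
print(len(monthly))  # 2 monthly rows (Jan and Feb 2011)
```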
Monthly mean sales of each Department
Monthly mean sales of each Store
Monthly mean sales of each Category
Monthly mean sales of each State

Findings:
1. By department: FOODS_3 sells the most.
2. By category: FOODS sells the most.
3. By individual store: CA_3 sells the most.
WI_2 started low but saw a sudden jump in level at the end of the first quarter of 2012.
WI_3 went downhill at the beginning of 2013.
CA_2 had a decreasing trend throughout 2014 but an increasing trend in 2015.
4. By state: CA sells the most.

The patterns of WI_2 in 2012 and CA_2 in 2015 are interesting, so let’s take a closer look at them.

WI_2 in 2012
CA_2 in 2015

We can see that in both instances there was a rise in level at the beginning of June, and the same departments, FOODS_3 and FOODS_2, account for the increase. This may be because of new product additions.

We can also plot data for a particular date range

Sales by Category from 2012 March to 2014 March

We can see that just before January, sales drop to zero; this is because Walmart is closed on Christmas.

Starting the EDA of Calendar Data

Viewing the first five rows of calendar

cal_data.head()

This data contains the details of the events on each day, and it also shows on which days SNAP purchases are allowed. SNAP follows a fixed pattern in each of the three states: in CA, SNAP is allowed on the first ten days of the month, TX follows the pattern 101–011, and WI follows the pattern 011.

What is SNAP: The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP). SNAP provides low-income families and individuals with an Electronic Benefits Transfer debit card to purchase food products. In many states, the monetary benefits are disbursed across 10 days of the month, with 1/10 of the recipients receiving the benefit on their card on each of those days.
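This pattern can be verified from the snap_CA, snap_TX, and snap_WI flag columns in calendar.csv. A sketch on a synthetic five-day slice (in the real data, each state has exactly ten SNAP days per month):

```python
import pandas as pd

# Synthetic five-day slice of calendar.csv; snap_* are 0/1 flags
cal = pd.DataFrame({
    'month':   [1, 1, 1, 1, 1],
    'snap_CA': [1, 1, 0, 1, 0],
    'snap_TX': [0, 1, 1, 0, 1],
    'snap_WI': [1, 0, 1, 1, 0],
})

# Summing the flags per month counts the SNAP-eligible days in each state
snap_days = cal.groupby('month')[['snap_CA', 'snap_TX', 'snap_WI']].sum()
print(snap_days)
```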

Findings regarding events

Events usually have null values, since there is not an event every day. Looking at the data, there are only 162 rows where event_name_1 is not null and just 5 rows where event_name_2 is not null.
There are 31 unique events belonging to 5 unique types. Since those 162 event rows span roughly five years of data, these events recur every year.
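The counts above come from simple null checks and unique counts on the event columns; a sketch on a synthetic three-day slice of the calendar (illustrative events only):

```python
import pandas as pd
import numpy as np

# Synthetic three-day slice of calendar.csv (illustrative events only)
cal = pd.DataFrame({
    'event_name_1': ['SuperBowl', np.nan, 'Easter'],
    'event_type_1': ['Sporting',  np.nan, 'Religious'],
})

n_event_days  = cal['event_name_1'].notna().sum()   # days with at least one event
n_events      = cal['event_name_1'].nunique()       # distinct events
n_event_types = cal['event_type_1'].nunique()       # distinct event types
print(n_event_days, n_events, n_event_types)
```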

Starting the EDA of Prices Data

Viewing the first five rows of prices

prices.head()

This table provides us with the sell_price of each item in each store in a particular week.

Plotting the histogram

import matplotlib.pyplot as plt

prices['sell_price'].hist(bins=50)
plt.xlim(0, 25)
Range of price of all items

We can see that most selling prices fall between 0 and 25, though some items go beyond this range.
The highest-priced item is HOUSEHOLD_2_516, priced at 107.32.
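The most expensive item can be pulled out with idxmax on sell_price; sketched on a two-row synthetic frame (HOUSEHOLD_2_516 and its price are from the real data, the rest is illustrative):

```python
import pandas as pd

# Two-row synthetic stand-in for sell_prices.csv
prices = pd.DataFrame({
    'store_id':   ['CA_1',            'CA_1'],
    'item_id':    ['HOUSEHOLD_2_516', 'FOODS_1_001'],
    'sell_price': [107.32,            2.00],
})

# idxmax returns the row label of the maximum price; .loc fetches that row
top = prices.loc[prices['sell_price'].idxmax()]
print(top['item_id'], top['sell_price'])
```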

End of EDA

Next Steps

  1. Get started with modeling
  2. Do more EDA to create more features

Links

Competition Link: M5 Forecasting Accuracy
Kernel Link: To look at the complete code

Final Notes

This is my first competition and my first time working with Time Series Data.
I would love your inputs regarding the Time Series and Retail Domain.
Thank You!
