M5 Forecasting - Accuracy: Time Series Forecasting Using Walmart Sales Data

Aakash V · Published in Analytics Vidhya · 5 min read · Jul 16, 2020

Competition Overview:

This competition, M5 Forecasting - Accuracy, provides hierarchical sales data from Walmart. The goal is to predict the sales of each product for the next 28 days across 10 different stores in three states: California, Texas and Wisconsin.

The entire code for this blog and my Python solution notebook for this competition can be found in this github link.

Data Description:

The dataset consists of the sales history of 3,049 unique products in 3 categories across three US states, along with the price of each product and the events held on each day. It is split into three CSV files:

  1. calendar.csv: contains information about the dates on which the products are sold and the events that were held on each day.
  2. sales_train_evaluation.csv: contains the historical daily unit sales of each product at each store from day 1 to day 1941.
  3. sell_prices.csv: contains the price of each product per store per week.

Exploratory Data Analysis:

EDA is the most crucial step for understanding the data and creating robust features, so let’s explore the data.

# read the three files and check the head of each dataframe
import pandas as pd

data = pd.read_csv("sales_train_evaluation.csv")
sell_prices = pd.read_csv("sell_prices.csv")
calendar = pd.read_csv("calendar.csv")
data.head()
calendar.head()
sell_prices.head()

There are 3,049 unique products in 3 categories and 7 departments (sub-categories), sold in 10 stores across 3 states, namely California, Texas and Wisconsin.

As mentioned in the data description, daily sales for each product at each store are given, and the local and global events held on each day are provided in the calendar details.

SNAP (the Supplemental Nutrition Assistance Program) is a nutritional assistance program in the US through which low-income families can purchase food on specific dates. The calendar provides a binary feature per state indicating whether SNAP purchases are allowed on that date.
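As an illustration only (not the notebook’s exact code), once the calendar columns are merged onto the long-format sales table shown in the feature engineering section, a single SNAP flag matching each row’s own state could be derived like this; the snap column name is a made-up example:

import numpy as np

# pick the SNAP flag of each row's own state (calendar.csv provides snap_CA, snap_TX, snap_WI)
data['snap'] = np.select(
    [data['state_id'] == 'CA', data['state_id'] == 'TX', data['state_id'] == 'WI'],
    [data['snap_CA'], data['snap_TX'], data['snap_WI']],
    default=0,
)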

Product prices are not constant, and the price of each product for every week is provided. Now let’s visualize the sales trend using matplotlib and seaborn.
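The plot images below are from the notebook; as a rough sketch, assuming the wide sales frame with day columns d_1 … d_1941 read above, the category-level totals and the 7-day rolling trend could be produced along these lines:

import matplotlib.pyplot as plt
import seaborn as sns

# day columns of the wide sales frame hold the daily unit sales
day_cols = [c for c in data.columns if c.startswith('d_')]

# total units sold per category (similar plots can be made per department, store or state)
cat_sales = data.groupby('cat_id')[day_cols].sum().sum(axis=1)
sns.barplot(x=cat_sales.index, y=cat_sales.values)
plt.title('Category-wise sales of products')
plt.show()

# daily sales per category smoothed with a 7-day rolling mean
daily_by_cat = data.groupby('cat_id')[day_cols].sum().T
daily_by_cat.rolling(7).mean().plot(figsize=(12, 4))
plt.title('Sales trend per category, 7-day rolling mean')
plt.show()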

Category-wise sales of products

Department-wise sales of products

State-wise sales of products

Sales trend of categories in each state with a 7-day rolling mean

1. Products are sold mostly in CA, particularly in store CA_3.
2. The FOODS_3 department sells the most, ahead of the other food departments and the other categories.
3. There is an upward trend in sales, which drives the company’s revenue growth.

In the last plot, with a 7-day rolling mean, a distinct upward seasonal trend can be observed for each category in each state, so building a separate model for each such pair should result in a better score.

Feature Engineering: Lag Features

Creating lag features is the classical way of transforming a time series forecasting problem into a supervised learning problem.

A lag feature uses the target value on day t-1 as a feature for day t (that is, the previous day’s sales become a new column). Since the goal is to predict the sales for the next 28 days, lags of 28, 29 and 30 days are created.
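The lag code below operates on a long-format table with one row per item, store and day and a single demand column, joined with the calendar and price files. The full reshaping is in the notebook; a minimal sketch of it, with column names taken from the CSV files, could look like this:

# reshape the wide day columns (d_1 ... d_1941) into one 'demand' value per row
id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
data = pd.melt(data, id_vars=id_cols, var_name='d', value_name='demand')

# attach dates, events and SNAP flags, then the weekly sell price of each item per store
data = data.merge(calendar, on='d', how='left')
data = data.merge(sell_prices, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')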

data['lag_t28'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28))
data['lag_t29'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(29))
data['lag_t30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(30))

Other statistical features such as the mean, standard deviation, skewness and kurtosis are also calculated from the lagged data over different rolling windows.

data['rolling_mean_t7']   = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(7).mean())
data['rolling_std_t7'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(7).std())
data['rolling_mean_t30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).mean())
data['rolling_mean_t90'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(90).mean())
data['rolling_mean_t180'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(180).mean())
data['rolling_std_t30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).std())
data['rolling_skew_t30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).skew())
data['rolling_kurt_t30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).kurt())

As other important features, the change in sell price compared to the previous day and to the maximum price over the past year is also created.

data['lag_price_t1'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1))
data['price_change_t1'] = (data['lag_price_t1'] - data['sell_price']) / (data['lag_price_t1'])
# maximum price over the past year; assumed computed this way in the full notebook
data['rolling_price_max_t365'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1).rolling(365).max())
data['price_change_t365'] = (data['rolling_price_max_t365'] - data['sell_price']) / (data['rolling_price_max_t365'])
data.drop(['rolling_price_max_t365', 'lag_price_t1'], inplace = True, axis = 1)

Encoding the categorical columns:

# missing values in the event columns are filled with 'noevent' and encoded using LabelEncoder
from sklearn.preprocessing import LabelEncoder

nan_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
for feature in nan_features:
    data[feature].fillna('noevent', inplace = True)

cat = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
for feature in cat:
    encoder = LabelEncoder()
    data[feature] = encoder.fit_transform(data[feature])

Modelling using LightGBM:

Once EDA and feature engineering are done, LightGBM, a popular gradient boosting framework that is considerably faster than XGBoost, is used to forecast the sales for the next 28 days.

As noted in the EDA, each department in each store has a unique trend, so each store-department pair is forecasted separately, as sketched below.
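The exact training loop is in the notebook; a rough sketch of fitting one LightGBM model per store-department pair might look like the following (the feature list, validation split and hyperparameters here are illustrative, not the notebook’s exact setup):

import lightgbm as lgb

# convert 'd_1913' -> 1913 so the split can be expressed in day numbers
data['day'] = data['d'].str[2:].astype(int)

# illustrative feature list and parameters
features = ['item_id', 'event_name_1', 'sell_price', 'lag_t28',
            'rolling_mean_t7', 'rolling_mean_t30', 'price_change_t1']
params = {'objective': 'poisson', 'metric': 'rmse', 'learning_rate': 0.075}

for (store, dept), group in data.groupby(['store_id', 'dept_id']):
    train = group[group['day'] <= 1913]                             # history used for fitting
    valid = group[(group['day'] > 1913) & (group['day'] <= 1941)]   # last 28 known days
    model = lgb.train(
        params,
        lgb.Dataset(train[features], label=train['demand']),
        valid_sets=[lgb.Dataset(valid[features], label=valid['demand'])],
        num_boost_round=1200,
    )
    # rows for the 28 future days, built with the same features, are then
    # scored with model.predict(...) to produce this pair's forecast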

EVALUATION METRIC AND RESULT:

The Weighted Root Mean Squared Scaled Error (WRMSSE) is used as the evaluation metric for the competition, and the score obtained with LightGBM is 0.62128 on the private leaderboard of Kaggle.
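For intuition, the “scaled” part of the metric divides the model’s squared error by that of a one-step naive forecast over the training history; a rough sketch for a single series (the full WRMSSE additionally weights every series by its recent dollar sales) could be:

import numpy as np

def rmsse(y_train, y_true, y_pred):
    # forecast MSE scaled by the MSE of the naive one-step forecast on the history
    scale = np.mean(np.diff(y_train) ** 2)
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2) / scale)

# example usage: rmsse(train_demand, actual_next_28_days, predicted_next_28_days)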

LINKS:

Solution Link: The entire code for this blog and the solution notebook can be found in this github link; it is a silver medal solution for this competition.

Hope you liked this blog. Any comments are welcome, thanks!
