Rossmann Store Sales Prediction

Keshav Rawat · Published in Analytics Vidhya · 10 min read · Oct 31, 2020

This is my first Medium story. I hope you find it interesting.


Table of Contents:

  1. Introduction.
  2. Problem statement and error metric.
  3. Objectives and data information.
  4. Exploratory Data Analysis (EDA).
  5. Feature Engineering.
  6. Modelling.

Introduction:

The demand for a product or service keeps changing over time, and no business can improve its financial performance without accurately estimating customer demand and future sales. Sales forecasting is the process of estimating demand for, or sales of, a particular product over a specific period of time. In this article I will show how machine learning can be used to predict sales for a real-world business problem taken from Kaggle. The case study is solved from scratch, so you will see every phase of how such a problem is tackled in practice.

Problem Statement

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Error Metric: Root Mean Square Percentage Error (RMSPE)

The formula for the metric is:

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$

where $y_i$ denotes the sales of a single store on a single day, $\hat{y}_i$ the corresponding prediction, and days with zero sales are ignored in scoring.
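A direct NumPy implementation of the metric (a minimal sketch; rows with zero sales are masked out, matching the scoring rule above):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Square Percentage Error; days with zero sales are ignored,
    as in the Kaggle evaluation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0                      # avoid division by zero
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))
```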

Objectives:

1. Predict sales for 6 weeks in advance given the data.

2. Minimize the given metric as much as possible.

Data:

Data can be downloaded from here.

Files provided are:

  1. train.csv
  2. test.csv
  3. store.csv

Data fields:

Id - an Id that represents a (Store, Date) pair within the test set.

Store - a unique Id for each store.

Sales - the turnover for any given day (this is what you are predicting).

Customers - the number of customers on a given day.

Open - an indicator for whether the store was open: 0 = closed, 1 = open.

StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.

SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools.

StoreType - differentiates between 4 different store models: a, b, c, d.

Assortment - describes an assortment level: a = basic, b = extra, c = extended.

CompetitionDistance - distance in meters to the nearest competitor store.

CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened.

Promo - indicates whether a store is running a promo on that day.

Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.

Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2.

PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store.

Exploratory Data Analysis (EDA)

Let’s perform EDA to gain insights into the given data.

Information about train.csv is as follows:

train.info()

As we can see, we have around 1 million data points. Also, since this is a time-series prediction problem, we have to sort the data by date.

Here our target variable is Sales.

Information about store.csv is as follows:

store.info()

We have a total of 1,115 unique stores, and several columns contain null values. We will take care of them in a while.

Now let's check out the columns present in the data.

Promo:

Promo column against Sales and Customers.

We can see that both Sales and Customers increase by a significant amount during promotions. This shows that promotions have a positive effect on a store.

Sales:

Average sales over week.

It is also interesting to note that Christmas and New Year (see the graph near week 52) lead to an increase in sales. As Rossmann sells health and beauty products, one may guess that people buy more of them when going out to celebrate, and this might be the cause of the sudden increase in sales.

DayOfWeek:

Sales and Customers against DayOfWeek column.

Here we observe that Sales and Customers are both much lower on Sundays, as most stores are closed that day.

Also, Monday has the highest sales of the whole week, which might be because stores are closed on Sundays.

Open:

Store Open information during a week.

This clearly shows that most stores remain closed on Sundays. Some stores were closed on weekdays too, which might be due to state holidays, as stores are generally closed during State Holidays but open during School Holidays.

State and School Holidays:

Sales and Customers according to State Holidays.
Sales and Customers according to School Holidays.

We can observe that most stores remain closed during State and School Holidays. But it is interesting to note that more stores stay open during School Holidays than during State Holidays.

Another important thing to note is that the stores that were open during School Holidays had higher sales than normal.

StoreType:

Customers and Sales according to StoreType of a store.

We can see that stores of type A have the highest total Customers and Sales; StoreType D comes second in both.

CompetitionOpenSinceYear:

Distribution Plot of competition open since year.

Most stores saw their nearest competitor open after the year 2000.

Fourier Analysis for seasonality:

StoreType ‘A’.
StoreType ‘B’.
StoreType ‘C’.
StoreType ‘D’.

Here we have plotted the Fourier spectrum of the sales of one store from each store type.

In the above graphs we can see that there are some spikes at particular frequencies. This indicates that there is a seasonality component in the store sales data.

Thus we can use these Fourier features to indicate the seasonality in the data.
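These spectra can be produced with NumPy's FFT; a minimal sketch (store 1 is an arbitrary example, and `train` is assumed to hold the training data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Daily sales of one store, in date order (store 1 is an arbitrary example)
sales = train[train['Store'] == 1].sort_values('Date')['Sales'].values

spectrum = np.fft.rfft(sales - sales.mean())  # subtract the mean to drop the DC peak
freqs = np.fft.rfftfreq(len(sales), d=1)      # frequencies in cycles per day

plt.plot(freqs, np.abs(spectrum))
plt.xlabel('Frequency (cycles/day)')
plt.ylabel('Amplitude')
plt.show()
```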

CompetitionDistance:

Distribution of competition distance of stores.

We can clearly observe that most stores have their nearest competitor within a 5 km range.

Conclusions of EDA:

  • The most selling and most crowded store type is A.
  • Sales are highly correlated with the number of Customers.
  • For all stores, promotions lead to an increase in both Sales and Customers.
  • Stores that are open during School Holidays have higher sales than on normal days.
  • More stores are open during School Holidays than during State Holidays.
  • Sales increase during Christmas week, which might be because people buy more beauty products for Christmas celebrations.
  • The absence of values in CompetitionOpenSinceYear/Month does not indicate the absence of competition, as CompetitionDistance is not null where those two values are null.
  • After analysing sales using Fourier decomposition, I found that there is a small seasonality component in the Sales data.

Feature Engineering and Other Small Steps

Outliers Column:

Here I will create a column that indicates whether a Sales value is an outlier, based on the Median Absolute Deviation (MAD).

$$\mathrm{MAD} = \mathrm{median}_i\left(\left|X_i - \mathrm{median}(X)\right|\right)$$

I created the outlier column store-wise, i.e. for every unique store separately, and then concatenated the data back together.
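A per-store sketch of this idea, using the modified z-score that the MAD supports (the 3.5 cutoff and the `Outlier` column name are my assumptions, not necessarily the author's exact choices):

```python
import numpy as np
import pandas as pd

def mad_outlier_flags(sales, thresh=3.5):
    """Flag points whose modified z-score (based on MAD) exceeds thresh.
    thresh=3.5 is a common rule of thumb, assumed here."""
    med = np.median(sales)
    mad = np.median(np.abs(sales - med))
    if mad == 0:
        return pd.Series(0, index=sales.index)
    modified_z = 0.6745 * (sales - med) / mad
    return (np.abs(modified_z) > thresh).astype(int)

# Computed separately for every store, then stitched back together
train['Outlier'] = train.groupby('Store')['Sales'].transform(mad_outlier_flags)
```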

Date Features:

Date features.

First we convert the Date column using pandas' to_datetime function. After that we can simply extract the other features from Date.
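A minimal sketch using pandas' datetime accessors (the feature names match those used in the models below):

```python
import pandas as pd

train['Date'] = pd.to_datetime(train['Date'])

train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
train['Day'] = train['Date'].dt.day
train['WeekOfYear'] = train['Date'].dt.isocalendar().week.astype(int)
train['DayOfYear'] = train['Date'].dt.dayofyear
```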

Holidays this week, last week and next week:

Here I created three features that show the total number of holidays in the current week, the previous week, and the next week. The code can be seen on my GitHub here.
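A rough sketch of the idea (the exact implementation is on GitHub; this version counts state holidays per calendar week and glosses over the fact that holidays differ by state):

```python
import pandas as pd

# One row per calendar date, flagged as holiday or not
days = train.drop_duplicates('Date').sort_values('Date')
days['IsHoliday'] = (days['StateHoliday'] != '0').astype(int)

# Holidays per (Year, WeekOfYear); shifting gives the previous/next week
weekly = days.groupby(['Year', 'WeekOfYear'])['IsHoliday'].sum()
lookup = pd.DataFrame({
    'HolidaysThisWeek': weekly,
    'HolidaysLastWeek': weekly.shift(1),
    'HolidaysNextWeek': weekly.shift(-1),
})
train = train.join(lookup, on=['Year', 'WeekOfYear'])
```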

State Holiday counter:

The function shown here creates two new features: one indicating how many days are left before a state holiday, the other how many days have passed since one.
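The original function was shown as an image; a minimal reconstruction of the idea (a running counter scanned forward for "days since", and backward for "days until"):

```python
import numpy as np

def days_since_event(flags):
    """For each day, the number of days elapsed since the last flagged day
    (np.inf before the first event)."""
    counter, out = np.inf, []
    for f in flags:
        counter = 0 if f else counter + 1
        out.append(counter)
    return out

days = train.drop_duplicates('Date').sort_values('Date')
is_hol = (days['StateHoliday'] != '0').tolist()

days['StateHoliday_after_days'] = days_since_event(is_hol)
# Days left before the next holiday: scan the flags in reverse
days['StateHoliday_before_days'] = days_since_event(is_hol[::-1])[::-1]

train = train.merge(
    days[['Date', 'StateHoliday_before_days', 'StateHoliday_after_days']],
    on='Date', how='left')
```

The same helper applies unchanged to the Promo and SchoolHoliday flags used in the next section.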

Promo and School holiday counter:

In the same way, I created another four features indicating the number of days before or after a Promo or a School Holiday; the helper above generalizes directly. Code is on GitHub here.

Dummy variable for Close:

This feature simply takes two values, +1 or -1: +1 if the store was closed yesterday or will be closed tomorrow, and -1 otherwise.
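A sketch using per-store shifts of the Open flag (the `CloseDummy` name is mine; the frame is assumed to be sorted by Store and Date):

```python
import numpy as np

# Assumes train is sorted by Store and Date
g = train.groupby('Store')['Open']
closed_yesterday = g.shift(1).fillna(1).eq(0)
closed_tomorrow = g.shift(-1).fillna(1).eq(0)
train['CloseDummy'] = np.where(closed_yesterday | closed_tomorrow, 1, -1)
```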

Removing points with zero Sales:

Data points with zero sales are removed because they indicate that the particular store was closed for some reason. If a store is not open, we can simply predict zero sales for it.
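In pandas this is a one-line filter:

```python
# Closed days carry no signal for the regressor; at prediction time,
# rows with Open == 0 are simply assigned zero sales.
train = train[train['Sales'] > 0]
```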

Sales_per_day, Customers_per_day and Sales_per_customers_per_day:

The names of these features speak for themselves: per-store averages of sales per day, customers per day, and sales per customer per day.

Competition Open and Promo Open:

Here we simply convert these two features into elapsed time measured in months: how long the nearest competitor has been open, and how long the store has participated in Promo2.
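A sketch of the conversion (weeks are turned into months at roughly four weeks per month; negative values, meaning "not open yet", are clipped to zero):

```python
# Months since the nearest competitor opened
train['CompetitionOpen'] = (
    12 * (train['Year'] - train['CompetitionOpenSinceYear'])
    + (train['Month'] - train['CompetitionOpenSinceMonth']))

# Months since the store joined Promo2
train['PromoOpen'] = (
    12 * (train['Year'] - train['Promo2SinceYear'])
    + (train['WeekOfYear'] - train['Promo2SinceWeek']) / 4.0)

cols = ['CompetitionOpen', 'PromoOpen']
train[cols] = train[cols].clip(lower=0)
```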

Feature generation using Promointerval:

PromoInterval is given as a comma-separated string such as "Feb,May,Aug,Nov". We simply split it so that each month becomes its own feature (PromoInterval0 through PromoInterval3).
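A sketch of the split (NaN intervals, for stores not in Promo2, become empty strings first):

```python
import pandas as pd

# "Feb,May,Aug,Nov" -> four columns PromoInterval0 .. PromoInterval3
months = train['PromoInterval'].fillna('').str.split(',', expand=True)
months.columns = [f'PromoInterval{i}' for i in range(months.shape[1])]
train = pd.concat([train, months], axis=1)
```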

Variation and Acceleration of Sales:

$\mathrm{Variation}_t = y_t - y_{t-1}$

$\mathrm{Acceleration}_t = y_{t-1} - y_{t-2}$, where $y$ denotes Sales.
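In pandas these are per-store lagged differences (assuming the frame is sorted by Store and Date):

```python
# Assumes train is sorted by Store and Date
g = train.groupby('Store')['Sales']
train['Variation'] = g.diff()                     # y_t - y_(t-1)
train['Acceleration'] = g.shift(1) - g.shift(2)   # y_(t-1) - y_(t-2)
```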

Fourier Features:

Here I simply use NumPy's fft module to compute Fourier frequencies and amplitudes, then use them as features.
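A sketch of turning the spectrum into per-store features: keep the frequencies and amplitudes of the strongest components (the ranking-by-amplitude scheme is an assumption; the resulting names line up with the Frequency_2/Amplitude_2-style features selected later):

```python
import numpy as np
import pandas as pd

def fourier_features(sales, k=3):
    """Frequencies and amplitudes of the k strongest periodic components."""
    spectrum = np.fft.rfft(sales - np.mean(sales))  # mean removed: no DC peak
    amps = np.abs(spectrum)
    freqs = np.fft.rfftfreq(len(sales), d=1)        # cycles per day
    top = np.argsort(amps)[::-1][:k]                # strongest components first
    feats = {f'Frequency_{i+1}': freqs[j] for i, j in enumerate(top)}
    feats.update({f'Amplitude_{i+1}': amps[j] for i, j in enumerate(top)})
    return pd.Series(feats)

per_store = (train.groupby('Store')['Sales']
                  .apply(lambda s: fourier_features(s.values))
                  .unstack())
train = train.join(per_store, on='Store')
```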

Some other features:

These include central tendencies (means, medians) of Sales computed over groupings such as DayOfWeek, Promo, and holidays.
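For example, the average sales for each (Store, DayOfWeek, Promo) combination can be attached with a groupby (one of many such aggregates; the column name is mine):

```python
avg = (train.groupby(['Store', 'DayOfWeek', 'Promo'])['Sales']
            .mean()
            .rename('AvgSales_Store_Dow_Promo')
            .reset_index())
train = train.merge(avg, on=['Store', 'DayOfWeek', 'Promo'], how='left')
```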

External Data:

There are two external datasets: one is state data, indicating which state each store belongs to; the other is weather data for each state on each date.

Additional data/information was taken from here.

VIF Analysis:

After adding all the features I performed VIF analysis to check the collinearity among the features. Features with high collinearity were simply discarded.
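VIF can be computed with statsmodels; features with a VIF above a cutoff such as 10 (a common rule of thumb, assumed here) are candidates for removal:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = train[numeric_features].dropna()   # numeric_features: your list of numeric columns
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values('VIF', ascending=False))
```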

Let's do some modelling now.

Modelling

Base model:

These are the features that I kept to create a base model:

'Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'Year', 'Month', 'Day', 'WeekOfYear', 'DayOfYear', 'SalesPerDay', 'Customers_per_day', 'Sales_Per_Customers_Per_Day', 'PromoInterval0', 'PromoInterval1', 'PromoInterval2', 'PromoInterval3'.

To preprocess the data I used sklearn Pipeline and ColumnTransformer.

Numerical values are imputed with the median and categorical values with the most frequent category; numeric values are also scaled.

Next, we split the data into training and validation sets.

For the base model I used a simple Random Forest Regressor with the default configuration.
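A sketch of the whole setup (column lists abbreviated; one-hot encoding of the categoricals and holding out the last six weeks are my assumptions, the latter chosen so the split mirrors the six-week test horizon):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

numeric_cols = ['CompetitionDistance', 'SalesPerDay', 'Customers_per_day']  # abbreviated
categorical_cols = ['StateHoliday', 'StoreType', 'Assortment']              # abbreviated

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])

model = Pipeline([('prep', preprocess),
                  ('rf', RandomForestRegressor(n_jobs=-1, random_state=42))])

# Time-based split: hold out the last six weeks for validation
cutoff = train['Date'].max() - pd.Timedelta(weeks=6)
tr, val = train[train['Date'] <= cutoff], train[train['Date'] > cutoff]

features = numeric_cols + categorical_cols
model.fit(tr[features], tr['Sales'])
print('RMSPE:', rmspe(val['Sales'], model.predict(val[features])))  # rmspe from above
```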

Feature Selection:

After creating the base model above, I performed forward feature selection on all the new features generated in the feature engineering section.
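Forward selection is a greedy loop: keep adding whichever candidate feature improves validation RMSPE the most, and stop when nothing helps. A sketch of the idea (`make_model` builds a fresh estimator; `rmspe` is the metric defined earlier):

```python
def forward_selection(base_feats, candidates, make_model, X_tr, y_tr, X_val, y_val):
    """Greedy forward feature selection on validation RMSPE (a sketch)."""
    selected, best = list(base_feats), float('inf')
    candidates = list(candidates)
    improved = True
    while improved and candidates:
        improved = False
        scores = {f: rmspe(y_val,
                           make_model().fit(X_tr[selected + [f]], y_tr)
                                       .predict(X_val[selected + [f]]))
                  for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] < best:
            best, improved = scores[f_best], True
            selected.append(f_best)
            candidates.remove(f_best)
    return selected, best
```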

The following features remained at the end:

'Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'Year', 'Month', 'Day', 'WeekOfYear', 'DayOfYear', 'SalesPerDay', 'Customers_per_day', 'Sales_Per_Customers_Per_Day', 'PromoInterval0', 'PromoInterval1', 'PromoInterval2', 'PromoInterval3', 'Acceleration', 'State', 'Promo_before_days', 'Promo_after_days', 'Frequency_2', 'Frequency_3', 'Amplitude_2', 'Amplitude_3', 'Mean_TemperatureC', 'Events'.

Now with these features we can try different models.

The preprocessing pipeline is rebuilt with the new features. For all models, the best parameters were chosen by trial and error.

SGD Regressor:

The score for this model is:

Decision Tree Regressor:

The score for this model is:

Random Forest Regressor:

The score for this model is:

LightGBM Model:

The score for this model is:

Meta Learning:

This approach works as follows:

  1. Divide the data into an 80–20 split.
  2. Divide the train part into two halves, D1 and D2.
  3. Take 9 samples from D1 and train a Random Forest Regressor on each sample.
  4. Predict D2 with these 9 models. Use the 9 predictions as features and D2's original targets as output to train a new (meta) model.
  5. For the test set, predict with the 9 base models, pass the 9 predictions as features to the meta model, and use its output as the final prediction.
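A sketch of this stacking scheme (X and y are assumed to be the feature matrix and Sales target; bootstrap sampling and the model settings are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 1. 80-20 split (shuffle=False keeps the time order), then D1/D2
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, shuffle=False)

# 2. Train 9 base models on (bootstrap) samples of D1
base_models = []
for seed in range(9):
    idx = np.random.RandomState(seed).choice(len(X_d1), size=len(X_d1), replace=True)
    m = RandomForestRegressor(n_jobs=-1, random_state=seed)
    m.fit(X_d1.iloc[idx], y_d1.iloc[idx])
    base_models.append(m)

# 3. Base-model predictions on D2 are the meta-model's training features
meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
meta_model = RandomForestRegressor(n_jobs=-1, random_state=42).fit(meta_X, y_d2)

# 4. Final prediction: base models -> meta model
meta_test = np.column_stack([m.predict(X_test) for m in base_models])
final_pred = meta_model.predict(meta_test)
```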

The score for this approach is:

Conclusion:

Table for all scores.

From the table above we can infer that the best model is the LightGBM model.

Future Work:

With the advancements in deep learning, an LSTM might be a good starting point to further improve performance on this dataset.

Other ensemble techniques could also be tried to see whether they improve the result.

References:

Outliers make us go MAD: Univariate Outlier Detection

Multicollinearity | Detecting Multicollinearity with VIF

Missing Data Conundrum: Exploration and Imputation Techniques

Jupyter Notebook Viewer

You can connect with me on LinkedIn and GitHub.
