Rossmann Pharmaceutical Sales Prediction: a Deep Learning Approach

Amdework Asefa
12 min read · Sep 10, 2022

In my previous post, we looked at A/B testing with machine learning. In this post, we will look at pharmaceutical sales prediction for the Rossmann drug store chain, predicting six weeks of sales with deep learning models.

Sample image for Rossmann Drug Store and Sales. Source

Table of Contents

1. Business Need
2. Introduction
3. Data Understanding
i. Data Preprocessing
ii. Data Exploration
4. Long Short-Term Memory (LSTM)
5. Building Deep Learning models with sklearn pipelines (LSTM)
6. Conclusion
7. Reference

1. Business Need

The finance team wants to forecast sales in all their stores across several cities six weeks ahead of time. Managers in individual stores rely on their years of experience as well as their personal judgment to forecast sales.

The data team identified factors such as promotions, competition, school and state holidays, seasonality, and locality as necessary for predicting the sales across the various stores.

  • Our job is to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.

2. Introduction

First, a little background on Rossmann.

Rossmann is a nationwide German drugstore chain. Its founder, Dirk Rossmann, opened his first store in Hanover in 1972 and became a pioneer of the self-service drugstore concept. Rossmann is one of the largest drug store chains in Europe, with around 56,200 employees and more than 4,000 stores. If you are interested in reading more about the Rossmann drug store, click here

3. Data Understanding

The most time-consuming aspect of any data science project is the transformation of data to a format that an analyst can use to build models. This is more critical for parametric models, which assume known distributions in the data. However, even before you begin to transform the data, you need to understand it.

The objectives of data understanding are:

  • Understand the attributes of the data.
  • Summarize the data by identifying key characteristics, such as data volume and total number of variables in the data.
  • Understand missing values, inaccuracies, and outliers.
  • Visualize the data to validate the key characteristics of the data or unearth problems with the summary statistics.

In this post, we will go through each of these basic data understanding steps.

The data for this blog is available on Kaggle (Rossmann Store Sales). After you download it, you will see that the data has the following fields.

1. Id — an Id that represents a (Store, Date) duple within the test set
2. Store — a unique Id for each store
3. Sales — the turnover for any given day (this is what you are predicting)
4. Customers — the number of customers on a given day
5. Open — an indicator for whether the store was open: 0 = closed, 1 = open
6. StateHoliday — indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
7. SchoolHoliday — indicates if the (Store, Date) was affected by the closure of public schools
8. StoreType — differentiates between 4 different store models: a, b, c, d
9. Assortment — describes an assortment level: a = basic, b = extra, c = extended. Read more about assortment here
10. CompetitionDistance — distance in meters to the nearest competitor store
11. CompetitionOpenSince [Month/Year] — gives the approximate year and month of the time the nearest competitor was opened
12. Promo — indicates whether a store is running a promo on that day
13. Promo2 — Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
14. Promo2Since [Year/Week] — describes the year and calendar week when the store started participating in Promo2
15. PromoInterval — describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g., “Feb, May, Aug, Nov” means each round starts in February, May, August, November of any given year for that store.

Now that we have the data, we need to prepare the raw data before exploring and visualizing it.

https://github.com/Amdework21/Rossmann-Pharmaceutical-Sales-prediction.git

i. Data Preprocessing

Data preprocessing is the process of turning raw data into a clean data set. The dataset is checked for missing values, noisy data, outliers, and other inconsistencies before it is fed to the algorithm.

Once you extract the downloaded data, you will get four separate files as follows.

Rossmann-store-sales Dataset files (Image by Author)

Before preprocessing, let's look at the structure of the raw data:

Sample records from ‘store’ data and its attributes
Sample records from ‘train’ data and its attributes
Sample records from ‘test’ data and its attributes

To identify whether the data has missing values, we use the colums_WithMissingValue() Python function from the scripts folder.

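def colums_WithMissingValue(self):
    """Return the names of the columns that contain at least one missing value."""
    # isnull().any() gives one boolean per column: True if the column has a NaN
    has_missing = self.df.isnull().any()
    # Keep only the column names flagged True
    miss = has_missing[has_missing].index.tolist()
    self.logger.info(f"Columns with missing values: {miss}")
    return miss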

The function returns the list of attributes that have missing values.

Columns from ‘store’ data with missing values
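Knowing which columns have gaps is only half the picture; the share of missing values per column decides how to treat them. The helper below is a minimal sketch of such a report. The function name `missing_value_report` and the toy values are assumptions for illustration, not part of the project's scripts folder.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have gaps, worst first
    return report[report["missing"] > 0].sort_values("percent", ascending=False)


# Toy frame standing in for the 'store' data (values are made up)
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, np.nan, 14130.0, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
})
report = missing_value_report(toy_store)
print(report)
```

Columns near the top of such a report are the candidates for a constant fill, while columns with only a few gaps can be filled from the observed values.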

We have to repeat this step for all the data (train, test, and sample_submission).

The next step is fixing the missing values.

Handling missing values from ‘train’ and ‘test’ data

As the screenshot above shows, the Open attribute in the test data had a few missing values (about 0.03%), which are now fixed.

However, in the store data, we found more missing values.

Missing values from store data

The other important step is to look for outliers and prepare the data for further analysis by filling missing values: the median for numerical variables (it is less sensitive to outliers than the mean), the mode for categorical variables, and 0 for fields with more than 30% missing values. This is done as follows.

# Fill CompetitionDistance with the median, which is robust to outliers
store_data['CompetitionDistance'] = store_data['CompetitionDistance'].fillna(
    store_data['CompetitionDistance'].median())
# Fill the sparsely populated Promo2 and competition-opening fields with 0
zero_fill_cols = ['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval',
                  'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']
store_data[zero_fill_cols] = store_data[zero_fill_cols].fillna(0)

One more thing we have to do is convert the ‘Date’ column to datetime so we can split it into month, day, and year for time-based explorations.

Converting the ‘Date’ attribute of train data to ‘datetime’.
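The conversion itself was shown as a screenshot. A minimal sketch of the idea, assuming pandas and a toy frame with made-up rows in place of the real train data, could look like this; the new column names `Year`, `Month`, and `Day` are illustrative choices.

```python
import pandas as pd

# Toy rows standing in for the 'train' data (values are made up)
train_data = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2015-07-31", "2015-07-30", "2015-07-31"],
    "Sales": [5263, 5020, 6064],
})

# Parse the text dates into datetime64 values
train_data["Date"] = pd.to_datetime(train_data["Date"])

# Split the date into calendar parts for time-based exploration
train_data["Year"] = train_data["Date"].dt.year
train_data["Month"] = train_data["Date"].dt.month
train_data["Day"] = train_data["Date"].dt.day

print(train_data.dtypes)
```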

We are done cleaning and preprocessing our data. Now we can check, as follows, whether any missing values remain.

Check all data similarly. All are okay.

If you want to see the shapes of the datasets, you can run the following Python command in your Jupyter notebook.

Showing the shape of the dataset.

Finally, we save the preprocessed data.

Now our data is ready to use, and we move on to the next step, data exploration.

ii. Data Exploration

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.

Here we will explore the behavior of customers in the various stores. Our goal is to check how some measures such as promos and opening of new stores affect purchasing behavior.

Promotion distribution

  • There are 1,115 stores
  • 1,017,209 customer records
  • Promotions are distributed similarly between the train and test data
Distribution of promo of ‘train’ data
Distribution of promo of ‘test’ data

Promotion and sales

  • Promotion and sales are positively related: as promotions become more frequent, sales grow.
promotion and sales (0 indicates no promo and 1 indicates there is a promo)
The relationship between promotion and sales.
  • Therefore, we can say that sales rise with promotion.
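A comparison like this can be reproduced with a simple group-by. The snippet below is a minimal sketch on made-up numbers, not the actual Rossmann figures; it assumes the `Promo` and `Sales` columns described in the field list.

```python
import pandas as pd

# Toy rows standing in for the 'train' data (values are made up)
train_data = pd.DataFrame({
    "Promo": [0, 0, 1, 1, 0, 1],
    "Sales": [4000, 4400, 7000, 8200, 4200, 7600],
})

# Average daily sales on days without (0) and with (1) a promotion
sales_by_promo = train_data.groupby("Promo")["Sales"].mean()

# Pearson correlation between the promo flag and sales
promo_sales_corr = train_data["Promo"].corr(train_data["Sales"])

print(sales_by_promo)
print(promo_sales_corr)
```

Comparing group means shows the size of the difference, while the correlation coefficient only shows its direction and strength.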

You can also check other related factors, such as promo and customers or promo and month.

Check & compare sales behavior before, during and after holidays

  • Sales are higher before and after holidays.
Sales and Customers before Holiday seas
  • Although sales and customer numbers are low during holidays overall, public holidays show more customers and sales than the other holiday types.
Sales and customers d

Correlation Between Promotion and Sales

  • Promotion and sales are positively correlated: as promotions become more frequent, sales grow.
The relationship between Sales and Promotion
  • Similarly, customers and promotion are positively correlated: as promotions become more frequent, the number of customers also grows.

Correlation Between Promotion and Customers

  • The average customer increase across all stores due to promotion is 62.18%
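The post does not show how this percentage was computed. One plausible way, sketched below on made-up numbers, is to compare each store's average customer count on promo days with its average on non-promo days and then average the percentage change across stores. Treat the method as an assumption, not the author's exact calculation.

```python
import pandas as pd

# Toy rows standing in for the 'train' data (values are made up)
train_data = pd.DataFrame({
    "Store":     [1, 1, 1, 1, 2, 2, 2, 2],
    "Promo":     [0, 0, 1, 1, 0, 0, 1, 1],
    "Customers": [400, 600, 700, 800, 200, 200, 300, 300],
})

# Rows are stores; columns 0 and 1 hold mean customers without/with promo
mean_customers = (
    train_data.groupby(["Store", "Promo"])["Customers"].mean().unstack("Promo")
)

# Percentage change per store, then the average across stores
increase_pct = (mean_customers[1] - mean_customers[0]) / mean_customers[0] * 100
average_increase = increase_pct.mean()

print(increase_pct)
print(average_increase)
```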

The Top 10 Stores with High Promotion

  • These are the top 10 stores with high promotion and the largest number of customers.
The top 10 Stores with High promotion

Which Store should increase Promotion

  • Promotions have a vital effect on increasing sales for store types a, c, and d. Store type b, however, shows lower sales than the other types, so it should run more promotions (+ sign on the scatter plot).
sales and customers based on promos (scatter plot)

Trends of customer behavior during store open and closing times

  • On open days, the number of customers increases gradually, whereas on closed days there are, as expected, no customers on any weekday.
Behavior of customers during store open and close

Stores that are opened in all weekdays

  • Compared with the others, store type a is open on all weekdays, and its sales increase steadily.
Stores that are opened in all weekdays

From the plot, store type b has the fewest open weekdays, followed by store type c. So we can conclude that store types b and c need more attention to stay open on weekdays.

Effects of Assortment Type on Sales

  • An assortment is the collection of goods or services that a business provides to consumers: a = basic, b = extra, c = extended.
Effect of assortment type on sales
  • Extra (b-type) assortments produced the highest sales.
  • Extended (c-type) assortments were the second most popular.
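Assortment lives in the store data while Sales lives in the train data, so the two have to be joined on Store before they can be compared. The snippet below is a minimal sketch of that join and the per-assortment average, using made-up numbers chosen to mirror the ordering described above.

```python
import pandas as pd

# Toy rows standing in for the 'train' and 'store' data (values are made up)
train_data = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Sales": [5000, 5400, 9000, 9400, 7000, 7200],
})
store_data = pd.DataFrame({
    "Store": [1, 2, 3],
    "Assortment": ["a", "b", "c"],
})

# Attach each store's attributes to its daily sales rows
merged = train_data.merge(store_data, on="Store", how="left")

# Average sales per assortment level, highest first
avg_sales = merged.groupby("Assortment")["Sales"].mean().sort_values(ascending=False)
print(avg_sales)
```

A left join keeps every sales row even if a store were missing from the store table, which makes gaps visible instead of silently dropping records.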

Effects of Distance to a competitor on sales

  • Sales show only a minimal connection to the distance to the nearest competitor.
  • This suggests that competitor distance has little effect on sales.
Effect of distance to a competitor on sales

Correlation Analysis

  • As previously stated, there is a substantial positive association between a store’s sales and its number of customers. There is also a positive association between the number of customers and whether the store was running an offer (Promo equal to 1).
  • To see the correlation analysis, run the following command.
  • Afterward, you will get something like the following plot.
Plot, correlation analysis
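The command itself appeared as an image in the original post. A minimal sketch of the underlying computation, assuming pandas' `DataFrame.corr` on a toy frame with made-up values, is shown here; a plotting library's heatmap function could then render the resulting matrix.

```python
import pandas as pd

# Toy rows standing in for the merged data (values are made up)
train_data = pd.DataFrame({
    "Sales":     [4000, 7000, 4400, 8200, 4200, 7600],
    "Customers": [450, 700, 500, 820, 470, 760],
    "Promo":     [0, 1, 0, 1, 0, 1],
})

# Pairwise Pearson correlation between the numeric columns
corr = train_data[["Sales", "Customers", "Promo"]].corr()
print(corr.round(2))
```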

Woooh! Quite long, right? Never mind, you are done with EDA for now. We’ll cover the LSTM implementation and the basics of the deep learning model in the next blog. For now, let’s look at how LSTMs work.

4. Long Short-Term Memory (LSTM)

LSTMs are a special kind of RNN that can learn long-term dependencies; remembering information over long periods is their default behavior. Plain RNNs suffer from the vanishing gradient problem when they are asked to handle long-term dependencies. Consider the sentences “I have been staying in the Amhara region for the last 6 years. I can speak _________ fluently.” The predicted word depends on context from much earlier in the text: the model needs “Amhara” to fill in the blank, and the most suitable answer is “Amharic.” In other words, the gap between the relevant information and the point where it is needed can become very large. Vanishing and exploding gradients make plain RNNs hard to train across such gaps.
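To make the vanishing gradient concrete, here is a minimal sketch using a one-unit RNN, h_t = tanh(w * h_{t-1}), with no inputs. The scalar setup, the weight 0.9, and the starting state 0.5 are simplifying assumptions chosen for illustration. By the chain rule, the gradient of the last state with respect to the first is a product of one factor per time step, and each factor here is at most 0.9, so the product shrinks quickly.

```python
import math


def gradient_through_time(w: float, h0: float, steps: int) -> float:
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule factor: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return grad


grad_5 = gradient_through_time(w=0.9, h0=0.5, steps=5)
grad_50 = gradient_through_time(w=0.9, h0=0.5, steps=50)
print(grad_5, grad_50)
```

After 50 steps the gradient is a small fraction of what it is after 5, which is why information from many steps back barely influences learning in a plain RNN.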

LSTMs were introduced by Hochreiter & Schmidhuber [4] in 1997 to overcome this problem by explicitly adding a memory unit, called the cell, to the network. They work very well on many different problems and are still widely used. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell [4]. The cell state vector (memory cell) represents the memory of the LSTM; it changes through the forgetting of old memory (forget gate) and the addition of new memory (input gate).

The Forget Gate decides what information to throw away from the cell state (memory) and how much of the past information it should remember. It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 means “completely keep this,” while a 0 means “completely get rid of this.”

Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

The Input Gate (Update) controls what new information from the current input is added to the cell state, and how much of it. This has two parts. First, the input gate layer (a sigmoid layer) decides which values to update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. The two are then combined to create an update to the state.

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

To update the old cell state $C_{t-1}$ into the new cell state $C_t$, multiply the old state by $f_t$, which forgets the things it decided to forget earlier, and then add $i_t * \tilde{C}_t$. This second term is the new candidate values, scaled by how much it decided to update each state value.

Updating the old cell state: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

The Output Gate, on the other hand, conditionally decides what to output from the memory. First, it runs a sigmoid layer, which decides what parts of the cell state to output. Then it puts the cell state through tanh (to push the values between -1 and 1) and multiplies it by the output of the sigmoid gate, so that it outputs only the parts it decided to.

Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

$h_t = o_t * \tanh(C_t)$
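The gate equations translate almost line by line into code. The NumPy sketch below runs one LSTM cell over a short random sequence. The helper names (`sigmoid`, `lstm_step`), the layer sizes, and the random weights are assumptions for illustration only; a real model would learn the weights and would normally use a deep learning library's LSTM layer.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, input, cell update, output."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t


hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Untrained, randomly initialised weights (illustration only)
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h_t = np.zeros(hidden_size)
c_t = np.zeros(hidden_size)
sequence = rng.normal(size=(6, input_size))  # six time steps of made-up input
for x_t in sequence:
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)

print(h_t)
```

Because the cell state is updated by element-wise scaling and addition rather than being pushed through a squashing function at every step, information can persist across many time steps.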

I recommend you read this blog to understand LSTMs more.

5. Building Deep Learning models with sklearn pipelines (LSTM)

This post is already quite long, so we will build the deep learning model in the next post. Thank you for reading.

Image source from here

7. Reference

1. Forecasting Rossmann Store Leading 6-month Sales: https://cs229.stanford.edu/proj2015/192_report.pdf
2. Datasets: Rossmann Store Sales (kaggle.com)
3. Pharma sales data analysis and forecasting | Kaggle
4. Understanding LSTM Networks — colah’s blog
5. LSTM | Introduction to LSTM | Long Short Term Memory (analyticsvidhya.com)
6. https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f
7. What Is Deep Learning? | How It Works, Techniques & Applications — MATLAB & Simulink (mathworks.com)
