Rossmann Pharmaceutical Sales Prediction: a Deep Learning Approach
In my previous post, we looked at A/B testing with machine learning. Today, in this blog, we will see pharmaceutical sales prediction for the Rossmann drug store chain, using six weeks of users' data and deep learning models.
Table of Contents
1. Business Need
2. Introduction
3. Data Understanding
i. Data Preprocessing
ii. Data Exploration
4. Long Short-Term Memory (LSTM)
5. Building Deep Learning Models with sklearn Pipelines (LSTM)
6. Conclusion
7. Reference
1. Business Need
The finance team wants to forecast sales in all their stores across several cities six weeks ahead of time. Managers in individual stores rely on their years of experience as well as their personal judgment to forecast sales.
The data team identified factors such as promotions, competition, school and state holidays, seasonality, and locality as necessary for predicting the sales across the various stores.
- Our job is then to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.
2. Introduction
Before diving in, it is worth learning a little about Rossmann.
Rossmann is a nationwide chain of German drug stores. The company's founder, Dirk Rossmann, opened his first store in Hanover in 1972 and became a pioneer of the self-service drug store concept. Rossmann is one of the largest drug store chains in Europe, with around 56,200 employees and more than 4,000 stores. If you are interested in reading more about the Rossmann drug store chain, click here
3. Data Understanding
The most time-consuming aspect of any data science project is the transformation of data to a format that an analyst can use to build models. This is more critical for parametric models, which assume known distributions in the data. However, even before you begin to transform the data, you need to understand it.
The objectives of data understanding are:
- Understand the attributes of the data.
- Summarize the data by identifying key characteristics, such as data volume and total number of variables in the data.
- Understand missing values, inaccuracies, and outliers.
- Visualize the data to validate the key characteristics of the data or unearth problems with the summary statistics.
Hence, in this post, we will walk through all of these basic data understanding steps.
The data for this blog can be found in the Kaggle Rossmann Store Sales competition. After you download the data, you will see that it has the following fields.
1. Id — an Id that represents a (Store, Date) duple within the test set
2. Store — a unique Id for each store
3. Sales — the turnover for any given day (this is what you are predicting)
4. Customers — the number of customers on a given day
5. Open — an indicator for whether the store was open: 0 = closed, 1 = open
6. StateHoliday — indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
7. SchoolHoliday — indicates if the (Store, Date) was affected by the closure of public schools
8. StoreType — differentiates between 4 different store models: a, b, c, d
9. Assortment — describes an assortment level: a = basic, b = extra, c = extended. Read more about assortment here
10. CompetitionDistance — distance in meters to the nearest competitor store
11. CompetitionOpenSince [Month/Year] — gives the approximate year and month of the time the nearest competitor was opened
12. Promo — indicates whether a store is running a promo on that day
13. Promo2 — Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
14. Promo2Since [Year/Week] — describes the year and calendar week when the store started participating in Promo2
15. PromoInterval — describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g., “Feb, May, Aug, Nov” means each round starts in February, May, August, November of any given year for that store.
Now that we have the data, before using it directly and starting data exploration and visualization, we need to prepare the raw data.
https://github.com/Amdework21/Rossmann-Pharmaceutical-Sales-prediction.git
i. Data Preprocessing
Data preprocessing is the process of turning raw data into a clean data set. The dataset is preprocessed to check for missing values, noisy data, outliers, and other inconsistencies before feeding it to the algorithm.
Once you extract the downloaded data, you will get four separate files as follows.
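To make the loading step concrete, here is a minimal sketch. The file names are the standard ones from the Kaggle download; the tiny stand-in CSVs written at the top are only so the snippet runs anywhere, and should be replaced by the real extracted files:

```python
import pandas as pd

# Tiny stand-in CSVs so this sketch runs anywhere; in practice, point
# read_csv at the four files extracted from the Kaggle download.
pd.DataFrame({"Store": [1, 2], "Sales": [5263, 6064]}).to_csv("train.csv", index=False)
pd.DataFrame({"Store": [1, 2], "Open": [1.0, None]}).to_csv("test.csv", index=False)
pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]}).to_csv("store.csv", index=False)
pd.DataFrame({"Id": [1, 2], "Sales": [0, 0]}).to_csv("sample_submission.csv", index=False)

# Load each file into its own DataFrame
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
store = pd.read_csv("store.csv")
sample_submission = pd.read_csv("sample_submission.csv")
print(train.shape, test.shape, store.shape, sample_submission.shape)
```

From here on, train, test, and store are the frames we clean and explore.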
Before preprocessing, let's look at the structure of the raw data:
To identify whether the data has missing values, we use the colums_WithMissingValue() Python function from the scripts folder.
def colums_WithMissingValue(self):
    # Collect the name of every column that contains at least one NaN
    missing = []
    has_missing = self.df.isnull().any()
    for col, flag in has_missing.items():
        if flag:
            missing.append(col)
    self.logger.info(f"Columns with missing values: {missing}")
    return missing
Then it will list the attributes that have missing values.
We have to do this step for all the data files (train, test, and sample_submission).
The next step is to fix the missing values.
As you can see from the screenshot above, the Open attribute in the test data had a few missing values (about 0.03%). Now it is fixed.
However, in the store data, we found more missing values.
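A quick way to quantify this is the per-column missing-value percentage. A sketch on a toy frame (the column names come from store.csv; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the store data; column names from store.csv, values invented
store_data = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, np.nan],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
    "StoreType": ["c", "a", "a", "d"],
})

# isnull().mean() gives the fraction of NaNs per column; scale to percent
missing_pct = store_data.isnull().mean() * 100
print(missing_pct)
```

Columns whose percentage exceeds the 30% threshold are the ones we fill with 0 below.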
The other important step is to look for outliers in the data and prepare it for further analysis by filling missing values with the median for numerical variables, the mode for categorical variables, and 0 for fields with more than 30% missing values. This is done as follows.
store_data['CompetitionDistance'].fillna(store_data['CompetitionDistance'].median(), inplace = True)
store_data.Promo2SinceWeek.fillna(0,inplace=True)
store_data.Promo2SinceYear.fillna(0,inplace=True)
store_data.PromoInterval.fillna(0,inplace=True)
store_data.CompetitionOpenSinceMonth.fillna(0, inplace = True)
store_data.CompetitionOpenSinceYear.fillna(0,inplace=True)
One more thing we have to do is convert the 'Date' column to datetime so that we can split it into year, month, and day for time-based explorations.
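The conversion can be sketched as follows (toy rows; the Date and Sales column names match the dataset):

```python
import pandas as pd

# Toy rows; Date and Sales are real column names from train.csv
train_data = pd.DataFrame({"Date": ["2015-07-31", "2015-07-30"], "Sales": [5263, 6064]})

# Convert the string column to datetime, then split out the parts
train_data["Date"] = pd.to_datetime(train_data["Date"])
train_data["Year"] = train_data["Date"].dt.year
train_data["Month"] = train_data["Date"].dt.month
train_data["Day"] = train_data["Date"].dt.day
print(train_data[["Year", "Month", "Day"]])
```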
We are done cleaning and preprocessing our data. Now we can check, as follows, whether anything was missed.
Check all the data files similarly. All are okay.
If you want to see the shapes of the datasets, you can check the shape attribute of each DataFrame in your Jupyter notebook.
Finally, we need to save the preprocessed data.
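Both steps, checking the shape and saving, can be sketched like this (toy frame; the output file name is my own choice, not taken from the repo):

```python
import pandas as pd

# Toy stand-in for the cleaned training data
train_data = pd.DataFrame({"Store": [1, 2, 3], "Sales": [5263, 6064, 8314]})

# shape reports (rows, columns)
print(train_data.shape)

# Persist the cleaned frame so later notebooks can pick it up
train_data.to_csv("train_preprocessed.csv", index=False)
```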
Now our data is ready to use, and we can move on to the next step, Data Exploration.
ii. Data Exploration
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. Using interactive dashboards and point-and-click data exploration, users can better understand the bigger picture and get to insights faster.
Here we will explore the behavior of customers in the various stores. Our goal is to check how some measures such as promos and opening of new stores affect purchasing behavior.
Promotion distribution
- There are 1115 stores
- 1017209 customer records
- The promotion is distributed similarly between the groups
Promotion and sales
- Promotion and sales have a direct relationship: as the frequency of promotions grows, the amount of sales grows.
- Therefore, we can say that promo and sales are directly proportional
You can also check all the other related factors, such as promo and customers, or promo and month.
Check & compare sales behavior before, during and after holidays
- There are more sales before and after holidays
- Even though there are fewer sales and customers during holidays, a greater number of customers and sales was observed during public holidays specifically.
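This comparison can be reproduced with a simple groupby. The numbers below are invented; only the StateHoliday codes come from the dataset:

```python
import pandas as pd

# Invented sales records; only the StateHoliday codes come from the dataset
# ("0" = none, "a" = public holiday, "b" = Easter, "c" = Christmas)
df = pd.DataFrame({
    "StateHoliday": ["0", "0", "a", "b", "c", "0"],
    "Sales": [5000, 6200, 3100, 2400, 2800, 5800],
})

# Average sales on holidays vs. ordinary days
is_holiday = df["StateHoliday"] != "0"
avg = df.groupby(is_holiday)["Sales"].mean()
print(avg)
```

The same pattern, grouped on days relative to a holiday instead, gives the before/during/after comparison.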
Correlation Between Promotion and Sales
- Promotion and sales have a direct relationship: as the frequency of promotions grows, sales grow.
- Similarly, customers and promotion have a direct relationship: as the frequency of promotions grows, the number of customers also grows.
Correlation Between Promotion and Customers
- The average customer increase across all stores due to promotion is 62.18%
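An uplift figure like this can be computed by comparing mean customers with and without a promotion. A sketch on invented numbers (the 62.18% above comes from the real data, not from this toy frame):

```python
import pandas as pd

# Invented records; Promo and Customers are real column names from train.csv
df = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Customers": [500, 620, 580, 900, 980, 860],
})

# Mean customers per Promo group, then the percentage increase
mean_by_promo = df.groupby("Promo")["Customers"].mean()
uplift = (mean_by_promo[1] - mean_by_promo[0]) / mean_by_promo[0] * 100
print(f"Average customer increase due to promotion: {uplift:.2f}%")
```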
The Top 10 Stores with High Promotion
- These are the top 10 stores with high promotion and larger numbers of customers.
Which Store should increase Promotion
- Promotions have a vital effect on increasing sales for store types a, c, and d. However, store type b shows lower sales than the other stores; hence it should run more promotions (the + sign on the scatter plot).
Trends of customer behavior during store open and closing times
- On open days, the number of customers increases gradually, whereas on closed days there are clearly no customers throughout the weekdays.
Stores that are open on all weekdays
- Compared to the others, store type a is open on all weekdays, and its sales are constantly increasing.
From this, store type b shows the smallest number of open weekdays, followed by store type c. So we can conclude that store types b and c need more attention to stay open on weekdays.
Effects of Assortment Type on Sales
- An Assortment is a collection of goods or services that a business provides to a consumer. a=basic, b=extra, c=extended.
- Extra (b-type) assortments resulted in the largest sales.
- Extended (c-type) assortments were the second most sold.
Effects of Distance to a competitor on sales
- The connection between competitor distance and sales is minimal.
- This suggests that the distance to the nearest competitor has little effect on sales.
Correlation Analysis
- As Previously stated, there is a substantial positive association between a store’s sales and its customers. There is also a positive association between the number of customers and the fact that the store was running an offer (Promo equal to 1).
- Hence, to see the correlation analysis, you have to compute the correlation matrix of the data.
- Afterward, you will get something like the following plot
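A minimal version of that computation, using pandas only (the toy values are invented; the seaborn heatmap call for the plot is left as a comment):

```python
import pandas as pd

# Invented values for the three columns of interest
df = pd.DataFrame({
    "Sales": [5263, 6064, 8314, 0, 4822],
    "Customers": [555, 625, 821, 0, 559],
    "Promo": [1, 1, 1, 0, 0],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr)

# For the heatmap (assuming seaborn and matplotlib are installed):
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```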
Woooh! Quite long, right? Never mind, you are done with the EDA for now. We'll see the implementation of the LSTM and the basics of the deep learning model in the next blog. For now, let's go over some LSTM fundamentals.
4. Long Short-Term Memory (LSTM)
LSTMs are a special kind of RNN, capable of learning long-term dependencies by remembering information for long periods; that is their default behavior. Plain RNNs suffer from the vanishing gradient problem when asked to handle long-term dependencies. For example, take the sentence "I have been staying in the Amhara region for the last 6 years. I can speak _________ fluently." The word the model predicts depends on the previous few words of context. Here it needs the context of Amhara to predict the missing word, and the most suitable answer is "Amharic." In other words, the gap between the relevant information and the point where it is needed can become very large, and vanishing and exploding gradients make plain RNNs unusable in such cases.
LSTMs were introduced by Hochreiter & Schmidhuber [4] in 1997 to overcome this problem by explicitly adding a memory unit, called the cell, to the network. They work very well on many different problems and are still widely used. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell [4]. The cell state vector (memory cell) represents the memory of the LSTM; it changes through the forgetting of old memory (forget gate) and the addition of new memory (input gate). The forget gate controls (decides) what information to throw away from the cell state (memory) and how much of the past information to remember. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}: a 1 means "completely keep this," while a 0 means "completely get rid of this."
Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
The input gate (update gate) controls what new information is added to the cell state from the current input and decides how much of it is added to the current state. This has two parts: first, the input gate layer (a sigmoid layer) decides which values to update; then a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. These two are combined to create an update to the state.
Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate values: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Updating the old cell state C_{t-1} into the new cell state C_t is done by multiplying the old state by f_t (forgetting the things it decided to forget earlier) and then adding i_t * C̃_t, the new candidate values scaled by how much it decided to update each state value.
Cell state update: C_t = f_t * C_{t-1} + i_t * C̃_t
The output gate, on the other hand, conditionally decides what to output from the memory. First, it runs a sigmoid layer, which decides which parts of the cell state to output. Then it puts the cell state through tanh (to push the values between -1 and 1) and multiplies it by the output of the sigmoid gate, so that it only outputs the parts it decided to.
Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden state: h_t = o_t * tanh(C_t)
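The four gate equations above can be collected into a single NumPy step. This is a toy illustration with random weights, not a trained model; the function name lstm_step and the weight layout are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.
    W maps each gate name (f, i, c, o) to a weight matrix over [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Tiny demo with random weights: hidden size 3, input size 2
rng = np.random.default_rng(0)
H, X = 3, 2
W = {g: rng.standard_normal((H, H + X)) for g in "fico"}
b = {g: np.zeros(H) for g in "fico"}
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)
```

In practice you would use a framework's LSTM layer rather than writing the step by hand; the sketch is only to connect the equations to code.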
I recommend you read this blog to understand LSTMs more.
5. Building Deep Learning Models with sklearn Pipelines (LSTM)
I think this blog is already a bit long, so it is better to see the deep learning model in the next post. Thank you for reading!
7. Reference
1. Forecasting Rossmann Store Leading 6-month Sales: https://cs229.stanford.edu/proj2015/192_report.pdf
2. Datasets: Rossmann Store Sales (kaggle.com)
3. Pharma sales data analysis and forecasting | Kaggle
4. Understanding LSTM Networks — colah’s blog
5. LSTM | Introduction to LSTM | Long Short Term Memory (analyticsvidhya.com)
6. https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f
7. What Is Deep Learning? | How It Works, Techniques & Applications — MATLAB & Simulink (mathworks.com)