Can a Machine Predict Sales?

Anwoy Panigrahi
Published in Analytics Vidhya
7 min read · Aug 24, 2020

GitHub profile and LinkedIn profile

AI and machine learning have become central to this data-driven world. Their applications extend to almost any field, from simple classification to predicting the stock market.

In this article we will see how machine learning can be used to predict next month's sales, why ensemble learning matters, and how to implement it.

— — — — — — — — — — — — — — — — — — — — — — — —

Overview

The Kaggle Predict Future Sales competition is about predicting next month's sales for each given pair of shop ID and item ID. The data set provided contains the daily sales of each shop and item pair, so before a model can predict monthly sales we need to aggregate the daily sales over each month. The predicted values should be clipped to the range [0, 20].

Metrics

RMSE (root mean squared error) is the metric we will use for this problem. It is defined as the square root of the mean of the squared differences between the predicted values and the actual values.
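As a small illustration of the definition above, RMSE can be computed with NumPy as follows. The function name `rmse` and the toy numbers are only for illustration; the clipping step reflects the [0, 20] range mentioned in the overview.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy predictions, clipped to [0, 20] before scoring.
preds = np.clip([-1.0, 3.0, 25.0], 0, 20)
score = rmse([0.0, 1.0, 20.0], preds)
```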

Exploring The Data :

The data provided consists of five data sets

1. sales_train.csv :- It consists of six attributes

— — — — date- The date of the sale.

— — — — date_block_num- A unique number assigned to each month in the data set, ranging from 0 to 33.

— — — — shop_id- Unique Id for each shop.

— — — — item_id- Unique Id for each item.

— — — — item_price- Price of each item.

— — — — item_cnt_day- Number of units of each item sold per day.

2. item_categories.csv:- It consists of two attributes

— — — — item_category_name- The name of each category of items.

— — — — item_category_id- Unique category id of items.

3. items.csv :- It contains three attributes

— — — — item_name- Name of each item.

— — — — item_id- Unique item id

— — — — item_category_id- Unique item category id.

4. shops.csv:- It contains two attributes

— — — — shop_name- Name of each shop

— — — — shop_id- Unique shop id

5. test.csv:- The predictions must follow this format, i.e. for each shop ID and item ID pair we need to predict next month's sales.

Loading the data set:
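A minimal sketch of loading the five files with pandas could look like this. The helper `load_data`, the `FILES` mapping, and the directory passed in are assumptions for illustration; the file names are assumed to match the Kaggle download.

```python
from pathlib import Path

import pandas as pd

# Assumed file names; adjust them if your download differs.
FILES = {
    "sales": "sales_train.csv",
    "items": "items.csv",
    "categories": "item_categories.csv",
    "shops": "shops.csv",
    "test": "test.csv",
}

def load_data(data_dir):
    """Read each CSV in data_dir into a dict of DataFrames keyed by short name."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in FILES.items()}
```

Calling `load_data("path/to/data")` would then give access to, for example, `data["sales"]` and `data["test"]`.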

Extracting information from data :

— — — — plotting items daily sale

It's clear from the graph that a few products have very high sales.

— — — — plotting daily sale of shops

Most of the shops have similar sales, except three that have much higher sales than the others.

— — — — plotting item price vs daily sale

It's clear from the plot that most daily sales involve very few units, i.e. nearly one.

— — — — plotting average sale of shops

The average sale of most of the shops is around one.

— — — — box plot of item_cnt_day and item_price

It's clear from the box plots of item_price and item_cnt_day that there are some outliers in the data set, which should be removed for better model performance.

Data preprocessing

For the preprocessing part we will do three things

— — — — removing the outliers

— — — — replacing the negative values

— — — — fixing the shop IDs, i.e. merging duplicate shop IDs.
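The three steps above can be sketched as one function. Everything specific here is an assumption rather than the article's actual choice: the thresholds `max_price` and `max_count`, the decision to replace negative prices with the median and negative counts with zero, and the `shop_id_map` of duplicate IDs are all hypothetical inputs that you would pick after looking at the box plots and the shop names.

```python
import pandas as pd

def preprocess(sales, max_price, max_count, shop_id_map):
    """Drop outliers, replace negative values, and merge duplicate shop IDs."""
    # 1. Remove outliers above the chosen (hypothetical) thresholds.
    keep = (sales["item_price"] < max_price) & (sales["item_cnt_day"] < max_count)
    cleaned = sales.loc[keep].copy()
    # 2. Replace negative prices with the median valid price (an assumption),
    #    and negative daily counts with zero (also an assumption).
    valid_median = cleaned.loc[cleaned["item_price"] > 0, "item_price"].median()
    cleaned.loc[cleaned["item_price"] < 0, "item_price"] = valid_median
    cleaned.loc[cleaned["item_cnt_day"] < 0, "item_cnt_day"] = 0
    # 3. Map each duplicate shop ID onto the ID it duplicates.
    cleaned["shop_id"] = cleaned["shop_id"].replace(shop_id_map)
    return cleaned.reset_index(drop=True)

# Toy data: one extreme price, one negative price, one negative count.
raw = pd.DataFrame({
    "shop_id": [0, 57, 10, 11],
    "item_price": [100.0, -1.0, 200.0, 999999.0],
    "item_cnt_day": [1.0, 2.0, -1.0, 1.0],
})
cleaned = preprocess(raw, max_price=100000, max_count=1000, shop_id_map={0: 57, 10: 11})
```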

Creating data set in desired format

We need to put the data set in the required format, i.e. for each month, each shop, and each item, the sum of item_cnt_day.
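A minimal sketch of this aggregation with pandas is shown below. The output column name `item_cnt_month` is an assumption, and clipping the monthly target to [0, 20] at this stage is one possible choice that mirrors the clipping of the predictions.

```python
import pandas as pd

def monthly_sales(sales):
    """Sum item_cnt_day per (month, shop, item) and clip the total to [0, 20]."""
    monthly = (
        sales.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
        .sum()
        .reset_index(name="item_cnt_month")
    )
    monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
    return monthly

# Toy daily sales for one shop and two items over two months.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [1, 1, 1, 1],
    "item_id": [5, 5, 6, 5],
    "item_cnt_day": [15, 10, 2, 3],
})
monthly = monthly_sales(daily)
```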

Feature engineering

It's the most important part of solving any machine learning problem. I tried some new features, and a few of them worked for me in this case study.

— — — — Using holidays as a feature (worked for me): Sales in a shop are likely to be higher on holidays than on normal days, so a month with more holidays is more likely to have high sales.

To include holidays as a feature, we first need to determine whether a given day is a holiday and then count the total number of holidays in each month.

To merge the holiday counts with the data set, we need to key them by date_block_num.
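One way to sketch this is shown below. The list of holiday dates is a hypothetical input (it could come from a calendar library or a hand-made list), and the function only counts holidays on days that appear in the sales data, which is a simplification.

```python
import pandas as pd

def holidays_per_month(sales, holiday_dates):
    """Count the holidays that fall in each date_block_num.

    sales must have a datetime 'date' column and a 'date_block_num' column.
    holiday_dates is a hypothetical iterable of holiday dates.
    """
    # One row per calendar day present in the data.
    days = sales[["date", "date_block_num"]].drop_duplicates().copy()
    holidays = pd.to_datetime(pd.Series(list(holiday_dates)))
    days["is_holiday"] = days["date"].isin(holidays)
    return (
        days.groupby("date_block_num")["is_holiday"]
        .sum()
        .reset_index(name="holidays")
    )

# Toy data with already-parsed dates.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-01", "2013-02-23", "2013-02-24"]
    ),
    "date_block_num": [0, 0, 0, 1, 1],
})
counts = holidays_per_month(toy, ["2013-01-01", "2013-01-02", "2013-02-23"])
# Attach the monthly holiday count to every row of that month.
with_holidays = toy.merge(counts, on="date_block_num", how="left")
```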

— — — — Total unique items in a shop as a feature (worked for me): The main idea is that sales are likely to be higher in a shop that stocks more distinct items. In the feature importance plot, unique_items has high importance.

To include this as a feature, we just need to count the distinct item_id values per shop over a month.
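A minimal sketch of this count with pandas, using the feature name unique_items from the feature importance plot (the function name is only illustrative):

```python
import pandas as pd

def unique_items_per_shop(sales):
    """Count distinct item_id values per shop and month."""
    return (
        sales.groupby(["date_block_num", "shop_id"])["item_id"]
        .nunique()
        .reset_index(name="unique_items")
    )

toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0, 1],
    "shop_id": [1, 1, 1, 2, 1],
    "item_id": [5, 5, 6, 5, 7],
})
item_counts = unique_items_per_shop(toy_sales)
```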

— — — — Total unique categories in a shop (didn't work for me): The idea was that if a shop sells items from a large number of categories in a month, its sales should be high in that month. This did not work for me, even though the feature has high importance in the feature importance plot.

To include the total unique categories, we just need to count the distinct item_category_id values per shop over a month.
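Since item_category_id lives in items.csv rather than in the sales data, a sketch of this feature needs a join first. The function name and the output column `unique_categories` are assumptions.

```python
import pandas as pd

def unique_categories_per_shop(sales, items):
    """Count distinct item categories sold per shop and month."""
    # Bring the category of each sold item into the sales rows.
    with_cat = sales.merge(
        items[["item_id", "item_category_id"]], on="item_id", how="left"
    )
    return (
        with_cat.groupby(["date_block_num", "shop_id"])["item_category_id"]
        .nunique()
        .reset_index(name="unique_categories")
    )

toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [1, 1, 2, 1],
    "item_id": [5, 6, 7, 5],
})
toy_items = pd.DataFrame({"item_id": [5, 6, 7], "item_category_id": [40, 40, 55]})
category_counts = unique_categories_per_shop(toy_sales, toy_items)
```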

— — — — Using name as a feature (didn't work for me): I tried using the item names by label encoding them, which didn't work.
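For reference, label encoding maps each distinct name to an integer code. A small sketch with scikit-learn's LabelEncoder on made-up item names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical item names, for illustration only.
item_names = ["Item B", "Item A", "Item B"]

encoder = LabelEncoder()
# Each distinct name gets one integer code, assigned in sorted order.
name_codes = encoder.fit_transform(item_names)
```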

Feature selection

For feature selection I used RFECV (recursive feature elimination with cross-validation). This is a very good technique for selecting the important features to use while training the model.

It recursively eliminates the least important features, retrains the model on the remaining ones, and uses cross-validation to decide how many features to keep. Doing this can improve the performance of the model.
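The following is a minimal sketch of RFECV on synthetic data, not the setup used in this case study. LinearRegression is used as a simple stand-in estimator, and the data, the number of folds, and the step size are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two columns carry signal in this synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

selector = RFECV(
    estimator=LinearRegression(),
    step=1,  # drop one feature per elimination round
    cv=3,    # 3-fold cross-validation to score each feature count
    scoring="neg_root_mean_squared_error",
)
selector.fit(X, y)

selected = selector.support_       # boolean mask of the kept features
X_selected = selector.transform(X)  # data restricted to the kept features
```

After fitting, `selector.n_features_` holds the number of features that cross-validation chose to keep.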

Modeling

— — — — Ensemble models :

The idea of this kind of ensemble (stacking) is that the data is used to train several models, called base models, and the outputs of these base models are given as input features to a final model (the meta-regressor), which predicts the final output. The benefit is that what the base models have learned is preserved: the meta-regressor is trained on their predictions.

sklearn Stacking-Regressor

In plain words, an ensemble can be thought of as asking several experts about a particular problem, since different experts will have different opinions, and then reaching a conclusion based on all of them.
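As a rough illustration of stacking with scikit-learn's StackingRegressor, the sketch below trains two base models and a meta-regressor on synthetic data. The choice of base models, the Ridge meta-regressor, and the data are assumptions made only for this example.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-regressor trained on base-model predictions
    cv=3,                     # folds used to build the meta-features
)
stack.fit(X[:240], y[:240])
stack_preds = stack.predict(X[240:])
```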

— — — — Custom ensemble :

For the modeling part I implemented a custom ensemble model that differs from sklearn's StackingRegressor. In StackingRegressor all the base models are trained on the same data points (the same training set), whereas in my custom ensemble we can pass a different sample of data points to each base learner and train a meta-regressor on top of them. I used XGBoost for both the base learners and the meta-regressor.

custom ensemble model

It consists of three functions. The first one samples the data points. I used sampling without replacement, i.e. a data point appears at most once in a sample.
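A sampling step of this kind could be sketched as follows. The function name, the sample size, and the seed are assumptions. Note that a row can still appear in several different samples; it just cannot appear twice within the same one.

```python
import numpy as np

def make_samples(n_rows, n_samples, sample_size, seed=0):
    """Return n_samples index arrays, each drawn without replacement."""
    rng = np.random.default_rng(seed)
    return [
        rng.choice(n_rows, size=sample_size, replace=False)
        for _ in range(n_samples)
    ]

samples = make_samples(n_rows=1000, n_samples=10, sample_size=600)
```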

The second function trains XGBoost base learners on the different samples and saves their outputs for use by the meta-regressor. Hyperparameter tuning of all the base models is done inside the function.

The last function builds the meta-regressor. Hyperparameter tuning of the meta-regressor is also done.
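The overall flow can be sketched as below. This is not the article's implementation: scikit-learn's GradientBoostingRegressor is swapped in for XGBoost so the example has no extra dependency, hyperparameter tuning is left out, and the way the data is split between base learners and meta-regressor is an assumption. All function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_base_learners(X, y, n_samples, sample_size, seed=0):
    """Train one base learner per sample drawn without replacement."""
    rng = np.random.default_rng(seed)
    learners = []
    for i in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        model = GradientBoostingRegressor(n_estimators=50, random_state=i)
        learners.append(model.fit(X[idx], y[idx]))
    return learners

def base_outputs(learners, X):
    """Stack base-learner predictions column-wise as meta-features."""
    return np.column_stack([m.predict(X) for m in learners])

def fit_meta_regressor(learners, X_meta, y_meta):
    """Train the meta-regressor on the base learners' predictions."""
    meta = GradientBoostingRegressor(n_estimators=50, random_state=0)
    return meta.fit(base_outputs(learners, X_meta), y_meta)

def predict_ensemble(learners, meta, X):
    """Predict with the full ensemble and clip to the [0, 20] range."""
    return np.clip(meta.predict(base_outputs(learners, X)), 0, 20)

# Synthetic data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = np.clip(5 + 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=600), 0, 20)

# Assumed split: rows 0-299 for base learners, 300-499 for the
# meta-regressor, and the remaining rows held out for evaluation.
learners = fit_base_learners(X[:300], y[:300], n_samples=10, sample_size=200)
meta = fit_meta_regressor(learners, X[300:500], y[300:500])
ensemble_preds = predict_ensemble(learners, meta, X[500:])
```

Training the meta-regressor on rows the base learners have not seen is one way to keep it from simply learning their training-set overfit.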

I created 10 samples for training the base models and trained a meta-regressor on top of them. The RMSE may change as the number of samples is increased.

Comparison Table:

Kaggle score:

Future Work:

Further improvement could come from using more features while training the model or from increasing the number of samples, i.e. the number of base learners.
