A real-world example of predicting sales volume with Random Forest Regression in a Jupyter Notebook

Ömer Faruk Aslantas
Nov 1 · 5 min read

In this document, I will briefly show you one of the easiest ways of forecasting your sales data with the RandomForestRegressor. At the end of this document, you'll find a link to the Jupyter Notebook as HTML.

First, we need to import the following libraries:
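(The import cell was shown as an image in the original post. Based on the libraries used later in the article, a typical set would look like this:)

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
```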

Then we import the dataset. The “predictionempty.xlsx” file is a time-saver: instead of creating the results table in Python, I built it in Excel, which I find much more efficient.
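(The loading cell was also an image. A minimal sketch of the step, assuming the raw-data file name and a “Timestamp” column name, neither of which is given in the article; a tiny synthetic frame stands in for the real data so the example runs on its own:)

```python
import pandas as pd

# In the original notebook the data comes from Excel, e.g.:
# df = pd.read_excel("dataset.xlsx")              # raw 15-minute sales data (file name assumed)
# test = pd.read_excel("predictionempty.xlsx")    # empty results table from the article

# Synthetic stand-in with the same two columns:
df = pd.DataFrame({
    "Timestamp": pd.date_range("2019-01-01", periods=8, freq="15min"),
    "Sold Units": [5, 3, 0, 7, 2, 4, 1, 6],
})
print(df.head())
```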

Our given Dataset looks as follows:

We have a timestamp column and a Sold Units column. Our aim is to predict the sales volume for the next month, so 15-minute timestamps are not much help in this case. We should aggregate the data to daily values in the next step:

Therefore we create a new column “Date” which contains the calendar date. Now our dataset looks as follows:
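(This step was shown as a screenshot; a minimal sketch, assuming the raw timestamp column is called “Timestamp”:)

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 00:15", "2019-01-02 09:30"]),
    "Sold Units": [5, 3, 7],
})

# Derive a calendar-date column from the 15-minute timestamps
df["Date"] = df["Timestamp"].dt.date
print(df)
```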

Then we group by “Date” and sum up the “Sold Units” with df = df.groupby(['Date'])['Sold Units'].sum().reset_index()

Our data frame df now looks as follows:

As we know from the qualitative analysis, our specific dataset contains days that can be marked as outliers and that have a bad impact on the forecasted values: the national holidays, in our example the holidays in Germany. That's why we first filter all holidays out of the given dataset, and we will not forecast future holidays either.
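(The filtering cell was a screenshot. A sketch of the idea, using a hand-written, illustrative subset of German national holidays; the `holidays` PyPI package could generate such a list automatically:)

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-02", "2019-10-03", "2019-10-04"]),
    "Sold Units": [9500, 120, 9800],
})

# Illustrative subset of German national holidays (Oct 3 = Day of German Unity)
german_holidays = pd.to_datetime(["2019-10-03", "2019-12-25", "2019-12-26"])

# Keep only non-holiday rows
df = df[~df["Date"].isin(german_holidays)].reset_index(drop=True)
print(df)
```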

As you can see, we have created another data frame called “test”. Instead of building it in Python, you can create such a results table in Excel to save time. In our case, the test data frame looks as follows:

It is an empty data frame: one column contains the dates of the values to be predicted, and another, still empty column will hold the predicted Sold Units.

To run the Random-Forest-Regressor, we need to extract more information from our given dataset. So far, we have timestamps in the “Date” column and the “Sold Units” column. Our target value will be “Sold Units”, but we need to create features. Features for time-series data could be:

Year, Week, Day of Month, Day of Year, Type of the day (Monday, Tuesday,…), Holidays, Weekdays / Weekends,…

We created the following features:

df['Date'] = pd.to_datetime(df['Date'])  # convert once instead of four times
df['Year'] = df['Date'].dt.year
df['Week'] = df['Date'].dt.isocalendar().week  # .dt.week is deprecated in recent pandas
df['Day'] = df['Date'].dt.day
df['WeekDay'] = df['Date'].dt.dayofweek

Our data frame now looks as follows:

Before running the Random-Forest-Regressor, we should take another look at the “Sold Units” values. It is highly recommended to use Six Sigma approaches here. In our case, we used a boxplot:

import seaborn as sns
sns.boxplot(x=df['Sold Units'])

or

import matplotlib.pyplot as plt
B = plt.boxplot(df['Sold Units'])
[item.get_ydata() for item in B['whiskers']]  # whisker end points mark the outlier thresholds

And it looked like this:

As you can see, anything above 21357 is an outlier in this specific dataset.
You can also use the second approach; you'll get the same results.

And we have plotted some weekly and yearly trends:

As you can see, 2013 and 2014 are higher than the other years. Strictly, we should filter them out, but we won't do that here.
Next, we remove the outliers: as we know from the boxplots, anything above 21357 or below 681 is a statistical outlier.
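(The removal step was shown as a screenshot. A minimal sketch using the whisker bounds from the boxplot, on a tiny stand-in frame:)

```python
import pandas as pd

df = pd.DataFrame({"Sold Units": [500, 681, 9000, 21357, 25000]})

# Whisker bounds taken from the boxplot in the article
lower, upper = 681, 21357
df = df[(df["Sold Units"] >= lower) & (df["Sold Units"] <= upper)]
print(df)
```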

Running the Random-Forest-Regressor

Before running the algorithm, we should evaluate whether there are better algorithms for our given dataset.

We import the libraries, set the target value, drop the columns we don't need, and define a method to compare different algorithms.
As you can see: the Random-Forest-Regressor gives the best R²-score in our example.
As you can see: Random-Forest-Regressor will give the best R²-score in our example.
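(The comparison cell was a screenshot. One common way to run such a comparison is cross-validated R² over a few candidate regressors; the synthetic features below are stand-ins for Year/Week/Day/WeekDay, and the model choices are assumptions rather than the article's exact list:)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 7, size=(300, 4)).astype(float)   # stand-in features
y = 1000 + 200 * X[:, 3] + rng.normal(0, 50, 300)     # synthetic "Sold Units"

models = {
    "Linear": LinearRegression(),
    "Tree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>12}: mean R2 = {scores.mean():.3f}")
```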

→ As evaluated, the Random-Forest-Regressor is the best algorithm for our given dataset. So let's run it and see how it performs.

You can change the parameters. In our case, we tried a few combinations, and this one gave the best results. You can also use GridSearch → this will give slightly better results. In the next publication, I will show how to use it.
We get an R²-score of 0.877 and a Mean Percentage Error of -3.566%
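(The training cell itself was an image. A minimal sketch of fitting and scoring a RandomForestRegressor; the hyperparameters and the synthetic data are illustrative, not the article's tuned values, so the scores here will differ from the 0.877 reported above:)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.integers(0, 7, size=(400, 4)).astype(float)   # Year/Week/Day/WeekDay stand-ins
y = 1000 + 300 * X[:, 3] + rng.normal(0, 50, 400)     # synthetic "Sold Units"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative hyperparameters, not the tuned values from the article
model = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("R2:", round(r2_score(y_test, pred), 3))
# Mean percentage error as reported in the article (sign shows over-/under-forecasting)
mpe = np.mean((y_test - pred) / y_test) * 100
print("MPE: %.3f%%" % mpe)
```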

→ the first results are awesome for the given dataset. In the next step, we will use the sales data from 2015 onwards and this will lead to better results. We’ll keep you updated here.

To save the data, we use the empty data frame we created earlier in Excel.
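(A sketch of this final step, with a minimal stand-in for the “test” frame and hypothetical prediction values; writing the Excel file requires the openpyxl package:)

```python
import pandas as pd

# Minimal stand-in for the empty results table from "predictionempty.xlsx"
test = pd.DataFrame({
    "Date": pd.to_datetime(["2019-11-01", "2019-11-02"]),
    "Sold Units": [float("nan")] * 2,
})

predictions = [10234.5, 9876.1]   # illustrative model output
test["Sold Units"] = predictions

# Write the filled table back to Excel (needs openpyxl installed):
# test.to_excel("prediction_filled.xlsx", index=False)
print(test)
```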

As you can see, the Random-Forest-Regressor is very strong in forecasting time-series data. In the next step, we will try using XGBoost in combination with GridSearch to boost our algorithm. We will also filter out some outliers and restrict the dataset to a homogeneous one. Further, we will also try fbprophet on the same dataset.

Link to the .html-File: https://drive.google.com/open?id=1pt0QslzeyJBunBfSZidSwB9WWnfQxdYp

Written by Ömer Faruk Aslantas
M. Sc. Business Management and Engineering | working as a Forecasting Specialist. https://www.linkedin.com/in/%C3%B6mer-faruk-aslantas-03a37016b/
