A real-world example of predicting sales volume with Random Forest Regression in a Jupyter Notebook
In this document, I will briefly show you one of the easiest ways to forecast your sales data with the Random-Forest-Regressor. At the end of this document, you’ll find a link to the Jupyter Notebook as an HTML file.
First, we need to import the following libraries:

Then we import the dataset. The file “predictionempty.xlsx” is a helper document that saves us time: instead of creating the results table in Python, I created it in Excel, which I find more efficient.

Our given dataset looks as follows:

We have a timestamp column and a “Sold Units” column. Our aim is to predict the sales volume for the next month, so 15-minute timestamps do not help much in this case. In the next step, we sum up the data to daily values:

To do so, we create a new column “Date” that contains only the date of each timestamp. Now our dataset looks as follows:

Then we group by “Date” and sum up the “Sold Units”:
df = df.groupby(['Date'])['Sold Units'].sum().reset_index()
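Both steps, deriving the date and summing per day, can be sketched on synthetic data as follows. The column name Timestamp, the frame names raw and daily, and all values are assumptions made for illustration; only “Date” and “Sold Units” come from the dataset described here.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute data for two days (made-up values: one unit per interval).
timestamps = pd.date_range("2019-01-01 00:00", periods=2 * 96, freq="15min")
raw = pd.DataFrame({
    "Timestamp": timestamps,  # assumed column name
    "Sold Units": np.ones(len(timestamps), dtype=int),
})

# Keep only the calendar date of each timestamp.
raw["Date"] = raw["Timestamp"].dt.date

# Sum the 15-minute values into one row per day.
daily = raw.groupby(["Date"])["Sold Units"].sum().reset_index()
print(daily)
```

Each day has 96 quarter-hour intervals, so every daily row in this toy example sums to 96.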
Our data frame df now looks as follows:

From the qualitative analysis of our specific dataset, we know that some days act as outliers and have a negative impact on the forecasted values. These are the national holidays, in our example the holidays in Germany. That’s why we first filter all holidays out of the given dataset, and we will not forecast future holidays either.
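One possible way to drop holidays is a simple date lookup. The sketch below is not necessarily how the notebook does it: the holiday list is a hand-written, illustrative subset of fixed-date German public holidays for a hypothetical year, and the frame names and sales values are made up.

```python
import pandas as pd

# Illustrative subset of fixed-date German public holidays (hypothetical year).
# A complete list would also need the movable holidays such as Easter Monday.
holidays = pd.to_datetime(
    ["2019-01-01", "2019-05-01", "2019-10-03", "2019-12-25", "2019-12-26"]
)

# Made-up daily sales around 1 May.
daily = pd.DataFrame({
    "Date": pd.date_range("2019-04-29", periods=5, freq="D"),
    "Sold Units": [120, 135, 20, 128, 131],
})

# Mark holiday rows and keep everything else.
is_holiday = pd.to_datetime(daily["Date"]).isin(holidays)
daily_no_holidays = daily[~is_holiday].reset_index(drop=True)
print(daily_no_holidays)
```

The same mask can be applied to the frame of future dates so that holidays are not forecast.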


As you can see, we have created another data frame called “test”. This is a way to save time: instead of building the data frame in Python, you can prepare it in Excel for your results. In our case, the “test” data frame looks as follows:

It is an empty data frame: it contains the dates of the “to be predicted” values and an empty column for the “to be predicted” sold units.
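If you prefer to build such a frame in Python rather than in Excel, a minimal sketch could look like this. The forecast month is a hypothetical choice, and the column layout simply mirrors the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical forecast horizon: one calendar month of daily dates.
future_dates = pd.date_range("2019-07-01", "2019-07-31", freq="D")

# "Sold Units" stays empty (NaN) until the model fills it in.
test = pd.DataFrame({"Date": future_dates, "Sold Units": np.nan})
print(test.head())
```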
To run the Random-Forest-Regressor, we need to extract more information from our dataset. So far, we have the “Date” column and the “Sold Units” column. Our target value will be “Sold Units”, but we still need to create features. Features for time-series data could be:
Year, Week, Day of Month, Day of Year, Type of the day (Monday, Tuesday, …), Holidays, Weekdays / Weekends, …

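We created the following features:
df['Year'] = pd.to_datetime(df['Date']).dt.year
# .dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
df['Week'] = pd.to_datetime(df['Date']).dt.isocalendar().week
df['Day'] = pd.to_datetime(df['Date']).dt.day
df['WeekDay'] = pd.to_datetime(df['Date']).dt.dayofweek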
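Because the same features are needed for both the historical data and the dates to be predicted, it can help to wrap them in a small function. The sketch below is an assumption on my side, not code from the notebook: the function name add_date_features and the extra columns DayOfYear and IsWeekend are hypothetical and only illustrate two more items from the feature list.

```python
import pandas as pd

def add_date_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with calendar features derived from 'Date'."""
    out = frame.copy()
    dates = pd.to_datetime(out["Date"])
    out["Year"] = dates.dt.year
    out["Week"] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    out["Day"] = dates.dt.day
    out["WeekDay"] = dates.dt.dayofweek                     # Monday=0 ... Sunday=6
    out["DayOfYear"] = dates.dt.dayofyear
    out["IsWeekend"] = (dates.dt.dayofweek >= 5).astype(int)
    return out

# Friday, Saturday and Sunday of an arbitrary example week.
example = pd.DataFrame({"Date": pd.date_range("2019-07-05", periods=3, freq="D")})
features = add_date_features(example)
print(features)
```

Calling the function once on df and once on test keeps the feature columns identical in both frames.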
Our data frame now looks as follows:

Before running the Random-Forest-Regressor, we should take another look at the “Sold Units” values. It is highly recommended to use Six-Sigma approaches here. In our case, we used a boxplot:
import seaborn as sns
sns.boxplot(x=df['Sold Units'])
or
import matplotlib.pyplot as plt
B = plt.boxplot(df['Sold Units'])
# y-coordinates of the lower and upper whisker lines
[item.get_ydata() for item in B['whiskers']]
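With matplotlib's default settings, the whiskers reach to the most extreme data points that still lie within 1.5 times the interquartile range of the box. The fences behind that rule can also be computed without plotting. The values below are made up, and the variable names are assumptions for this sketch.

```python
import numpy as np

# Made-up daily sales with one suspicious value.
sold_units = np.array([100, 105, 110, 115, 120, 125, 130, 400])

# Quartiles and interquartile range.
q1, q3 = np.percentile(sold_units, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as outliers in a default boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sold_units[(sold_units < lower_fence) | (sold_units > upper_fence)]
print(lower_fence, upper_fence, outliers)
```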
The boxplot looked like this:


We also plotted some weekly and yearly trends:
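One simple way to look at such trends is to average the daily values per weekday and per month. This is only a sketch on random synthetic data; the notebook's actual plots may be built differently, and the names weekly_profile and yearly_profile are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one year (random, made-up values).
rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
daily = pd.DataFrame({
    "Date": dates,
    "Sold Units": rng.integers(80, 160, size=len(dates)),
})

daily["WeekDay"] = daily["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
daily["Month"] = daily["Date"].dt.month

# Average sales per weekday (weekly pattern) and per month (yearly pattern).
weekly_profile = daily.groupby("WeekDay")["Sold Units"].mean()
yearly_profile = daily.groupby("Month")["Sold Units"].mean()
print(weekly_profile)
print(yearly_profile)
```

Both results are pandas Series, so calling .plot(kind="bar") on them draws the corresponding chart when matplotlib is installed.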



Running the Random-Forest-Regressor
Before running the algorithm, we should evaluate whether there are better algorithms for our given dataset.
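A common way to compare candidate models on time-ordered data is cross-validation with splits that always train on the past and validate on the future. The sketch below shows that idea with scikit-learn on synthetic data. The choice of candidates, the error metric and all hyperparameters are assumptions, and it is not necessarily the evaluation used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic daily data: weekdays sell more than weekends, plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=365, freq="D")
weekday = dates.dayofweek.to_numpy()
y = 100 + 10 * (weekday < 5) + rng.normal(0, 3, size=len(dates))
X = pd.DataFrame({
    "Day": dates.day.to_numpy(),
    "WeekDay": weekday,
    "DayOfYear": dates.dayofyear.to_numpy(),
})

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Each split trains on earlier days and validates on the days that follow.
cv = TimeSeriesSplit(n_splits=5)
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    scores[name] = -fold_scores.mean()  # mean absolute error, lower is better

print(scores)
```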


→ In our evaluation, the Random-Forest-Regressor turned out to be the best algorithm for our given dataset. So let’s run it and see how it performs.
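A minimal sketch of this step, assuming the four features created earlier and scikit-learn's RandomForestRegressor. The synthetic history, the forecast month, the helper add_features and the hyperparameters are assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["Year", "Week", "Day", "WeekDay"]

def add_features(frame):
    """Return a copy of frame with the calendar features derived from 'Date'."""
    out = frame.copy()
    dates = pd.to_datetime(out["Date"])
    out["Year"] = dates.dt.year
    out["Week"] = dates.dt.isocalendar().week.astype(int)
    out["Day"] = dates.dt.day
    out["WeekDay"] = dates.dt.dayofweek
    return out

# Synthetic daily history (made-up values): weekdays sell more than weekends.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Date": pd.date_range("2019-01-01", "2019-06-30", freq="D")})
df["Sold Units"] = 100 + 20 * (df["Date"].dt.dayofweek < 5) + rng.normal(0, 5, len(df))
df = add_features(df)

# Frame with the dates of the month to be predicted.
test = pd.DataFrame({"Date": pd.date_range("2019-07-01", "2019-07-31", freq="D")})
test = add_features(test)

# Fit on the history, then fill the forecast column.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(df[FEATURES], df["Sold Units"])
test["Sold Units"] = model.predict(test[FEATURES])
print(test[["Date", "Sold Units"]].head())
```

A random forest averages values it has seen during training, so its predictions always stay within the range of the historical “Sold Units”.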


→ The first results look very good for the given dataset. In the next step, we will use the sales data from 2015 onwards, which should lead to better results. We’ll keep you updated here.
To save the forecast, we use the empty dataset that we created earlier in Excel.


As you can see, the Random-Forest-Regressor performs very well when forecasting time-series data such as ours. In the next step, we will try XGBoost in combination with GridSearch to improve the forecast. We will also filter out some outliers and restrict the dataset to a homogeneous one. Further, we will try fbprophet on the same dataset.
Link to the .html file: https://drive.google.com/open?id=1pt0QslzeyJBunBfSZidSwB9WWnfQxdYp
