Data Science Project: Sales Forecasting with ARIMA Model

Cem ÖZÇELİK
4 min read · Feb 5, 2022

This article was written by Alparslan Mesri and Cem ÖZÇELİK.

In this study, we will build an ARIMA model in Python to forecast the future sales of a market.

Let’s import the libraries we will need:

After importing the libraries, we load the data set:

When we take a look at df, we see that the dates in the “Month” column are not in a regular date format. We need to clean this column up before we can work with it.

The following for loop tidies up the values in the “Month” column.
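As a minimal sketch of what such a loop could look like: the code below assumes the raw values have the form "1-01" (a year counter, a dash, then the month). The helper name parse_month and the base year 1900 are hypothetical choices made for illustration, not taken from the original code.

```python
from datetime import datetime


def parse_month(raw, base_year=1900):
    """Turn a value such as "1-01" into a datetime.

    Assumed layout: year counter, dash, month. The base year is a
    hypothetical choice, since the data does not state the real year.
    """
    year_offset, month = raw.strip().split("-")
    return datetime(base_year + int(year_offset), int(month), 1)


raw_months = ["1-01", "1-02", "3-12"]  # made-up sample values
cleaned = []
for raw in raw_months:
    cleaned.append(parse_month(raw))
```

With a pandas DataFrame, the same helper could be applied to every value of df["Month"] inside the loop.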

Our date data now looks as follows:

We renamed the sales column to “Sales” and then set the “Month” column, which holds the dates, as the index.

The data set now looks like this:

We create a distplot to see the distribution of sales:

As the chart above shows, monthly sales volumes look approximately normally distributed. Most months cluster around 200 units, but months with sales of around 400 and 600 units are also clearly present.

Statistical Test

We can run a statistical test with the code below to check whether the data is stationary. A time series is stationary when its statistical properties, such as the mean and the variance, stay constant over time.

Looking at the p-value is enough for now.
If p < 0.05, the data is stationary.
If p > 0.05, the data is not stationary.
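The text does not name the test. The p-value rule above matches the Augmented Dickey-Fuller (ADF) test, so the sketch below assumes that test and uses adfuller from statsmodels. The two helper names are hypothetical.

```python
def adf_p_value(series):
    """Return the p-value of the Augmented Dickey-Fuller test.

    Assumption: ADF is the test used in this study.
    """
    # Imported here so the decision helper below works without statsmodels.
    from statsmodels.tsa.stattools import adfuller

    result = adfuller(series)
    return result[1]  # the second element of the result tuple is the p-value


def is_stationary(p_value, alpha=0.05):
    """Apply the decision rule from the text: p below alpha means stationary."""
    return p_value < alpha
```

With the data frame from this study, the call would look like is_stationary(adf_p_value(df["Sales"])).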

Output:

We see that our data is not stationary. To make it stationary, we need to set the “d” parameter of the ARIMA model to 1, so that the model works on the differences between consecutive values.
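To make the effect of d = 1 concrete, here is a small sketch of first-order differencing in plain Python. The function name and the sample series are made up for illustration; ARIMA performs this step internally.

```python
def difference(values, d=1):
    """Apply first-order differencing d times: y'[t] = y[t] - y[t-1]."""
    result = list(values)
    for _ in range(d):
        result = [result[i] - result[i - 1] for i in range(1, len(result))]
    return result


trend = [10, 12, 14, 16, 18]   # made-up series with a steady upward trend
print(difference(trend, d=1))  # [2, 2, 2, 2]
```

The trending series becomes a constant one after a single round of differencing, which is the kind of series the rest of the model expects.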

ARIMA Model

An ARIMA model takes three parameters, in this order: p, d and q.

p: the number of past values (lags) of the series used to estimate the value at time t (the autoregressive order).
d: the number of times the series is differenced to make it stationary.
q: the number of past forecast errors that enter the estimate at time t as a moving average (the moving average order).

The auto_arima function can choose these parameters for us. It uses a score called AIC to judge how good a candidate model is, and it searches for the parameter combination with the lowest AIC.
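As a rough illustration of the selection rule only, not of auto_arima itself: AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The helper names and the AIC values below are made up; they are not results from the shampoo data.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_order(aic_by_order):
    """Return the (p, d, q) order with the smallest AIC."""
    return min(aic_by_order, key=aic_by_order.get)


# Toy AIC values, purely illustrative.
candidates = {(0, 1, 0): 30.0, (1, 1, 0): 25.0, (0, 1, 1): 27.5}
print(best_order(candidates))  # (1, 1, 0)
```

Because every extra parameter adds 2 to the score, a more complex model wins only if it raises the likelihood enough to pay for that penalty.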

Output:

The function returned (1,1,2) as the ARIMA parameters with the best score.

ARIMA(1,1,2) means that the response variable (Y) is differenced once and then modeled by combining a first-order autoregressive model with a second-order moving average model.
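Written out in standard textbook notation (the symbols are introduced here and do not appear in the original study), an ARIMA(1,1,2) model has the following form:

```latex
y'_t = y_t - y_{t-1}, \qquad
y'_t = c + \phi_1 \, y'_{t-1} + \theta_1 \, \varepsilon_{t-1} + \theta_2 \, \varepsilon_{t-2} + \varepsilon_t
```

Here y'_t is the differenced series (d = 1), \phi_1 is the single autoregressive coefficient (p = 1), \theta_1 and \theta_2 are the two moving average coefficients (q = 2), \varepsilon_t is the forecast error at time t, and c is an optional constant.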

Splitting the data set into train and test sets:
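A minimal sketch of a chronological split follows. The split ratio used in the study is not stated, so the default fraction here is only an assumption, and the function name is hypothetical. The key point is that a time series is split in order, not shuffled.

```python
def split_in_order(values, train_fraction=0.66):
    """Split a series chronologically: the first part trains, the rest tests.

    The default fraction is an assumed value, not taken from the study.
    """
    split_at = int(len(values) * train_fraction)
    return values[:split_at], values[split_at:]


series = list(range(12))  # stand-in for 12 monthly observations
train, test = split_in_order(series, train_fraction=0.75)
print(len(train), len(test))  # 9 3
```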

Fitting the ARIMA Model:
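The "Predicted" and "Expected" pairs in the output suggest a walk-forward evaluation, where the model is refitted after each new observation. That is an assumption, and the sketch below shows one way to do it with the ARIMA class from statsmodels. The function names are hypothetical.

```python
def arima_one_step(history, order=(1, 1, 2)):
    """Fit ARIMA on the history and forecast the next value (needs statsmodels)."""
    from statsmodels.tsa.arima.model import ARIMA

    fitted = ARIMA(history, order=order).fit()
    return float(fitted.forecast(steps=1)[0])


def walk_forward(train, test, forecast_next=arima_one_step):
    """Forecast each test point one step ahead, then reveal its true value."""
    history = list(train)
    predictions = []
    for expected in test:
        predicted = forecast_next(history)
        predictions.append(predicted)
        history.append(expected)  # the model sees the real value next round
        print(f"Predicted={predicted:.3f}, Expected={expected:.3f}")
    return predictions
```

Passing the forecaster in as an argument keeps the loop independent of the model, so the same loop can be checked against a naive baseline such as "repeat the last value".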

Output:

As we can see from the output, the differences between the values predicted by the ARIMA model (shown as “Predicted” in the output image) and the actual values (shown as “Expected”) are not overly large. To evaluate the performance of the model, we will now calculate the RMSE, the square root of the mean squared error. It is a common performance metric for regression models and measures how far the predicted values deviate from the actual values.

RMSE
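For reference, the metric can be computed in a few lines of plain Python. This is a sketch with made-up numbers; the function name is hypothetical, and a library routine could be used instead.

```python
import math


def rmse(expected, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: every prediction is off by exactly 10 units.
print(rmse([100, 200, 300], [110, 190, 310]))  # 10.0
```

The result is in the same unit as the sales figures, which makes it easy to compare against typical monthly volumes.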

We compared the actual values with the predicted values. The RMSE was 90.986. Now let’s plot the predicted and actual values. Actual values are shown with a blue line and predicted values with a red line.

Output:

In the output image, the red line shows the predictions of our model and the blue line shows the values that actually occurred. The predictions do not overlap exactly with the actual values. The important point, however, is that the upward and downward movements of the predicted and actual values follow a consistent pattern over time. This suggests that the model follows the overall trend rather than reproducing every individual point, which can be considered a good sign with respect to over-fitting.

We have come to the end of our work. See you in our next article :)

For the data set used in the study:

https://www.kaggle.com/dromosys/shampoo-sales/data
