Using ARIMA in Python to Forecast Olist Revenue

Caio Casagrande
9 min read · Jan 5, 2024


Data science empowers businesses with statistical models and algorithms, providing valuable insights that guide strategic decisions. For Olist, an e-commerce giant that made its data publicly available on Kaggle, forecasting revenue is key to financial planning: it helps the company mitigate risks, optimize resources, and time its investments, with a lasting positive impact on its financial health.

About the Business Problem

This is a fictitious data science project.

Olist has encountered a challenge in managing its revenue due to unforeseen fluctuations in sales. The company expected to be selling a lot more by this time of year, and external factors have made it difficult to anticipate its financial situation.

Objectives

  • To address this issue, the data manager suggested using the univariate time series methodology ARIMA (Autoregressive Integrated Moving Average) to forecast revenue for the following two weeks.
  • The primary goal is to provide Olist with robust forecasting models capable of predicting revenue for the next 14 days.
  • By gaining insights into future sales, Olist can proactively make strategic financial adjustments to optimize its operations.

This data science project explores the use of ARIMA, a key forecasting tool, to predict Olist’s revenue. Briefly, the AR (autoregressive) component models the relationship between the current observation and its past values, while the MA (moving average) component captures the dependency between the current observation and the residual errors of past observations.

Exploring the Time Series

A time series is the result of observations made sequentially over time. Its most important characteristic is that adjacent observations are dependent on each other and, therefore, factors that influenced the series in the past and act in the present will continue to influence it.

The first step is taking a look at all the data. Figure 1 is a line plot of Olist’s daily revenue from February 2017 to mid-August 2018. For the forecasting, the dataset will be split into train and test sets, where data up to July is used to forecast revenue for the days in August.

Figure 1: Daily revenue

Right away, what catches the most attention in Figure 1 is the great spike in revenue on November 24th, the Black Friday of 2017, a famous sales date in commerce. We can also notice that, in general, 2018 revenues are higher than 2017’s.

Beyond this, exploring how the data behaves at different frequencies and time scales may uncover new patterns:

Figure 2: Time series analysis

By reducing the frequency to monthly and bimonthly, we can clearly see the increase in revenue over time, which is blurry in Figure 1.

Another way to analyze the data is the rolling mean, which acts as a smoothing filter, reducing the impact of short-term fluctuations. Smoothing the time series with a 10-day rolling mean partially removes the seasonal patterns in the data:

Figure 3: the 10-day rolling mean

After visualizing all the data, it is split into train and test sets to introduce the forecasting. The last 14 days (2 weeks) are held out as the data to be forecasted.

df_train = df_revenue['2017-02-01':'2018-07-31']
df_test = df_revenue['2018-08-01':'2018-08-14']
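As a side note, label-based date slicing in pandas is inclusive at both endpoints, which is why the test slice above covers exactly 14 days. A toy illustration (the dates mirror the split above, but the series itself is a placeholder, not the Olist data):

```python
import pandas as pd

# Toy daily series spanning the train/test boundary
idx = pd.date_range("2018-07-25", periods=21, freq="D")
s = pd.Series(range(21), index=idx)

# Date slicing with labels includes BOTH endpoints,
# so '2018-08-01':'2018-08-14' yields exactly 14 daily observations
test_slice = s["2018-08-01":"2018-08-14"]
print(len(test_slice))  # 14
```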

A common step is taking a look at the main descriptive statistics of the data:

df_train.describe().T

  • There are 546 observations in the dataset
  • The mean is $23,013.06
  • The standard deviation is $11,789.64 (almost half of the mean)
  • The minimum is $3,648.10
  • The median is $21,222.93 (the median is slightly lower than the mean, so the data is right-skewed)
  • The 75th percentile is $29,598.17
  • The maximum is $152,653.74 (more than 5 times the 75th percentile)

Decomposing the Time Series

A time series is commonly decomposed into three main components: trend, seasonality, and residuals (error):

  • Trend: represents the long-term movement of a time series
  • Seasonality: represents short-term, periodic and repetitive fluctuations occurring at fixed intervals in the time series
  • Residuals: the residuals (errors or noise) represent the random fluctuations in a time series — what is not accounted for by the trend or seasonality.
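In the additive model used below, the observed series is simply the sum of these three components. A synthetic sketch (all values made up) makes this concrete:

```python
import numpy as np

# Synthetic daily series built from the three additive components:
# a linear trend, a weekly seasonal cycle, and random noise
rng = np.random.default_rng(42)
t = np.arange(120)

trend = 20_000 + 30 * t                       # long-term upward movement
seasonal = 2_000 * np.sin(2 * np.pi * t / 7)  # repeating weekly pattern
residual = rng.normal(0, 500, size=t.size)    # random fluctuations

# Additive model: the observed series is the sum of the components
y = trend + seasonal + residual

# The decomposition is exact by construction
print(np.allclose(y, trend + seasonal + residual))  # True
```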

The seasonal_decompose function computes these components, which can then be plotted:

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into trend, seasonal and residual components, then plot them
decompose_result = seasonal_decompose(df_train, model='additive')
decompose_result.plot()

Figure 4: Seasonal decomposition of the data

The Trend is very similar to the rolling mean graph, a smoothed series. At the same time, we can see that the Seasonal component follows a high-frequency, weekly pattern. Finally, the Residual of the time series remains very close to zero.

Verifying stationarity

When used for forecasting, this methodology requires the series to be stationary, because it must be assumed that its characteristics are constant over time and, consequently, will remain so in the future.

However, most series are non-stationary, that is, their mean and variance depend on time. In these cases, it is necessary to difference them successively until they become stationary.

If a time series needs to be differenced d times to become stationary so that an ARMA(p, q) model can be applied, it is said to be an Autoregressive Integrated Moving Average series, ARIMA(p, d, q), where the letter I and the number d refer to the order of integration.

The two most popular tests to check the stationarity of the time series are: Augmented Dickey-Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS).

These tests are different from each other: while ADF evaluates the null hypothesis that a unit root is present in the time series, the KPSS test evaluates the null hypothesis that the time series is trend stationary.

Stationarity of time series is a great topic to explore, but in summary:

  • for the ADF test, if the p-value is less than 0.05, we reject the null hypothesis and conclude that the time series is stationary
  • for the KPSS test, if the p-value is less than 0.05, we reject the null hypothesis and conclude that the time series is not trend stationary

from statsmodels.tsa.stattools import adfuller, kpss

# Augmented Dickey-Fuller test on the original data
result = adfuller(df_train)

"""
ADF Statistic: -2.8259773804546313
p-value: 0.05465782398238138
"""

# KPSS test on original data
result = kpss(df_train)

"""
KPSS Statistic: 2.650258381404954
p-value: 0.01
"""

Based on the p-values, both tests indicated that the series is non-stationary. Therefore, we test the differenced series (d = 1):

# ADF test on the differenced data
result = adfuller(df_train.diff().dropna())

"""
ADF Statistic: -7.784723007583381
p-value: 8.237850556477621e-12
"""

# KPSS test on the differenced data
result = kpss(df_train.diff().dropna())

"""
KPSS Statistic: 0.03236838154976303
p-value: 0.1
"""

With these results, we can confirm that the series needs to be differenced once (d = 1) to become stationary and be used for forecasting.
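The decision rule applied above can be written as a small helper. The function name, the 0.05 threshold, and the simplification that both tests must agree are my own choices, not part of the original analysis:

```python
def needs_differencing(adf_pvalue: float, kpss_pvalue: float,
                       alpha: float = 0.05) -> bool:
    """Return True when both tests point to a non-stationary series.

    ADF:  null = unit root (non-stationary) -> p >= alpha keeps the null
    KPSS: null = trend stationary           -> p <  alpha rejects the null
    """
    adf_nonstationary = adf_pvalue >= alpha
    kpss_nonstationary = kpss_pvalue < alpha
    return adf_nonstationary and kpss_nonstationary

# Original series: ADF p = 0.0547, KPSS p = 0.01 -> difference it
print(needs_differencing(0.0547, 0.01))   # True

# Differenced series: ADF p ~ 8.2e-12, KPSS p = 0.1 -> stationary
print(needs_differencing(8.2e-12, 0.1))   # False
```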

The differencing process involves calculating the difference between consecutive values in the series. Figure 5 presents a visual comparison of the original time series (left) alongside its differenced version (right).

Figure 5: Original data vs. differenced data
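The operation itself is just the lag-1 difference between consecutive values, equivalent to df.diff().dropna() in pandas. A toy example with made-up revenue values:

```python
import numpy as np

revenue = np.array([100.0, 120.0, 115.0, 130.0])

# First-order differencing (d = 1): each value minus its predecessor
diffed = np.diff(revenue)
print(diffed)  # [ 20.  -5.  15.]
```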

ACF and PACF

To set the best orders for the models, we run a pre-analysis with the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) on the differenced series:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Side-by-side axes for the two plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# ACF
plot_acf(df_train.diff().dropna(), lags=25, alpha=0.05, ax=ax1)

# PACF
plot_pacf(df_train.diff().dropna(), lags=25, alpha=0.05, ax=ax2)

Figure 6: ACF and PACF plots for differentiated data

The ACF plot shows significant autocorrelation at lags 7, 14 and 21. As we are dealing with daily data, this behavior indicates a weekly relationship. Given the sinusoidal decay in the PACF plot, the differenced data points to an ARIMA(0, d, q) pattern for the following steps.
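This weekly signature can be reproduced on a synthetic series with a 7-day cycle (a toy illustration, not the Olist data), using a hand-rolled sample autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(364)
# Synthetic daily series with a 7-day cycle plus noise
y = np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=t.size)

def acf(series: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of a series at the given lag."""
    centered = series - series.mean()
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

# Autocorrelation is strongly positive at the weekly lag,
# and negative off-cycle (e.g. at lag 3)
print(round(acf(y, 7), 2), round(acf(y, 3), 2))
```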

Auto ARIMA model

The auto_arima function is designed to select the optimal ARIMA model. It will be used to find the p, q and seasonal orders, given the context of the previous tests (ADF, KPSS, ACF and PACF plots). The explanation of each parameter is in the comments of the code box.

from pmdarima.arima import auto_arima

fit_arima = auto_arima(
    df_train,                     # dataset
    max_p=3,                      # limit the autoregressive terms so the model doesn't overfit
    max_q=3,                      # limit the moving average terms so the model doesn't overfit
    m=7,                          # number of periods in the seasonal cycle
    seasonal=True,                # fit seasonal orders
    seasonal_test='ocsb',         # use the OCSB test for seasonality
    d=1,                          # one order of differencing, based on the ADF and KPSS results
    trace=False,                  # do not print the progress of the search
    information_criterion='bic',  # choose the best model by the Bayesian information criterion
    stepwise=False                # test all possible combinations
)

After analysing the results of fit_arima.summary(), the fitted model is used to forecast the next 14 days of revenue in test_forecast, with the upper and lower limits of its confidence interval in conf_int.

# Forecasting future values 
test_forecast, conf_int = fit_arima.predict(n_periods=14, return_conf_int=True)

Figure 7 shows the 2018 data in blue, the forecast in red, the confidence interval in light red and the actual data in dashed black:

Figure 7: Auto ARIMA forecast

To evaluate the results, we will use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Transforming the dataframes to numpy arrays
y_true = np.array(df_test)
y_pred = np.array(test_forecast)

# Computing the metrics
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

"""
Mean Absolute Error (MAE): 4783.969017717254
Root Mean Squared Error (RMSE): 6284.0165495765505
Mean Absolute Percentage Error (MAPE): 17.10824494692397
"""
  • MAE: the average absolute difference between the predicted and the actual values. In other words, the model’s predictions are off by approximately $4,783.97 on average.
  • RMSE: a measure of the average magnitude of the errors that gives more weight to larger ones. In this case, $6,284.02.
  • MAPE: the percentage difference between the predicted and the actual values. The model’s predictions have an error of approximately 17.11%.
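For reference, the three metrics can be computed with plain NumPy; the sketch below mirrors the formulas used above on made-up numbers:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error: average absolute deviation."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percentage Error, in percent (undefined when y_true has zeros)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Toy check with made-up values
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 440.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), round(mape(y_true, y_pred), 2))
```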

The sum of the actual values for the next fourteen days is $536,219.46, while the model predicted $519,067.91, a difference of $17,151.55 and a relative error of only 3.2%.

SARIMA model

Using the orders found by auto_arima, we compare the results with statsmodels.tsa.statespace.sarimax:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fitting the same orders from Auto ARIMA to compare the models
model = SARIMAX(df_train, order=(0, 1, 1), seasonal_order=(1, 0, 1, 7))
results = model.fit()

# Forecasting the next 14 days and their confidence interval
forecast = results.get_forecast(steps=14)
forecast_values = forecast.predicted_mean
conf_int = forecast.conf_int()

Plotting the forecast for the next fourteen days, we have:

Figure 8: SARIMA forecast

As expected, even though it is not the same, the prediction values are similar to the previous model.

# Transforming the dataframes to numpy arrays
y_true = np.array(df_test)
y_pred = np.array(forecast_values)

# Computing the metrics
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

"""
Mean Absolute Error (MAE): 4978.981583607233
Root Mean Squared Error (RMSE): 6378.02869222759
Mean Absolute Percentage Error (MAPE): 17.44278310702825
"""

The metrics are slightly worse with statsmodels.tsa.statespace.sarimax than with pmdarima.arima.auto_arima.

As seen above, the sum of the actual values for the next fourteen days is $536,219.46. The SARIMA forecasting model indicated a sum of $525,548.79, a difference of $10,670.67 and a relative error of only 1.99%.
The sum of the lower limits of the prediction was $251,496.34, and of the upper limits, $799,601.23.

In general, both models performed well and produced trustworthy forecasts with very small relative errors for the next fourteen days of revenue. The project demonstrates the approach of finding the order of the parameters with auto_arima and then fitting the model with the statsmodels library.

Next Steps

  • Study the impact of Black Friday on the data. Should it be removed? Should it be smoothed?
  • Apply the Ljung-Box test on residuals to check their autocorrelation
  • Test other models like XGBoost to compare the results
  • Use cross-validation to improve the model
  • Incorporate other variables and use multivariate time series models like VAR and VEC to forecast

Do you have any thoughts about it?
