“If you can look into the seeds of time, and say which grain will grow and which will not, speak then unto me.” — William Shakespeare
Forecasting is by far the most important and frequently used application of predictive analytics because it has a significant impact on both the top line and the bottom line of an organization. Every organization prepares long-range and short-range plans, and forecasting demand for its products and services is central to both.
The time-series data Yt is a random variable, usually collected at regular time intervals and in chronological order. If the time-series data contains observations of just a single variable (such as demand of a product at time t), then it is termed univariate time-series data. If the data consists of more than one variable, for example, demand for a product at time t, price at time t, amount of money spent by the company on promotion at time t, competitors’ price at time t, etc., then it is called multivariate time-series data.
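The distinction can be illustrated with two small pandas DataFrames; this is a sketch with made-up numbers, and the column names are illustrative:

```python
import pandas as pd

# Hypothetical monthly observations
dates = pd.date_range("2020-01-01", periods=4, freq="MS")

# Univariate time series: a single variable (demand) observed over time
univariate = pd.DataFrame({"demand": [120, 135, 128, 150]}, index=dates)

# Multivariate time series: several variables observed at the same time points
multivariate = pd.DataFrame({
    "demand": [120, 135, 128, 150],
    "price": [9.5, 9.5, 9.0, 8.5],
    "promotion_spend": [0, 500, 0, 1000],
}, index=dates)

print(univariate.shape)    # one column of observations
print(multivariate.shape)  # three columns observed at each time point
```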
A time-series data can be broken into the four following components:
1. Trend Component (Tt): Trend is the consistent long-term upward or downward movement of the data.
2. Seasonal Component (St): Seasonal component (measured using seasonality index) is the repetitive upward or downward movement (or fluctuations) from the trend that occurs within a calendar year at fixed intervals (i.e., time between seasons is fixed) such as seasons, quarters, months, days of the week, etc.
3. Cyclical Component (Ct): Cyclical component is the fluctuation around the trend line that occurs at random intervals (i.e., the time between cycles is random) due to macro-economic changes such as recession, unemployment, etc. Cyclical fluctuations are repetitive, with more than a year between repetitions. The major difference between seasonal and cyclical fluctuations is their periodicity: seasonal fluctuations occur at fixed periods within a calendar year, whereas the time between cyclical fluctuations is random and their periodicity is not constant.
4. Irregular Component (It): Irregular component is the white noise, i.e., random uncorrelated changes that follow a normal distribution with a mean of 0 and constant variance.
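The components above can be made concrete by composing a synthetic series. The sketch below builds an additive series from a linear trend, a 12-month seasonal cycle, and normally distributed noise (the cyclical component is omitted for brevity; all numbers are illustrative):

```python
import numpy as np

np.random.seed(42)
t = np.arange(48)                             # 48 months of data

trend = 100 + 2.0 * t                         # Tt: upward linear trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)    # St: fixed 12-month cycle
irregular = np.random.normal(0, 3, size=48)   # It: white noise, mean 0

# Additive composition of the components
y = trend + seasonal + irregular
print(y.shape)
```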
In statistics, a moving average (also called a rolling or running average) analyzes data points by creating a series of averages of different subsets of the full data set. Let’s take an example:
import pandas as pd
wsb_df = pd.read_csv('wsb.csv')
Pandas has a rolling() function which can be combined with an aggregate function such as mean() to calculate the moving average over a time window.
wsb_df['mavg_12'] = wsb_df['Sale Quantity'].rolling(window=12).mean().shift(1)
To display values up to 2 decimal points, we can use pd.set_option with the floating format %.2f. Now print the forecasted values from month 37 onwards:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
wsb_df[['Sale Quantity', 'mavg_12']][36:]
Predicted values from moving average forecasting
Mean Absolute Percentage Error
Mean absolute percentage error (MAPE) is the average of the absolute percentage errors. Since MAPE is dimensionless, it can be used for comparing models with varying scales. Let’s define a method for MAPE:
import numpy as np

def get_mape(actual, predicted):
    y_true, y_pred = np.array(actual), np.array(predicted)
    return np.round(np.mean(np.abs((y_true - y_pred) / y_true)) * 100, 2)
Let’s invoke the above method using the DataFrame: “Sale Quantity” is passed as the actual values and mavg_12 as the predicted values. Records from the 37th month onwards are used to calculate MAPE.
get_mape(wsb_df['Sale Quantity'][36:].values,
         wsb_df['mavg_12'][36:].values)
The MAPE in this case is 14.04. So, forecasting using the 12-month moving average gives a MAPE of 14.04%.
DECOMPOSING TIME SERIES
The time-series data can be modelled as the sum or the product of the trend, seasonal, cyclical, and irregular components. The additive time-series model is given by
Yt = Tt + St + Ct + It
The additive models assume that the seasonal and cyclical components are independent of the trend component. Additive models are not very common, since in many cases the seasonal component may not be independent of the trend. The multiplicative time-series model is given by
Yt = Tt × St × Ct × It
Multiplicative models are more common and are a better fit for many datasets.
For decomposing a time-series data, we can leverage the following libraries:
- statsmodels.tsa provides various features for time-series analysis.
- seasonal_decompose() in statsmodels.tsa.seasonal decomposes a time series into trend, seasonal, and residual components. It takes a frequency parameter; for example, the frequency is 12 for monthly data.
from statsmodels.tsa.seasonal import seasonal_decompose

ts_decompose = seasonal_decompose(np.array(wsb_df['Sale Quantity']),
                                  model='multiplicative', freq=12)

## Plotting the decomposed time series components
ts_plot = ts_decompose.plot()
To capture the seasonal and trend components after time-series decomposition, we use the following code. The information can be read from the seasonal and trend attributes of ts_decompose.
wsb_df['seasonal'] = ts_decompose.seasonal
wsb_df['trend'] = ts_decompose.trend
AUTO-REGRESSIVE INTEGRATED MOVING AVERAGE (ARIMA) MODELS
Auto-regressive (AR) and moving average (MA) models are popular models that are frequently used for forecasting. AR and MA models are combined to create models such as auto-regressive moving average (ARMA) and auto-regressive integrated moving average (ARIMA) models.
ACF: a bar chart of the coefficients of correlation between a time series and lags of itself. It helps determine the value of q, the MA term. To plot the autocorrelation plot use:
PACF: a plot of the partial correlation coefficients between the series and lags of itself. It helps determine the value of p, the AR term. To plot the partial autocorrelation plot use:
Auto-regressive moving average (ARMA) is a combination of auto-regressive and moving average processes. An ARMA(p, q) process combines AR(p) and MA(q) processes. The values of p and q in an ARMA process can be identified using the following thumb rules:
- Auto-correlation values are significant for the first q lags and then cut off to zero.
- Partial auto-correlation values are significant for the first p lags and then cut off to zero.
ARMA models can be used only when the time-series data is stationary. ARIMA models are used when the time-series data is non-stationary. Time-series data is called stationary if the mean, variance, and covariance are constant over time. ARIMA has three components and is represented as ARIMA(p, d, q), where d is the integration (differencing) component. Let’s take an example of store data:
store_df = pd.read_excel('store.xls')
The dataset contains 115 days of demand-per-day data. We can convert the Date column into a DateTime index, which is the default input to time-series models. Creating a DateTime index ensures that the data is ordered by date or time. Finally, plot the demand against the date using the plot() method:
import matplotlib.pyplot as plt

store_df.set_index(pd.to_datetime(store_df.Date), inplace=True)
store_df.drop('Date', axis=1, inplace=True)

plt.xlabel("Date")
plt.plot(store_df.demand);
We can use the Dickey-Fuller test to check whether the time series is stationary. statsmodels.tsa.stattools.adfuller implements the Dickey-Fuller test and returns the test statistic and the p-value for the null hypothesis that the series is non-stationary. If the p-value is less than 0.05, the null hypothesis is rejected and the time series is stationary.
from statsmodels.tsa.stattools import adfuller

adfuller_result = adfuller(store_df.demand, autolag=None)
adfuller_out = pd.Series(adfuller_result[0:4],
                         index=['Test Statistic', 'p-value',
                                'Lags Used',
                                'Number of Observations Used'])
The p-value (> 0.05) indicates that we cannot reject the null hypothesis; hence, the series is not stationary. Differencing the original time series is a common approach for converting a non-stationary process into a stationary process (called difference stationarity).
store_df['demand_diff'] = store_df.demand - store_df.demand.shift(1)
store_df.head(5)
The difference for the first day is NaN as we do not have the previous day’s demand. We can drop this value from the observations before verifying stationarity.
store_diff_df = store_df.dropna()

plt.ylabel("First Order Differences")
plt.plot(store_diff_df.demand_diff);
Again, it is not very apparent from the plot whether the series is stationary. We will need to plot the ACF to verify:
from statsmodels.graphics.tsaplots import plot_acf

acf_plot = plot_acf(store_df.demand_diff.dropna(), lags=10)
The ACF cuts off to zero almost immediately. We can build the model with the first 100 observations as the training set and the subsequent observations as the test set. Here we choose p, d, q using the thumb rules above:
from statsmodels.tsa.arima_model import ARIMA

store_train = store_df[0:100]
store_test = store_df[100:]

arima = ARIMA(store_train.demand.astype(np.float64).values, order=(1, 1, 1))
arima_model = arima.fit()
The forecast() method takes the number of future periods (steps) for predicting values and returns the predicted values along with standard error and confidence interval of the predictions.
store_predict, stderr, ci = arima_model.forecast(steps=15)
store_predict
24.17: Forecast accuracy of the ARIMA model with first-order differencing.
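The forecast accuracy can be measured with the get_mape helper defined earlier, comparing the ARIMA forecast against the held-out test demand. A self-contained sketch, using illustrative stand-in arrays in place of store_test.demand and the forecast:

```python
import numpy as np

def get_mape(actual, predicted):
    # Same MAPE helper as defined earlier in the chapter
    y_true, y_pred = np.array(actual), np.array(predicted)
    return np.round(np.mean(np.abs((y_true - y_pred) / y_true)) * 100, 2)

# Illustrative stand-ins for store_test.demand values and store_predict
actual = np.array([20.0, 22.0, 19.0, 25.0, 21.0])
predicted = np.array([21.0, 20.5, 20.0, 23.0, 22.0])

mape = get_mape(actual, predicted)
print(mape)  # 5.97
```

In practice the call would be get_mape(store_test.demand.values[:15], store_predict), mirroring the moving-average evaluation earlier.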