Predictive Modeling in R (Part 1)
How to start using the ARIMA model
About a year ago, we wrote about the ARIMA model. We wanted to finally circle back and discuss creating a forecasting model using the ARIMA and/or ETS methods rather than using more complex driver-based models that take many variables into consideration.
The ARIMA model only looks at the past output data for patterns.
Although this limits what patterns the ARIMA model can look for, it's still quite robust and allows for a lot of modification if you understand the various parameters the model has to offer.
The patterns in the previous observations (also called lags) are a key factor when developing this model, as they determine how many coefficients are required for the different parameters.
Selecting the correct number of parameters is tedious, as it can require you to test multiple different combinations. The ARIMA model can take three parameters for a nonseasonal time series and six when also considering seasonality.
These parameters are referenced as (p, d, q) for nonseasonal data and (P, D, Q) for a time series that also has a seasonal component. In this article, we won't be focusing heavily on these parameters. Instead, we'll be discussing how R can be used to understand your data before you start manipulating them.
R allows you to run a very simple script, auto.arima(), that creates an ARIMA model automatically. It'll run the combinations it thinks are best for you, and it requires no understanding of ARIMA. This function has many default settings, so the only input required is the time series.
Using it on the AirPassengers data set, the code could look like what is listed below.
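A minimal sketch of that call, assuming the forecast package (which provides auto.arima()) is installed, might look like this:

```r
# Fit an ARIMA model to the built-in monthly AirPassengers data set
# and plot a two-year forecast. Requires the 'forecast' package.
library(forecast)

fit <- auto.arima(AirPassengers)  # selects the (p, d, q)(P, D, Q) orders automatically
summary(fit)                      # inspect the chosen orders and coefficients

fc <- forecast(fit, h = 24)       # forecast 24 months ahead
plot(fc)                          # forecast with confidence bands over the history
```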
The forecast actually seems pretty close.
However, it’s very important to realize the auto.arima() function doesn’t always work as well as you might think. For instance, let’s take a look at the graph below.

The auto.arima() function here doesn’t seem to work as well. The forecast doesn’t visually fit the previous data, and it has a very wide confidence interval.
This is likely due to the fact that the overall pattern repeats every two years, not just one. Without understanding the various parameters, you won’t be able to tweak the model to better fit the data.
Thus, it’s important to learn more about what the ARIMA model is and how you can use R to better understand the patterns in your time series. (In our next piece, we’ll discuss the various parameters and develop a better model for the data above.)
R Programming and Stationarity
Stationarity is an important concept when developing forecasting models.
A stationary time series is one whose mean, variance, and autocovariance stay constant over time. To put it simply, this means the time series is somewhat predictable.
Attributes like trends and autocorrelation will influence a time series and often cause it to no longer be stationary.
For instance, let’s look at the EuStockMarkets data set. This data set contains the daily closing prices of major European stock indices: Germany DAX (IBIS), Switzerland SMI, France CAC, and UK FTSE. The data are sampled in business time, i.e., weekends and holidays are omitted (R documentation).
It’s a great data set for understanding a random walk with drift. Looking at the graph below, this data set doesn’t have a consistent mean, meaning it’s most likely nonstationary.
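Producing that graph takes only base R, since EuStockMarkets ships with the stats package. A quick sketch:

```r
# Plot all four indices in the EuStockMarkets data set; none of them
# fluctuates around a constant mean, which suggests nonstationarity.
plot(EuStockMarkets)

# A single index can also be inspected on its own:
plot(EuStockMarkets[, "DAX"], ylab = "DAX closing price")
```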
You can test this fact with several functions that R provides, including the acf() (autocorrelation) function and the adf.test() (augmented Dickey-Fuller test) function.

The acf() function is used to test autocorrelation. Autocorrelation is the correlation between different observations (lags). This could be the previous observation or even several lags prior.
The function call looks like this:
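For example, applied to one of the indices (here the DAX column, chosen for illustration):

```r
# Autocorrelation plot of the DAX closing prices. The slow, near-linear
# decay across many lags is typical of a nonstationary series.
acf(EuStockMarkets[, "DAX"], main = "ACF of DAX closing prices")
```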
The output of this function depicts the correlation between the current observation and the observations at each lag. The blue dotted lines represent a 95% confidence interval, and each solid vertical line represents the correlation at that lag.
A stationary data set will often have what we might describe as an exponentially decaying or nonexistent correlation. This means perhaps only one or two lags after the original observation will go beyond the blue dotted line and show a large amount of correlation.
This is not the case with the stock data. Instead, almost every following observation is correlated with the previous observation. This is a good sign that the data set is nonstationary.
Another method to test stationarity is to use adf.test(). This function uses the augmented Dickey-Fuller (ADF) test, which analyzes the data set for a unit root. A unit root can influence stationarity because unit roots make processes unpredictable.
The function can be called with just the time series object, as listed below.
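A sketch of that call, assuming the tseries package (which provides adf.test()) is installed, again using the DAX column for illustration:

```r
# Augmented Dickey-Fuller test for a unit root.
# Requires the 'tseries' package.
library(tseries)

# A large p-value means we cannot conclude the series is stationary.
adf.test(EuStockMarkets[, "DAX"])
```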
The output can be a little confusing because the alternative hypothesis will always read stationary, and some people presume this to mean the data set is stationary. This isn’t the case. The output is merely informing you what the alternative hypothesis is.
However, in this case, we fail to reject the null hypothesis because the p-value is almost 1, when it would need to be less than or equal to .05 to do so. If the p-value is less than or equal to .05, you can reject the null hypothesis, and your data set is likely stationary.
In this case, the stock data is nonstationary.
The Decompose Function
The ARIMA model relies on the various features it can pull out from the data provided. Using the decompose function is a great way to see some of the features that a data set has.
The decompose function separates a data process into three components: seasonality, trend, and a random component. In R, you can also adjust the decompose function to use either a multiplicative model or an additive model, like the one below.
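A minimal sketch using base R's decompose() on AirPassengers (the seasonal swings in that series grow with its level, which is why a multiplicative model is a reasonable choice here):

```r
# Decompose the AirPassengers series into trend, seasonal, and
# random components using a multiplicative model.
components <- decompose(AirPassengers, type = "multiplicative")

# Plot all four panels: observed, trend, seasonal, and random.
plot(components)
```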
To further understand what’s occurring in the decompose function, you can look at the variables below. They represent the components of the series at each point in time rather than single numbers.
S = Seasonality
T = Trend
e = Random (Error)
The additive model used is: Y(t) = T(t) + S(t) + e(t)
The multiplicative model used is: Y(t) = T(t) * S(t) * e(t)
The function first estimates the trend using a moving average. It then extracts a seasonal component, and the remainder is the random component.
The ARIMA model is constructed to take trends into consideration. However, it’s always handy to see how to extract the trend yourself using the moving-average (ma) function.
It further demonstrates how individual components can be pulled out of the data set and put back in.
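A sketch of that idea, assuming the forecast package's ma() function is available:

```r
# Extract the trend with a centered moving average (forecast::ma),
# then remove it from the series and add it back.
library(forecast)

trend <- ma(AirPassengers, order = 12)  # 12-month window for monthly data
detrended <- AirPassengers - trend      # series with the trend removed

plot(AirPassengers)
lines(trend, col = "red")               # overlay the extracted trend

restored <- detrended + trend           # adding the trend back recovers the data
```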
Thanks for reading this explanation of using the auto.arima() function. We hope it helps you as you work on designing your next model.