R provides a variety of tools to manage, analyze and forecast time series. In this article, I will use the UK FTSE time series (I’ll extract it from the EuStockMarkets dataset, which ships with base R). However, everything I say applies to any time series, financial or not.
Before starting the analysis, it is worth spending a few lines on the models we are going to employ. They belong to the class of Autoregressive Moving Average, or ARMA, models. The idea behind ARMA models is that, given a time series Y, the value of Y at time t may depend on its past values (the autoregressive component) and on its past error terms, which are assumed to be white noise (the moving average component).
The number of lags in each component is given by p (autoregressive) and q (moving average), while the constant term c is called the “drift”.
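To make this concrete, here is a minimal sketch that simulates an ARMA(1,1) process with base R’s arima.sim (the orders and coefficients below are arbitrary illustrative choices, not estimated from the FTSE data):

```r
set.seed(42)
# Simulate 500 observations of an ARMA(1,1) process:
# Y_t = 0.6 * Y_{t-1} + e_t + 0.4 * e_{t-1}, with white-noise errors e_t
y <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 500)
plot(y, main = "Simulated ARMA(1,1)")
```

Playing with the ar and ma coefficients is a quick way to build intuition about how each component shapes the series.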
In general, every statistical model is estimated with a defined method (for example, linear regression is estimated with Ordinary Least Squares). In our case, ARMA models are estimated via Maximum Likelihood Estimation (MLE). The underlying idea is very intuitive: the MLE of a vector of parameters, given a sample, is the value under which that sample is most likely to be observed.
The fitting process returns the estimated coefficients, but first we need to choose the order (p,q) of the model. To do that, the first step of time series analysis (and, I’d add, of any data analysis task) is getting familiar with our data: inspecting them and formulating some initial assumptions.
Hence, let’s start by plotting the UK FTSE time series.
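A minimal way to do that (EuStockMarkets ships with base R, so no extra packages are needed):

```r
# EuStockMarkets is a multivariate ts object; select the FTSE column
ftse <- EuStockMarkets[, "FTSE"]
plot(ftse, main = "UK FTSE", ylab = "Index value")
```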
As you can see, there is a clear upward trend, even though the index experienced a sharp fall around the end of 1998. One might be interested in forecasting whether this fall is temporary, in other words, whether future values of the index will increase or decrease.
Some interesting information can be gathered by splitting the series into its components:
components.ts <- decompose(ftse)
plot(components.ts)  # trend, seasonal and random components
Now a further concept needs to be introduced: stationarity.
Ideally, we want to work with stationary series (that is, series with constant mean and variance over time) in order to bypass the main concern of time series analysis: the impossibility of observing, at a fixed time t, all the possible realizations of the stock/index. Indeed, we only observe one path of realizations over time for any given stock/index.
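Stationarity can also be checked formally, for instance with an Augmented Dickey-Fuller test. A sketch, assuming the tseries package (not used elsewhere in this article) is installed:

```r
library(tseries)  # assumed to be installed

ftse <- EuStockMarkets[, "FTSE"]
# H0: the series has a unit root (non-stationary).
# A large p-value means we cannot reject non-stationarity.
adf.test(ftse)
```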
There are different ways to make a time series stationary. One of these is simply taking its first difference, after subtracting the seasonal component:
x <- ftse - components.ts$seasonal
ftse_stationary <- diff(x, differences = 1)
It now looks more stationary. These interventions we made (removing seasonality and taking the first difference) can be implemented directly in the model, building a so-called SARIMA (Seasonal Autoregressive Integrated Moving Average). Indeed, the “integrated” component takes the d-th difference of the series, while the “seasonal” component accounts for seasonality.
So, as I said before, the last step before training the model is choosing the orders p, q and, in this specific case, d (the number of differences we have to take to make our series stationary). A good way to proceed is to inspect the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of our series:
acf(ftse_stationary, lag.max = 40)
pacf(ftse_stationary, lag.max = 40)
These functions tell us whether the values of the series are correlated with their past values and, if so, up to which lag. The ACF measures the total correlation between values at t and t-n, including the effect of the intermediate lags, while the PACF measures the correlation between values at t and t-n after removing the effect of the intermediate lags.
How can one interpret these two plots? The basic recipe is the following: for an AR(p) process, the ACF tails off gradually while the PACF cuts off after lag p; for an MA(q) process, the ACF cuts off after lag q while the PACF tails off gradually; if both tail off, a mixed ARMA(p,q) process is likely.
The shapes of our ACF and PACF seem to suggest a Moving Average process. However, it is always a good practice to train more than one model and compare them according to given criteria (I will dwell on this concept later on). Luckily, there is a function in R which will make this comparison in a few seconds, returning the best combination of parameters (again, we still have to define what “best combination” means).
Before employing this function, let’s train a very simple ARIMA model by choosing low values of p, q and d.
fitARIMA <- arima(ftse, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 0), period = 12), method = "ML")
The very first thing to do once the model is trained is to inspect its residuals. For the model to be accurate enough, it is fundamental that the residuals are stationary, without a specific trend. If there were a trend, it would mean there is still a pattern the model did not capture: some available information has not been used. If the residuals are not stationary, our model needs more parameters (or a different combination of them).
Let’s examine our residuals:
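A minimal sketch using base R (fitARIMA is the model trained above):

```r
res <- residuals(fitARIMA)
plot(res, main = "ARIMA residuals")
# For a good fit, the residual ACF should show no significant spikes
acf(res, lag.max = 40)
```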
They look pretty stationary, but we need to test it. To do that, I will employ the Ljung-Box test, whose hypotheses are:
H0: The data are independently distributed
H1: The data are not independently distributed (hence, they exhibit serial correlation)
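The test is available out of the box via stats::Box.test (the lag value below is an arbitrary but common choice):

```r
# Ljung-Box test on the residuals of the fitted model
Box.test(residuals(fitARIMA), lag = 20, type = "Ljung-Box")
```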
Let’s examine the output:
As you can see, the p-value is greater than any conventional significance level alpha, so we cannot reject the null hypothesis: there is no evidence of serial correlation in the residuals.
So our first assumptions were good enough to build a model returning independent residuals. Now, as anticipated, let’s search for the best model using the auto.arima() function. It returns the best ARIMA model according to either the AIC, AICc or BIC value. These are three different information criteria: they bypass the risk of excessive complexity that would emerge if we relied only on the MLE (which tends to increase the number of parameters to maximize the joint probability of occurrence). Indeed, these criteria penalize the likelihood as the number of parameters increases, following the “law of parsimony” that became a fundamental principle of statistical theory (to quote William of Ockham (1287–1347), the best models are those which are simple and fit the data well).
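A minimal sketch (auto.arima lives in the forecast package, which is assumed to be installed):

```r
library(forecast)  # assumed to be installed

# Search over (p,d,q) combinations; by default auto.arima selects by AICc
model <- auto.arima(ftse)
summary(model)
```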
So, it seems the most accurate model is an ARIMA(0,1,1) with drift.
Let’s visualize the residuals:
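One convenient option (a sketch using the forecast package’s tsdisplay, which shows the residuals together with their ACF and PACF):

```r
library(forecast)  # assumed to be installed
tsdisplay(residuals(model))
```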
And, again, let’s run the Ljung-Box test for double-checking the absence of serial correlation:
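The call is the same as before, applied to the residuals of the model selected by auto.arima:

```r
Box.test(residuals(model), lag = 20, type = "Ljung-Box")
```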
Great, our function found a good model. Now, we can jump to the final part of this analysis: making predictions about future outcomes.
predicted_values <- forecast(model, h = 200, level = c(99.5))
plot(predicted_values)  # forecasts with their confidence bounds
For the sake of clarity (and to verify the concrete forecasting ability of our ARIMA), I’m attaching the real UK FTSE time series (the red circle highlights the forecasted values):
Well, of course, these predictions are not 100% accurate, and it would be unusual if they were. Indeed, predicting exact stock and index values is utopian, since the variables involved are too many and often not measurable (think about market sentiment, natural catastrophes, the spread of scandals…).
Nevertheless, the model managed to capture the sudden inversion of the trend, from downward to upward. Furthermore, the real UK FTSE values lie within the 99.5% confidence bounds (the area in light grey).
There is a variety of further analyses and predictions you can implement with R that I didn’t cover in this article. Yet the few tools I employed show the power and usefulness of R for time series analysis. Finally, remember that you can always combine different libraries and customize your results (for example, building an interactive graph with time sliders using plotly).
If you are interested in the ‘pythonic’ version of this article, you can check it here. Hope you enjoyed the reading!