# Cointegration for Time Series Analysis

## Second article in “Measuring Correlation”

Stationarity is a crucial property for time series modeling. The problem is, in practice, very few phenomena are actually stationary in their original form. The trick is to employ the right technique for reframing the time series into a stationary form. One such technique leverages a statistical property called cointegration. Cointegration forms a synthetic stationary series from a linear combination of two or more non-stationary series.

We’ll use simulated data to demonstrate the main points behind cointegration in R. The demonstration below draws on Tsay (2005), Pfaff (2008), and Pfaff (2007), along with the following three R packages.

```r
require(urca)
require(vars)
require(ggplot2)
```

# Background

Let’s say you have a group of time series that all follow random walks. You can think of cointegration as finding which series tend to “randomly walk together,” i.e., whose spread (the difference between the series at each time step) is stationary. Cointegration tells you that, although each series wanders on its own, the average distance between them remains relatively constant.

More formally, two series are cointegrated if they are both individually unit-root nonstationary (integrated of order 1: I(1)) but there exists a linear combination that is unit-root stationary (integrated of order 0: I(0)). If any of the individual series are already stationary, then cointegration would be redundant since the linear combination would heavily emphasize the stationary series.
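As a quick illustration of this definition, consider a sketch with simulated data (separate from the article’s later examples): two I(1) series built on the same random walk are each nonstationary, yet the right linear combination cancels the walk and leaves an I(0) series.

```r
# Sketch: two I(1) series sharing one random walk; the combination x2 - 2*x1
# cancels the common walk, leaving only stationary noise
set.seed(42)
common <- cumsum(rnorm(500))   # shared random-walk component, I(1)
x1 <- common + rnorm(500)      # I(1): walk plus noise
x2 <- 2 * common + rnorm(500)  # I(1): scaled walk plus noise
spread <- x2 - 2 * x1          # equals noise2 - 2*noise1, which is I(0)
c(var(x1), var(spread))        # the spread's variance is far smaller than the walk's
```

The key point: neither `x1` nor `x2` is stationary on its own, but the linear combination with weights (−2, 1) removes the shared stochastic trend.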

Although the correlation coefficient and cointegration both describe an underlying relationship between variables, the two properties are not synonymous. It is entirely possible for two time series to be strongly correlated yet not cointegrated, or to be cointegrated with little to no correlation. The following examples illustrate the possible combinations.

Strong correlation and no cointegration

The two series are clearly correlated but the difference between them changes with time.

```r
# Strong correlation and no cointegration
set.seed(1)
y1 <- cumsum(rnorm(150, 0.5))
y2 <- cumsum(rnorm(150, 1))
Y <- data.frame(cbind(y1, y2))
ggplot(Y, aes(x = 1:nrow(Y))) +
  geom_line(aes(y = y1), colour = "red") +
  geom_line(aes(y = y2), colour = "green") +
  labs(title = "Correlation vs Cointegration") +
  xlab(expression("Time")) +
  ylab(expression("Y")) +
  theme(plot.title = element_text(hjust = 0.5))
cor(Y)
```

Strong correlation and strong cointegration

The two series are perfectly correlated and cointegrated since the difference between the two doesn’t change with time.

```r
# Strong correlation and strong cointegration
y1 <- sin(seq(1, 100, 1))
y2 <- 1.5 + sin(seq(1, 100, 1))
Y <- data.frame(cbind(y1, y2))
ggplot(Y, aes(x = 1:nrow(Y))) +
  geom_line(aes(y = y1), colour = "red") +
  geom_line(aes(y = y2), colour = "green") +
  labs(title = "Correlation vs Cointegration") +
  xlab(expression("Time")) +
  ylab(expression("Y")) +
  theme(plot.title = element_text(hjust = 0.5))
cor(Y)
```

No correlation and strong cointegration

Adding a slight phase shift to the above removes all correlation but still preserves cointegration.

```r
# No correlation and strong cointegration
y1 <- sin(seq(1, 100, 1))
y2 <- 1.5 + sin(seq(pi/2 + 1, pi/2 + 100, 1))
Y <- data.frame(cbind(y1, y2))
ggplot(Y, aes(x = 1:nrow(Y))) +
  geom_line(aes(y = y1), colour = "red") +
  geom_line(aes(y = y2), colour = "green") +
  labs(title = "Correlation vs Cointegration") +
  xlab(expression("Time")) +
  ylab(expression("Y")) +
  theme(plot.title = element_text(hjust = 0.5))
cor(Y)  # correlation = -0.002733781
```

The code snippet below simulates three time series that share the same underlying random walk process. Although each series is individually a random walk, the plot shows that the three seem to share a common underlying process.

```r
iterations <- 1000
Y <- setNames(as.data.frame(matrix(0, iterations, 4)), c("y", "y1", "y2", "y3"))
set.seed(1)
for (ii in 2:iterations) {
  Y$y[ii]  <- Y$y[ii-1] + rnorm(1)
  Y$y1[ii] <- Y$y1[ii-1] + 0.01 * (Y$y[ii-1] - Y$y1[ii-1]) + rnorm(1)
  Y$y2[ii] <- Y$y2[ii-1] + 0.05 * (Y$y[ii-1] - Y$y2[ii-1]) + rnorm(1)
  Y$y3[ii] <- Y$y3[ii-1] + 0.1  * (Y$y[ii-1] - Y$y3[ii-1]) + rnorm(1)
}
cor(Y)
#            y        y1        y2        y3
# y  1.0000000 0.4926226 0.9121291 0.9673797
# y1 0.4926226 1.0000000 0.5843223 0.5284070
# y2 0.9121291 0.5843223 1.0000000 0.9339683
# y3 0.9673797 0.5284070 0.9339683 1.0000000
```

The correlation matrix shows that the series with larger error-correction coefficients (0.01, 0.05, and 0.1 for y1, y2, and y3 respectively) have stronger correlation with the originating series.

```r
ggplot(Y, aes(x = 1:iterations)) +
  geom_line(aes(y = y),  colour = "black") +
  geom_line(aes(y = y1), colour = "red") +
  geom_line(aes(y = y2), colour = "green") +
  geom_line(aes(y = y3), colour = "blue") +
  labs(title = "Simulated Series with Shared Underlying Random Walk") +
  xlab(expression("Time")) +
  ylab(expression("Y")) +
  theme(plot.title = element_text(hjust = 0.5))
```

Originating random walk (black), first cointegrating series (red), second cointegrating series (green), third cointegrating series (blue)

The time series with higher error-correction coefficients more closely follow the originating series. Despite the first series (red) having the lowest correlation with the originating series, some underlying force does seem to keep the two in range. That behavior is even stronger for the green and blue series.

Now we’ll review the actual techniques to test cointegration beyond mere visual inspection.

# Cointegration Testing

This is the original procedure for testing cointegration, developed by Robert Engle and Clive Granger in their seminal paper (Engle and Granger, 1987). The two-step procedure is easy to follow and paints a good picture of the general idea behind cointegration. First we verify that the individual series are indeed integrated of order 1, I(1) (nonstationary). Then we regress one on the other using standard OLS and check whether the residual series is integrated of order 0, I(0) (stationary). If the residuals are stationary, we can extract the coefficients from the OLS model and use them to form a stationary linear combination of the two time series.

We’ll run this on the first two cointegrated series. First, let’s test each series with the popular ADF test to confirm that each is indeed nonstationary (i.e., contains a unit root).

```r
# ADF test: H0 is a unit root (nonstationary)
y1Adf <- ur.df(Y$y1)
y1Adf@cval
#       1pct  5pct 10pct
# tau1 -2.58 -1.95 -1.62
y1Adf@teststat  # fail to reject null of unit root
#                tau1
# statistic -1.609493
y2Adf <- ur.df(Y$y2)
y2Adf@teststat  # fail to reject null of unit root
#                 tau1
# statistic -0.3265977
```

The above ADF tests confirm that the series are nonstationary. Next we fit a linear model using OLS and check the residuals for stationarity.

```r
m1 <- lm(Y$y1 ~ Y$y2)
m1$coefficients
# (Intercept)        Y$y2
#  -4.0986059   0.3059797
wt <- m1$residuals
spreadAdf <- ur.df(wt)
spreadAdf@cval
#       1pct  5pct 10pct
# tau1 -2.58 -1.95 -1.62
spreadAdf@teststat  # reject H0 of unit root (conclude stationarity)
#                tau1
# statistic -2.982549
```

The test statistic for the spread (wₜ) is more negative than the critical values at all three significance levels, so we reject the null hypothesis of a unit root and are comfortable treating the residual series as stationary. Because the residuals are stationary, the Engle-Granger test suggests a cointegrating relationship exists. Using the OLS estimates above, the cointegrating equation is:

wₜ = y1ₜ − 0.3059797·y2ₜ + 4.0986059
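Written out, the spread wₜ is just y1 minus the fitted OLS combination. A minimal self-contained check (simulated data, not the article’s `Y`):

```r
# Sketch: the Engle-Granger spread equals y1 minus the fitted OLS combination
set.seed(7)
common <- cumsum(rnorm(300))
y1 <- common + rnorm(300)
y2 <- 0.5 * common + rnorm(300)
m <- lm(y1 ~ y2)
a_hat <- unname(coef(m)[1])   # estimated intercept
b_hat <- unname(coef(m)[2])   # estimated slope on y2
w <- y1 - a_hat - b_hat * y2  # hand-built spread series
all.equal(w, unname(residuals(m)))  # identical to the model residuals
```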

There are a few shortcomings of the Engle-Granger procedure. The first is that it cannot simultaneously test for cointegrating relationships among more than two series. Additionally, regressing y2 on y1 instead of y1 on y2 (as we did above) would yield a different cointegrating vector, so the choice of which series to regress on the other is somewhat arbitrary. The other shortcoming is that it overlooks the underlying error-correction model influencing the spread series.

Conveniently, all of the above shortcomings can be addressed through the Johansen procedure. This procedure estimates a cointegrated VARMA(p, q) process in its VECM (vector error-correction model) form for m cointegrating relationships between the k different series in xₜ:

∆xₜ = αβ′xₜ₋₁ + Σᵢ Φᵢ∆xₜ₋ᵢ + aₜ − Σⱼ Θⱼaₜ₋ⱼ  (with i = 1, …, p − 1 and j = 1, …, q)

In the above, α and β are both k × m matrices, ∆xₜ represents the first difference ∆xₜ = xₜ − xₜ₋₁, the Φᵢ are the AR coefficients, the Θⱼ are the MA coefficients, and aₜ is the innovation series. The cointegrating equations are defined by β′xₜ₋₁, where β contains the coefficients for the m cointegrating vectors.

An added benefit of the VECM representation is its ability to generate forecasts with the cointegrating relationship taken into account. Creating these forecasts is beyond the scope of this post but might be covered in a later one.
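As a hedged sketch of what such a forecast could look like (simulated data, not the article’s series; the calls below follow the `vars` package, with `r = 1` assumed from a prior rank test):

```r
# Sketch: forecasting from a fitted VECM by re-expressing it as a level VAR
library(urca)
library(vars)
set.seed(3)
common <- cumsum(rnorm(400))
dat <- data.frame(a = common + rnorm(400), b = 0.8 * common + rnorm(400))
jo <- ca.jo(dat, type = "trace", ecdet = "const", K = 2, spec = "transitory")
vec <- vec2var(jo, r = 1)         # VECM -> level-VAR form, one cointegrating relation
fc <- predict(vec, n.ahead = 10)  # 10-step-ahead forecasts with intervals
head(fc$fcst$a)                   # columns: fcst, lower, upper, CI
```

Because the cointegrating relation is preserved in the level-VAR form, the forecasts for `a` and `b` will not drift apart indefinitely.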

The Johansen test carries out an eigenvalue decomposition of the matrix Π = αβ′ and sequentially tests its rank r to determine how many cointegrating relations exist. The first null hypothesis is r = 0: failing to reject it would mean no such relations exist. The procedure then tests r ≤ j for each j < k, where k is the number of series being tested (we cannot have k valid cointegrating relations, since that would imply the series are already stationary). If we are only testing two series, the two tests are r = 0 and r ≤ 1.

The eigenvalue decomposition produces eigenvectors, and the eigenvectors corresponding to the largest eigenvalues supply the cointegrating coefficient vectors in β.

```r
cot <- ca.jo(Y[, c(2:4)], ecdet = "const", type = "trace", K = 2, spec = "transitory")
summary(cot)
# ######################
# # Johansen-Procedure #
# ######################
#
# Test type: trace statistic , without linear trend and constant in cointegration
#
# Eigenvalues (lambda):
#  3.947760e-02 2.063349e-02 8.268313e-04 4.336809e-19
#
# Values of teststatistic and critical values of test:
#
#           test 10pct  5pct  1pct
# r <= 2 |  0.83  7.52  9.24 12.97
# r <= 1 | 21.63 17.85 19.96 24.60
# r = 0  | 61.83 32.00 34.91 41.07
#
# Eigenvectors, normalised to first column:
# (These are the cointegration relations)
#
#              y1.l1      y2.l1       y3.l1   constant
# y1.l1     1.000000  1.0000000   1.0000000  1.0000000
# y2.l1     2.918063 -1.0330340   0.1625347  0.2206172
# y3.l1    -3.830987  0.5674745   0.8981619  0.1276251
# constant -1.536007  2.8054993 -10.6323941 15.9087882
#
# Weights W:
# (This is the loading matrix)
#
#              y1.l1        y2.l1         y3.l1     constant
# y1.d -0.0016638564 -0.020254936  0.0002125480 5.487152e-18
# y2.d -0.0111657363  0.003239633 -0.0001845491 1.099864e-18
# y3.d  0.0007776353 -0.005012261 -0.0010935382 3.262192e-18
```

Looking at the test statistics and critical values, we fail to reject the r ≤ 2 test but can reject the r ≤ 1 case at the 0.05 level of significance. Therefore, we conclude that the rank of the matrix is r = 2: there are two independent cointegrating relations among the three series, each of which forms a stationary linear combination.
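The sequential decision above can be written out as a small helper (a hypothetical function, not part of `urca`): walk the trace tests from r = 0 upward and stop at the first non-rejection.

```r
# Hypothetical helper: choose the cointegration rank from trace statistics,
# ordered from the r = 0 test up to the r <= k-1 test
# (note that urca prints the rows in the opposite order)
choose_rank <- function(trace_stat, crit_5pct) {
  for (i in seq_along(trace_stat)) {
    if (trace_stat[i] < crit_5pct[i]) return(i - 1)  # fail to reject: rank is i-1
  }
  length(trace_stat)  # every test rejected
}

# the article's values in r = 0, r <= 1, r <= 2 order, with 5% critical values
choose_rank(c(61.83, 21.63, 0.83), c(34.91, 19.96, 9.24))  # 2
```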

The eigenvector corresponding to the largest eigenvalue (3.947760e-02) is [1, 2.918063, −3.830987, −1.536007], found under column y1.l1 in the eigenvector section. This leads to our cointegrating equation:

wₜ = y1ₜ + 2.918063·y2ₜ − 3.830987·y3ₜ − 1.536007

The ADF test also supports our resulting series being stationary.

```r
ones <- rep(1, nrow(Y))  # column of ones for the constant term
xMat <- t(data.matrix(cbind(Y[c(2:4)], ones)))
beta <- data.matrix(cot@V[, 1])
wt <- t(beta) %*% xMat  # wt = Y$y1 + 2.918063*Y$y2 - 3.830987*Y$y3 - 1.536007
plot(t(wt), type = "l", main = "Stationary Spread Series",
     xlab = "Time", ylab = "Spread")
wtAdf <- ur.df(t(wt))
wtAdf@cval
#      1pct  5pct 10pct
# tau1 -2.58 -1.95 -1.62
wtAdf@teststat  # reject null of unit root (conclude stationarity)
#                tau1
# statistic -4.037032
```

# References

Robert F. Engle and Clive W. J. Granger. Co-integration and error correction: representation, estimation, and testing. Econometrica: Journal of the Econometric Society, pages 251–276, 1987.

Bernhard Pfaff. Using the vars package. Kronberg im Taunus, 2007.

Bernhard Pfaff et al. VAR, SVAR and SVEC models: implementation within R package vars. Journal of Statistical Software, 27(4):1–32, 2008.

Ruey S. Tsay. Analysis of Financial Time Series, volume 543. John Wiley & Sons, 2005.

## Ro Data Team Blog
