Data Science tricks: Excited to find a 0.99 correlation? Check for uptrends first!

Ilias Flaounas
6 min read · Feb 7, 2016

--

Have you recently measured a correlation of 0.99? Are you worried that it is suspiciously high? You probably already know that correlation is a strange beast that can easily lead to wrong conclusions… In this blog post I show you how to detect and correct a common pitfall that results in measuring unrealistically high correlations, ones that are statistically significant and not due to multiple testing.

At Atlassian, a common task is to measure correlations between quantities that follow strong uptrends over time, for example, the correlation between the total volume of issues created in JIRA Cloud and the total number of JIRA Projects. Correlations like this come out surprisingly high, up to 0.99. However, we know that this doesn’t make much sense: we have seen fluctuations in one quantity that are not really reflected in the other. So why do we get such a high correlation?

In this article we explore this mysterious behaviour of correlation and discuss some workarounds that let us get real value out of our measurements.

Let’s start by creating two time series in R that are uncorrelated by construction, and measuring their (simple Pearson) correlation:

set.seed(1000)
N <- 100
t <- 1:N
x1 <- rnorm(N, 0.2, 4)
x2 <- rnorm(N, 0.5, 4)
plot(t, x1, col='red', type='l')
lines(t, x2, col='blue')
cor(x1, x2)
Output: 0.059

Great, as expected we have a very low correlation (~0.059) between the two metrics since they are simply… random numbers.

Now, let’s make our example a bit more interesting. Let’s see what happens when the two time series follow a similar linear trend, for example, when they both increase over time:

x1 <- 2 * t + rnorm(N, 0.2, 4)
x2 <- 3 * t + rnorm(N, 0.5, 4)
plot(t, x1, col='red', type='l')
lines(t, x2, col='blue')
cor(x1, x2)
Output: 0.996

Oops… Remember, the fluctuations in these series are random noise and should come out as unrelated to each other. However, in this example we measure a whopping 0.996 correlation. Is something wrong? Well… no. There is a confounding factor that affects both variables and makes them look correlated. Roughly speaking, this factor is simply the passage of time: both variables increase as time goes by. Perhaps surprisingly, and contrary to popular belief, the quantities don’t even have to increase at the same rate to produce a high correlation…

Indeed, any two increasing functions will have a very high correlation, close to 1. For example, let’s measure the correlation of two completely different time series that both increase over time, one quadratic and one linear with some fluctuation:

a <- 2 * t + 3 * sin(t)
b <- t^2
plot(t, a, col='red', type='l')
lines(t, b, col='blue')
cor(a, b)
Output: 0.968

As expected, the correlation comes out very high, equal to 0.968.

Similarly, if one time series is always increasing and the other is always decreasing, their correlation will be close to -1; again, this is independent of the actual form of the functions. For example:

a <- 2 * t + 5 * sin(t)
b <- -t + 3 * cos(t)
plot(t, b, ylim=c(-100, 100), col='red', type='l')
lines(t, a, col='blue')
cor(a, b)
Output: -0.99

So by now, we know that the two time series x1 and x2 are highly correlated, but that is just because they are both increasing over time. We don’t even have to measure. We can just visualise them and see that both of them “go up”.

Correction & re-calculation

Back to the first toy example. How can we remove the effect of the uptrend and compute the correlation of the fluctuations of the two time series? That comparison is much closer to how any non-statistician would interpret the notion of correlation.

There are a few ways to make the correction. A simple solution is to estimate the slope of each time series by linear regression and then remove the linear trend from each series. To do that, we first need to confirm that the assumptions of linear regression hold; for example, that the trend is actually linear and not quadratic. We create two new time series, y1 and y2, from x1 and x2, and then compute their correlation:

set.seed(1000)
N <- 100
t <- 1:N
x1 <- 2 * t + rnorm(N, 0.2, 4)
x2 <- 3 * t + rnorm(N, 0.5, 4)
m1 <- lm(x1 ~ t)
m2 <- lm(x2 ~ t)
y1 <- x1 - m1$coefficients[2] * t
y2 <- x2 - m2$coefficients[2] * t
plot(t, y1, col='red', type='l')
lines(t, y2, col='blue')
cor(y1, y2)
Output: 0.14

Bingo! It is clear from the plot that the two detrended time series are uncorrelated. Indeed, after our correction the correlation drops to 0.14. Thus we can claim that, after removing the linear trend from each time series, the residuals are not correlated.
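As a side note, the lm object already stores the detrended fluctuations as residuals, and correlation is unchanged by subtracting a constant from a series, so correlating the residuals directly gives the same answer with less code. A minimal sketch, repeating the simulation above:

```r
set.seed(1000)
N <- 100
t <- 1:N
x1 <- 2 * t + rnorm(N, 0.2, 4)
x2 <- 3 * t + rnorm(N, 0.5, 4)

# residuals() returns each series minus its fitted line (slope AND intercept),
# which differs from y1/y2 above only by a constant shift per series.
cor(residuals(lm(x1 ~ t)), residuals(lm(x2 ~ t)))
# should match the ~0.14 measured above
```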

The opposite problem: Low correlation for related metrics

Another useful aspect of the problem is when one metric lags behind another; this is roughly the opposite of the problem discussed above. For example, issues created might lag behind MAU. How can you detect and measure correlation in this case? The solution is to measure the cross-correlation.

Let’s start by creating two time series that are, by construction, highly “correlated”. In this example one follows the other with a delay of 5 steps (a step might be a day, a week, or a month):

set.seed(1000)
h <- 5
N <- 100
t <- 1:N
x1 <- rnorm(N, 0.2, 4)
x2 <- x1[(h+1):length(x1)]
x1 <- x1[1:(length(x1) - h)]
t <- t[1:(length(t) - h)]
plot(t, x1, col='red', type='l')
lines(t, x2, col='blue')
cor(x1, x2)
Output: 0.12

The simple correlation is now low (0.12), even though the two time series are so closely related. To detect the presence of a lag we measure the cross-correlation of the metrics. That is simply the linear correlation for different lags of one time series relative to the other:

xcor <- ccf(x1, x2)
max(xcor$acf)
Output: 0.95

Indeed, from the cross-correlation plot we can see that there’s a peak of 0.95 correlation at a lag of 5 steps.
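Instead of reading the peak off the plot, we can pull the lag of the maximum coefficient straight from the object ccf returns. A small sketch repeating the setup above; note that the sign of the reported lag depends on the order of the arguments to ccf:

```r
set.seed(1000)
h <- 5
N <- 100
x1 <- rnorm(N, 0.2, 4)
x2 <- x1[(h + 1):length(x1)]       # x2 runs h steps ahead of x1
x1 <- x1[1:(length(x1) - h)]

xcor <- ccf(x1, x2, plot = FALSE)  # plot = FALSE: we only want the numbers
best <- which.max(xcor$acf)
xcor$lag[best]                     # lag with the strongest correlation
max(xcor$acf)                      # the ~0.95 peak seen in the plot
```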

This approach would also reveal that, in the earlier toy example with the sine and cosine, the two time series are actually correlated once we take the relevant lag into account.

Takeaways

  • Any up-trending or down-trending metrics are highly correlated, independent of their actual form. So, impress your peers by predicting >0.95 correlations all over the place: any quantities that increase over time are going to be highly correlated with each other. That includes global temperature, taxes and the size of the universe…
  • The value of the misleadingly high correlation is not significantly affected by the rate of growth or by local fluctuations.
  • When this problem arises, try removing the trend and then measuring the correlation of the residuals.
  • If you find no correlation but you suspect, or know, that there might be a lag between the metrics, try measuring the cross-correlation and adjusting them accordingly.
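One more detrending option, not used above but common in time-series practice, is first-differencing: instead of fitting a line, correlate the step-to-step changes, which diff() computes. A quick sketch on the same simulated up-trending series:

```r
set.seed(1000)
N <- 100
t <- 1:N
x1 <- 2 * t + rnorm(N, 0.2, 4)
x2 <- 3 * t + rnorm(N, 0.5, 4)

# diff() turns each series into its successive changes; a linear trend
# becomes a constant offset and no longer inflates the correlation.
cor(diff(x1), diff(x2))
```

Differencing needs no assumption that the trend is linear, at the cost of amplifying the noise relative to the signal.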

Of course, before we draw any safe conclusion from correlated data we should check a few more things. For example, we should check that we have enough data points, otherwise we may measure a high correlation just by chance; remember to measure significance if you have few data points. And if we do find a high, significant correlation, it could be attributable not to a direct relation between the quantities but to both being correlated with the same confounding factor.
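For the significance check, base R’s cor.test reports a p-value alongside the Pearson estimate. A minimal sketch, using a strongly related pair for illustration:

```r
set.seed(1000)
x <- rnorm(100)
y <- x + 0.1 * rnorm(100)  # y is x plus a little noise

res <- cor.test(x, y)      # Pearson correlation by default
res$estimate               # correlation close to 1
res$p.value                # tiny p-value: the correlation is significant
```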
