The Correlation Controversy

Part II: The Problem

Published in Amdax Asset Management · 5 min read · Aug 5, 2022

In Part I of this article series, I discussed why defining a crypto asset class matters and why the correlation of price dynamics plays an important part in this context. This article will dive into the complexities of correlation. I will explain why it is so hard to come up with a single measure that tells us to what extent two time series move together.

An example

To best illustrate the problem, consider the two self-generated (synthetic) daily time series shown in the figure below. At first glance, they look more or less perfectly correlated. Calculating a sample correlation coefficient using the raw (untransformed) series gives 0.97. This fully supports our expectation, so it should be correct, right? Well, no. This approach overlooks one crucial aspect, and ignoring it can lead to a conclusion that is not just lacking in nuance but completely incorrect. I will explain the problem using two different arguments: empirical and theoretical.
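As a minimal sketch of this first calculation, the snippet below regenerates two series with the recipe from the appendix and computes the Pearson correlation on the raw levels. The helper name `make_series` and the plain integer index are my own assumptions for illustration, and the exact figure depends on the random draw.

```python
import numpy as np
import pandas as pd


def make_series(n=577, seed=3):
    """Trend + sine wave + negatively correlated noise (appendix recipe)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1, -1], [-1, 2]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n)
    t = 0.05 * np.arange(n)
    signal = t + 2 * np.sin(t)  # component shared by both series
    return pd.Series(signal + eps[:, 0]), pd.Series(signal + eps[:, 1])


y1, y2 = make_series()

# Pearson correlation on the untransformed levels
raw_corr = y1.corr(y2)
print(f"correlation of raw levels: {raw_corr:.2f}")
```

The shared trend dominates the variance of both series, which is why the value lands close to 1 regardless of what the noise is doing.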

Empirical argument

For the empirical line of thinking, let’s try to work out what correlation actually means. Start by taking all the days on which y1 increases. If y1 and y2 are positively correlated, we of course expect y2 to also increase on most of those days. Negative correlation would imply that y2 generally moves in the opposite direction of y1. If the series are independent, y2 increases on around half of the days and it decreases on the other half. A similar logic obviously also applies to days on which y1 decreases.

Let’s apply this to our example. y1 increases on 295 out of 577 days. And out of these 295 days, y2 only increases on 79 days. In other words, y2 only increases on 27% of all days on which y1 increases. Similarly, y2 also only decreases on 26% of all days on which y1 decreases. This means that the two time series move in opposite directions almost 75% of the time. Anyone would thus conclude that they are more negatively than positively correlated.
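The counting exercise above can be sketched as follows. This is an illustration under assumptions: the series are regenerated with the appendix recipe (the helper `make_series` is a hypothetical name), "increases" is read as a positive day-to-day change, and the exact counts depend on the random draw.

```python
import numpy as np
import pandas as pd


def make_series(n=577, seed=3):
    """Trend + sine wave + negatively correlated noise (appendix recipe)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1, -1], [-1, 2]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n)
    t = 0.05 * np.arange(n)
    signal = t + 2 * np.sin(t)
    return pd.Series(signal + eps[:, 0]), pd.Series(signal + eps[:, 1])


y1, y2 = make_series()

# Day-to-day changes; the first observation has no predecessor and is dropped
dy1 = y1.diff().dropna()
dy2 = y2.diff().dropna()

up1 = dy1 > 0  # days on which y1 increases

# Share of y1-up days on which y2 also rises, and the mirror image for down days
share_up_given_up = (dy2[up1] > 0).mean()
share_down_given_down = (dy2[~up1] < 0).mean()

print(f"y1 up on {up1.sum()} days")
print(f"y2 also up on {share_up_given_up:.0%} of those days")
print(f"y2 also down on {share_down_given_down:.0%} of y1's down days")
```

Both shares come out well below one half, which is the signature of series that mostly move in opposite directions from one day to the next.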

Theoretical argument

The theoretical explanation boils down to one of the underlying assumptions of the formula for calculating a sample correlation coefficient. In order for the formula to be reliable, both time series must have a time-invariant sample mean. In other words, the mean of the time series should be the same for all subperiods within the full sample period. This is clearly not the case for our synthetic series. Their sample mean is somewhere between 0 and 5 in the first four months and ends up well above 20 at the end of the sample period.
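For reference, the standard sample (Pearson) correlation coefficient is written out below. The generic symbols x and y are my own notation and stand for any two series of length n.

```latex
r_{xy} = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}
              {\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,
               \sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t,
\quad
\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t
```

Every observation is measured against a single full-sample mean. When the level of a series drifts over time, those deviations mostly capture the drift rather than the day-to-day behaviour.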

To put it even more simply: loosely speaking, the sample correlation coefficient measures how consistently both series are on the same side of their sample mean at the same time. Now, y1 and y2 share the same sample mean at around 14.5. We can easily see that both series are below 14.5 in the first two months and above 14.5 in the last three months. Evidently, this measure of correlation lacks nuance here, as any two series that happen to trend in the same direction would appear almost perfectly positively correlated.
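To illustrate the last point, here is a small sketch with two series that share nothing but an upward trend. The seed, the noise scale, and the variable names are arbitrary assumptions chosen for illustration and do not come from the article's own simulation.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
n = 577
trend = 0.05 * np.arange(n)

# Independent noise: the random parts of a and b are unrelated by construction
a = trend + rng.normal(0, 1, n)
b = trend + rng.normal(0, 1, n)

# Correlation of the levels is driven almost entirely by the common trend
corr_levels = np.corrcoef(a, b)[0, 1]

# Correlation of the noise alone (trend removed) hovers around zero
corr_noise = np.corrcoef(a - trend, b - trend)[0, 1]

print(f"levels: {corr_levels:.2f}, noise only: {corr_noise:.2f}")
```

The levels look strongly correlated even though the random components are independent, so the high number says more about the trend than about any genuine co-movement.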

Searching for a solution

Both the empirical and theoretical reasoning point in the same direction. In order to identify true correlation, we should consider day-to-day changes rather than the actual raw values. Using first differences in the sample correlation coefficient yields a value of -0.69, confirming that y1 and y2 are negatively correlated.
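A minimal sketch of the differencing step, again regenerating the series with the appendix recipe (the helper `make_series` is a hypothetical name, and the exact value depends on the random draw):

```python
import numpy as np
import pandas as pd


def make_series(n=577, seed=3):
    """Trend + sine wave + negatively correlated noise (appendix recipe)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1, -1], [-1, 2]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n)
    t = 0.05 * np.arange(n)
    signal = t + 2 * np.sin(t)
    return pd.Series(signal + eps[:, 0]), pd.Series(signal + eps[:, 1])


y1, y2 = make_series()

# First differences remove the slowly moving level; Series.corr skips the
# missing value that differencing leaves at the start
diff_corr = y1.diff().corr(y2.diff())
print(f"correlation of first differences: {diff_corr:.2f}")
```

Differencing nearly cancels the shared component, so what remains is dominated by the noise terms, whose covariance is negative by construction.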

Regrettably, this seemingly easy fix comes with its own problem. Even though y1 and y2 generally trend upwards in the long term, they both experience downtrends at the same points in time. This has to count for something in our search for correlation, right? Taking first differences narrows our view so much that we can no longer recognise these medium-term features. To fix this “short-sightedness”, let’s take month-on-month differences and see what happens to the correlation coefficient. It now equals 0.24. Hence, on a monthly basis, y1 and y2 appear to move much more independently of each other.
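The effect of the differencing horizon can be sketched as below. I am assuming that "month-on-month" means a 30-day lag, which the article does not spell out; the helper `make_series` is a hypothetical name, and the exact values depend on the random draw.

```python
import numpy as np
import pandas as pd


def make_series(n=577, seed=3):
    """Trend + sine wave + negatively correlated noise (appendix recipe)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1, -1], [-1, 2]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n)
    t = 0.05 * np.arange(n)
    signal = t + 2 * np.sin(t)
    return pd.Series(signal + eps[:, 0]), pd.Series(signal + eps[:, 1])


y1, y2 = make_series()

# Same formula, two horizons: 1-day changes versus 30-day changes (assumed lag)
daily_corr = y1.diff(1).corr(y2.diff(1))
monthly_corr = y1.diff(30).corr(y2.diff(30))

print(f"1-day differences:  {daily_corr:.2f}")
print(f"30-day differences: {monthly_corr:.2f}")
```

Over 30 days the shared sine wave has time to move both series in the same direction, which offsets part of the negative noise covariance and pulls the coefficient upwards.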

So which is it? Are y1 and y2 negatively correlated, independent, or were they positively correlated all along? The answer probably lies somewhere in the middle. It can even vary over time. This is the reason that a lot of analysts turn to rolling correlations. Such an approach would look something like this in our case:

Clearly, rolling correlations can vary significantly over time, with the 30-day correlations ranging somewhere between -0.85 and -0.45. This makes it practically impossible to pinpoint a number that captures the co-movement of two time series over a certain time period. Moreover, this approach introduces additional ambiguity to the measure, because we can opt for rolling window sizes of 30 or 90, but any other number might also be a valid choice. And of course, no one knows which size is correct and any choice has its own implications.
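A rolling correlation along these lines can be sketched with pandas. I am assuming the rolling windows are applied to first differences, which seems consistent with the range quoted above but is not stated explicitly; the helper `make_series` is a hypothetical name, and the exact values depend on the random draw.

```python
import numpy as np
import pandas as pd


def make_series(n=577, seed=3):
    """Trend + sine wave + negatively correlated noise (appendix recipe)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1, -1], [-1, 2]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n)
    t = 0.05 * np.arange(n)
    signal = t + 2 * np.sin(t)
    return pd.Series(signal + eps[:, 0]), pd.Series(signal + eps[:, 1])


y1, y2 = make_series()
d1, d2 = y1.diff(), y2.diff()

# One column per window size; entries stay NaN until a full window is available
rolling = pd.DataFrame({w: d1.rolling(w).corr(d2) for w in (30, 90)})

# Lowest and highest rolling correlation observed for each window size
summary = rolling.agg(["min", "max"])
print(summary.round(2))
```

Changing the tuple `(30, 90)` to other window sizes is a quick way to see how much the picture depends on that choice.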

Conclusion

To side with the critical reader, I definitely constructed these time series such that they would yield the seemingly contradictory result I needed to prove my point. But the fact that I am able to do so is enough to demonstrate how complex the principle of correlation can be in extreme cases. After all, there is no reason to believe that the unpredictable world we live in cannot produce something as paradoxical as this Python script.

In this article, we have touched upon the most basic aspects of correlation. Transforming raw series into differences validates the use of the sample correlation coefficient formula. For a good correlation analysis, however, we should consider multiple timeframes and maybe even time-varying properties. Which of these complexities apply to real-world problems? In particular, how do the findings in this article help us to identify the dependence or independence between crypto and other asset classes? We’ll find out in the next part.

Appendix

In this simulation study, I created two synthetic time series according to the following equations:
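Read off the code below, the data-generating process appears to be the following. The notation is my own reconstruction, with t = 0, 1, ..., n - 1 indexing the days.

```latex
y_{1,t} = 0.05\,t + 2\sin(0.05\,t) + \varepsilon_{1,t}, \qquad
y_{2,t} = 0.05\,t + 2\sin(0.05\,t) + \varepsilon_{2,t},

\begin{pmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}
\right)
```

Both series share the same trend and sine wave, while the noise terms have a correlation of -1/√2 ≈ -0.71.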

The code below can be used to reproduce my findings.

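from datetime import datetime

import numpy as np
import pandas as pd

# Daily index: 2021-01-01 to 2022-07-31 gives the 577 observations quoted in the text
dates = pd.date_range(datetime(2021, 1, 1), datetime(2022, 7, 31))
n = len(dates)

# Seeded generator so that the draw is reproducible
random_seed = 3
rng = np.random.default_rng(random_seed)

# Bivariate normal noise with negative covariance between the two series
mu = np.array([0, 0])
sigma = np.array([[1, -1],
                  [-1, 2]])
epsilon = rng.multivariate_normal(mu, sigma, size=n)

# Shared deterministic component: linear trend plus a sine wave
t = 0.05*np.arange(0, n)
y1 = pd.Series(t + 2*np.sin(t) + epsilon[:, 0], index=dates)
y2 = pd.Series(t + 2*np.sin(t) + epsilon[:, 1], index=dates)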