Is correlation distance a metric?
In the absence of a distance, “close” and “far” are meaningless. To define these notions over a set of abstract mathematical objects, we need to be able to measure the distance between each pair of them. The question is: If the abstract mathematical objects are random variables, then how should we measure the distance between them?
Correlation distance is a popular way of measuring the distance between two random variables with finite variances¹. If the correlation² between two random variables is r, then their correlation distance is defined as d=1-r. However, a proper distance measure must satisfy a few properties, i.e. it must be a metric, and it is not obvious whether correlation distance satisfies them. In this note, we ask whether correlation distance is a metric.
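As a minimal sketch of the definition d=1-r, the snippet below computes the sample Pearson correlation of two sequences and the resulting correlation distance. The helper names (pearson, correlation_distance) and the sample data are assumptions made for illustration; they do not come from any particular library.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """Correlation distance d = 1 - r; it ranges from 0 (r = 1) to 2 (r = -1)."""
    return 1.0 - pearson(x, y)

# Illustrative data (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
flipped = [-a for a in x]  # perfectly anti-correlated with x

d_xy = correlation_distance(x, y)
d_flip = correlation_distance(x, flipped)
print(d_xy, d_flip)
```

Since r lies in [-1, 1], the distance lies in [0, 2], with the largest value reached for perfectly anti-correlated variables.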
Recap: What is a metric?
Suppose we want to define a distance measure between the elements of a set Ω. A metric (a proper distance measure) is a function d:Ω×Ω →R⁺, where R⁺ denotes the non-negative real numbers, with the following properties:
- If the distance of two objects is zero, then they are the same, and vice versa; i.e. d(x,y) = 0 iff x = y.
- It is symmetric, i.e. d(x,y)=d(y,x).
- It satisfies the triangle inequality, i.e. d(x,y)≤d(x,z)+d(z,y).
Since correlation is symmetric, the 2nd property is obviously satisfied for correlation distance. We hence need to study the other two.
1st property: Identity of indiscernibles
Consider the random variables X₁ and X₂ with correlation r₁₂. The correlation distance d₁₂ = 1 - r₁₂ is zero if and only if r₁₂ = 1. Moreover, the correlation between X₁ and X₂ is one if and only if there exist a>0 and b∈ R such that X₁=aX₂+b (with probability one).
In other words, d₁₂ is zero if and only if X₂ can be transformed into X₁ by shifting and scaling alone. This feature makes correlation distance attractive in cases where we need a shift- and scale-invariant distance measure. However, the same feature makes it impossible for correlation distance to be a metric over the set of all random variables with finite variance, because two different random variables can have zero distance. The 1st property can still hold over the set³ of normalized random variables (i.e. the ones with zero mean and unit variance).
Conclusion: Correlation distance has the 1st property over the set of normalized random variables.
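The following sketch illustrates this point on sample data: two different sequences related by X₁=aX₂+b with a>0 have zero correlation distance, and they coincide once both are normalized. The helper names, the data and the values a=3, b=7 are assumptions chosen only for illustration.

```python
import math

def standardize(x):
    """Shift and scale a sequence to zero mean and unit (population) variance."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / std for a in x]

def pearson(x, y):
    """Sample Pearson correlation, computed from the standardized sequences."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# Hypothetical data: x1 is a shifted and scaled copy of x2 (a = 3 > 0, b = 7).
x2 = [0.5, -1.0, 2.0, 3.5, 1.0]
x1 = [3.0 * v + 7.0 for v in x2]

d12 = 1.0 - pearson(x1, x2)

# After normalization the two sequences are identical (up to rounding).
same_after_normalizing = all(
    abs(u - v) < 1e-12 for u, v in zip(standardize(x1), standardize(x2))
)

print(x1 != x2, d12, same_after_normalizing)
```

The raw sequences differ while their distance is zero (up to floating-point error), which is why the 1st property fails on unnormalized variables and holds once each variable is replaced by its normalized version.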
3rd property: Triangle inequality
The statement of the triangle inequality is intuitive: the straight line from your bed to your desk is the shortest path from the bed to the desk. For correlation distance to satisfy this property, the correlations of any three random variables X₁, X₂ and X₃ have to satisfy the inequality
1 - r₁₂ ≤ (1 - r₁₃) + (1 - r₂₃), i.e. r₁₃ + r₂₃ ≤ 1 + r₁₂.
As a consequence, if r₂₃ and r₁₃ are equal to 0.5, then r₁₂ has to be greater than or equal to 0. It is easy to find examples of random variables for which this condition is not satisfied; see the 3rd scenario in my previous note on "a misinterpretation of correlations". In other words, in a world where the distances are measured by correlation distance, you may find a shorter path from your bed to your desk if you first go to your couch, and then, from there go to your desk!
Conclusion: Correlation distance does not satisfy the 3rd property, and hence it is not a proper metric.
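One concrete counterexample can be sketched as follows. It assumes two independent random variables Z₁ and Z₂ with zero mean and unit variance, and builds X₁, X₂ and X₃ as linear combinations of them; the coefficients below are an illustrative choice, not taken from the note referenced above. For such combinations, the correlation is the normalized dot product of the coefficient vectors.

```python
import math

# Coefficients of each variable on the independent, zero-mean, unit-variance
# variables Z1 and Z2 (an assumed construction, for illustration only).
h = math.sqrt(3.0) / 2.0
loadings = {
    "X1": (0.5, h),     # X1 = 0.5*Z1 + (sqrt(3)/2)*Z2
    "X2": (0.5, -h),    # X2 = 0.5*Z1 - (sqrt(3)/2)*Z2
    "X3": (1.0, 0.0),   # X3 = Z1
}

def correlation(u, v):
    """Correlation of two linear combinations of independent unit-variance variables."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distance(p, q):
    """Correlation distance d = 1 - r between two named variables."""
    return 1.0 - correlation(loadings[p], loadings[q])

d12 = distance("X1", "X2")   # direct path, r12 = -0.5
d13 = distance("X1", "X3")   # r13 = 0.5
d32 = distance("X3", "X2")   # r23 = 0.5

triangle_holds = d12 <= d13 + d32
print(d12, d13 + d32, triangle_holds)
```

Here r₁₃ = r₂₃ = 0.5 while r₁₂ = -0.5, so the direct distance d₁₂ is about 1.5, whereas the detour through X₃ has total length of about 1.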
How to make it a metric?
Over the set of normalized random variables, it is easy to show that the Euclidean distance can be expressed in terms of correlations as
‖X₁ - X₂‖ = √E[(X₁ - X₂)²] = √(2 - 2r₁₂) = √(2d₁₂),
where we used E[X₁²] = E[X₂²] = 1 and E[X₁X₂] = r₁₂.
Euclidean distance is a metric, and it is proportional to the square root of correlation distance. Therefore, the square root of correlation distance is also a metric over the set of normalized random variables.
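As a rough numerical check, the sketch below standardizes two sample sequences and compares their root-mean-square difference (the sample analogue of the Euclidean distance between normalized random variables) with √(2(1-r)). It then checks the triangle inequality for the square-root distance on the correlations r₁₂ = -0.5, r₁₃ = r₂₃ = 0.5. The data, the helper name and these correlation values are assumptions chosen for illustration.

```python
import math

def standardize(x):
    """Shift and scale a sequence to zero mean and unit (population) variance."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / std for a in x]

# Hypothetical sample data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
z1, z2 = standardize(x1), standardize(x2)
n = len(z1)

# Sample correlation and the two expressions for the distance.
r12 = sum(a * b for a, b in zip(z1, z2)) / n
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)) / n)
from_correlation = math.sqrt(2.0 * (1.0 - r12))
print(euclidean, from_correlation)

# Square-root correlation distance on illustrative correlations
# (r12, r13, r23) = (-0.5, 0.5, 0.5), which form a valid correlation matrix.
direct = math.sqrt(1.0 - (-0.5))                    # from X1 to X2
detour = math.sqrt(1.0 - 0.5) + math.sqrt(1.0 - 0.5)  # via X3
print(direct, detour, direct <= detour)
```

For these correlations the plain correlation distance violates the triangle inequality (1.5 versus 1), while its square root does not.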
Correlation distance does not satisfy the triangle inequality and hence is not a metric. However, its square root is a metric over the set of normalized random variables.
I am grateful to Hamed Nili for his feedback on this text as well as our useful discussions which were the main source of my motivation for writing this note.
¹ Correlation distance is widely used for clustering, with applications in e.g. neuroscience and bioinformatics. It is also available as a distance option in programming languages, e.g. in MATLAB's pdist function.
² In this text, I always mean Pearson correlation by correlation.
³ Correlation distance can also be considered as a distance measure over the set of equivalence classes of random variables, where two random variables X and Y are considered equivalent whenever there exist a>0 and b ∈ R such that X=aY+b.
Appendix: A comment on vector representation
Consider a set of N normalized random variables with the correlation matrix Σ. Take the nth row of the square root of Σ, which is an N-dimensional vector on the unit sphere (its squared norm is the nth diagonal entry of Σ, which is 1), as the vector representation of the nth random variable. Then the Euclidean distances between these vectors (which are proportional to the square roots of the cosine distances between them) are equal, up to a constant scale factor, to the square roots of the correlation distances between the corresponding random variables.
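A minimal sketch of this representation is given below, with one substitution that should be stated plainly: instead of the symmetric square root of Σ, it uses the Cholesky factor L (with L·Lᵀ = Σ), which is simpler to compute without a linear algebra library. Any factor A with A·Aᵀ = Σ, including the symmetric square root, gives rows with the same norms and the same pairwise distances. The 3×3 matrix is a made-up example.

```python
import math

def cholesky(S):
    """Lower-triangular L with L * L^T = S, for a positive definite matrix S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - partial)
            else:
                L[i][j] = (S[i][j] - partial) / L[j][j]
    return L

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical correlation matrix of three normalized random variables.
sigma = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]

vectors = cholesky(sigma)  # row n represents the nth random variable

# Each row has unit norm, i.e. it lies on the unit sphere.
norms = [math.sqrt(sum(a * a for a in row)) for row in vectors]

# Pairs of (Euclidean distance between rows, sqrt(2 * correlation distance)).
pairs = [
    (euclidean(vectors[i], vectors[j]), math.sqrt(2.0 * (1.0 - sigma[i][j])))
    for i in range(3) for j in range(i + 1, 3)
]
print(norms)
print(pairs)
```

The two numbers in each pair agree up to floating-point error, so the vectors reproduce the square-root correlation distances up to the constant factor √2.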