Calculating Non-Linear correlation using Distance correlation with examples

Moving ahead of Pearson’s correlation

Mehul Gupta
Data Science in your pocket



When talking about the correlation between variables in a dataset, most of the time we jump straight to the default implementations for calculating correlation in Python, i.e. using either NumPy or pandas (at times SciPy as well):

#numpy
import numpy
correlation = numpy.corrcoef(x, y)

#pandas
df.corr()


Now, a very important thing we often forget is that both of these default implementations compute Pearson’s correlation, which captures only linear relationships between variables. Hence, it may not be able to capture non-linear relationships between variables. So:

It can capture relations like y=2x, y=0.5x

But not relations like y = 2 + 2e^(x²) or y = 0.5x³ (non-linear relationships)

Also, if you have worked on real-world problems, trust me, you will rarely find perfectly linear relationships.

Another interesting fact: the other ready-made implementations, i.e. the Spearman and Kendall’s tau correlation metrics in NumPy/pandas/SciPy, are rank-based. They capture monotonic relationships (and are well suited to ordinal categorical data) but miss non-monotonic ones, so they don’t solve the general non-linear case either. I won’t be deep diving into the logical explanation of these metrics for now.
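For instance, a minimal sketch of the rank-based behaviour (using SciPy’s spearmanr, with made-up data):

from scipy.stats import spearmanr
import numpy as np

x = np.linspace(0.1, 10, 100)

# Monotonic non-linear relation: the ranks line up perfectly
print(spearmanr(x, x ** 3).correlation)        # 1.0

# Non-monotonic relation: the rank correlation collapses towards 0
print(spearmanr(x, (x - 5) ** 2).correlation)  # close to 0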

We have hit a dead end now!!

Or maybe not. I recently faced the same issue and came across a very interesting metric that can help us capture non-linear correlation between variables: the Distance Correlation metric. So in my 106th post, I will be discussing this correlation.

A big advantage it has over Pearson’s correlation is that it captures both linear and non-linear relationships; hence, Pearson’s output can be taken as a subset of what the Distance Correlation metric captures. So it can be preferred over Pearson’s correlation any day.

Let’s deep dive into Distance Correlation now.

Distance correlation ranges from 0 to 1, where 0 means no correlation and 1 means perfect correlation. Negative or positive? That is not indicated by distance correlation. So, a disadvantage we can see is that it carries no directional knowledge, contrary to Pearson’s correlation.
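A quick illustration of this (a minimal sketch; it uses the dcor package, which is introduced towards the end of this post):

import numpy as np
import dcor

x = np.linspace(0, 10, 100)

# Pearson keeps the sign of the relationship
print(np.corrcoef(x, -x)[0, 1])          # -1.0

# Distance correlation reports strength only, never direction
print(dcor.distance_correlation(x, -x))  # 1.0

We need to know a few concepts before moving ahead.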

Distance matrix

Assuming we have 2 matrices,

  • Matrix A of dimension a x b (‘a’ vectors of size ‘b’ each)
  • Matrix B of dimension m x b (‘m’ vectors of size ’b’ each)

Then the distance matrix is an ‘a x m’ matrix, with each value holding the distance between the corresponding vectors from the two matrices (a small example follows below).

https://en.wikipedia.org/wiki/Distance_matrix

The metric used for calculating the distance can be Euclidean distance, cosine distance, etc.
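For instance, a minimal sketch using SciPy’s Euclidean distance_matrix (the points here are made up for illustration):

import numpy as np
from scipy.spatial import distance_matrix

A = np.array([[0, 0], [0, 1], [1, 0]])  # a=3 vectors of size b=2
B = np.array([[1, 1], [2, 2]])          # m=2 vectors of size b=2

# A 3 x 2 matrix of pairwise Euclidean distances
print(distance_matrix(A, B))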

Distance Covariance

It is calculated between 2 matrices X & Y using the below steps:

  • Calculate 2 distance matrices, one for the pair (X, X) and the other for the pair (Y, Y); each is a sort of self-distance matrix. Let’s denote them by distx and disty. This can be done using SciPy’s implementation of distance_matrix:

from scipy.spatial import distance_matrix
distx = distance_matrix(X, X)
disty = distance_matrix(Y, Y)
  • Center distx and disty. How?

Loop over each value of the two distance matrices and:

  • Add the overall mean of the corresponding distance matrix.
  • Subtract the mean of the row and the mean of the column (of that distance matrix) to which the value belongs.

The below code should give some clarity (where n = matrix dimension):

import copy
import numpy as np

centered_x, centered_y = copy.deepcopy(distx), copy.deepcopy(disty)
for x in range(n):
    for y in range(n):
        # Double centering: add the grand mean, subtract the row and column means
        centered_x[x][y] = (
            centered_x[x][y]
            + np.mean(distx)
            - np.mean(distx[x])
            - np.mean(distx[:, y])
        )
        centered_y[x][y] = (
            centered_y[x][y]
            + np.mean(disty)
            - np.mean(disty[x])
            - np.mean(disty[:, y])
        )

Here, centered_x and centered_y are nothing but deep copies of distx and disty, where distx and disty are the distance matrices calculated between (X, X) and (Y, Y) as explained earlier. After centering the 2 distance matrices:

  • Multiply the 2 centered matrices element-wise, sum all the values, and divide by n², where n = total samples/vectors.
  • Do a square root. We get our Distance Covariance.

import math
math.sqrt(np.sum(centered_x * centered_y) / pow(n, 2))
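Putting the steps together, here is a minimal sketch of a distance_cov helper (the function name is mine; the double centering is vectorized instead of the explicit loops above):

import numpy as np
from scipy.spatial import distance_matrix

def distance_cov(x, y, n):
    # distance_matrix expects 2-D input, so reshape 1-D vectors to (n, 1)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    y = np.asarray(y, dtype=float).reshape(n, -1)
    distx = distance_matrix(x, x)
    disty = distance_matrix(y, y)
    # Double centering: subtract row/column means, add back the grand mean
    cx = distx - distx.mean(axis=0) - distx.mean(axis=1, keepdims=True) + distx.mean()
    cy = disty - disty.mean(axis=0) - disty.mean(axis=1, keepdims=True) + disty.mean()
    # Sum of the element-wise product over n², then the square root
    return np.sqrt(np.sum(cx * cy) / n ** 2)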

Now that we know the 2 major prerequisites, let’s calculate the

Distance Correlation

Calculate the distance covariance between variables A & B.

Calculate the standard deviation term, i.e. the square root of the product of the two self distance covariances. How?

import math

# x, y are the actual vectors/matrices, n = total samples
cov = distance_cov(x, y, n)
std = math.sqrt(distance_cov(x, x, n) * distance_cov(y, y, n))
distance_corr = cov / std
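Wrapped into a function for reuse (again just a sketch, reusing the distance_cov helper and the math import from above):

def distance_corr(x, y):
    # Normalize the distance covariance by the self distance covariances
    n = len(x)
    cov = distance_cov(x, y, n)
    std = math.sqrt(distance_cov(x, x, n) * distance_cov(y, y, n))
    return cov / std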

That’s it!!

How to implement this in Python?

We have a package called dcor that can be used for the same:

#pip install dcor
import dcor

#a, b are NumPy arrays
dcor.distance_correlation(a, b)
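As a quick sanity check, the hand-rolled distance_corr sketch from the previous section should agree with dcor up to floating-point error (the data here is made up):

import numpy as np
import dcor

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = a ** 2  # a non-linear relationship

print(distance_corr(a, b))              # hand-rolled sketch from above
print(dcor.distance_correlation(a, b))  # dcor's implementation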

Should we try out a few relationships to compare Pearson v/s Distance? Yes, we should. So what I have done below is define a few relationships between X & Y (linear and non-linear) and calculate the Pearson and Distance Correlation between the variables. Let’s check out the results.

"""generate linear relationship"""
x_l = np.linspace(0,10,100)
y_l0 = 2.0+0.7*x_l

"""generate exponential linear relationship"""
x_e = np.linspace(0,10,100)
y_e0 = np.exp((x_e+2) ** 0.5)*x_l


"""generate quadriatic relationship"""
x_q = np.linspace(-10,10,100)
y_q0 = 2.0+0.7*x_q**2 + 0.5*x_q


"""generate sinusoidal relationship"""
x_s = np.linspace(-3,2,100)
y_s0 = np.exp(-(x_s+2) ** 2) + np.cos((x_s-2)**2)

Now let’s see what values we get for Pearson and Distance Correlation

print('linear relationship')
print('pearson=',np.corrcoef(x_l,y_l0)[0,1])
print('dcor=',dcor.distance_correlation(x_l,y_l0))

print('\nexponential linear relationship')
print('pearson=',np.corrcoef(x_e,y_e0)[0,1])
print('dcor=',dcor.distance_correlation(x_e,y_e0))

print('\nquadratic relationship')
print('pearson=',np.corrcoef(x_q,y_q0)[0,1])
print('dcor=',dcor.distance_correlation(x_q,y_q0))

print('\nsinusoidal relationship')
print('pearson=',np.corrcoef(x_s,y_s0)[0,1])
print('dcor=',dcor.distance_correlation(x_s,y_s0))

The results

Analyzing the results

Both are pretty much able to capture the linear and the exponential linear relationships.

For the quadratic relationship, dcor has outperformed Pearson but still indicates only a weak correlation.

Similarly for the sinusoidal relationship, dcor looks better, but the numbers aren’t strong enough to conclude a relationship.

Also, the fact that distance_correlation isn’t indicative of the direction of the correlation (negative or positive) does make sense, as relationships like the sinusoidal one won’t have a direction (think it over).

Enough mathematics for today, see you soon
