Hunting Correlation In Pairs

Ganesh Jayadevan
Published in The Startup
5 min read · Nov 20, 2020

I wasn’t very clear about the relationship between Pearson’s R and Spearman’s Rho. Researchers seem to favor one over the other depending on whether the relationship between two variables looks linear or non-linear, usually going by an intuitive reading of plots.

Is there a way to do this a bit more systematically? Classify using a 2x2 matrix?

Being a fan of cricket, I thought, why not hunt in pairs like Lillee and Thomson, Walsh and Ambrose, or, closer to home, Kumble and Bhajji? For those interested in bowlers who hunted in pairs, take a look:

https://www.indiatimes.com/sports/it-s-all-about-hunting-in-pairs-meet-the-most-deadly-bowling-duos-in-cricket-history-264615.html

Will correlation be better explained if we took both Pearson’s R and Spearman’s Rho together as a pair?

But, first things first. What is a linear relationship vs. a non-linear relationship?

Simply put, a linear relationship between two variables A and B (we can call them x and y if we like) follows the model y = m*x + c + Error. An example is Distance = Speed*Time + Error. In this case, we say Error has the quality of white noise. That is, the error is not systematic and does not depend on x. So if I travel from Bangalore to Chennai on 100 occasions over a year at various times, the error is likely to come from random factors such as weather, accidents on the road, or road construction, but from nothing related to the distance between Bangalore and Chennai.
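A minimal sketch of this model, assuming a made-up constant speed of 60 km/h, a zero intercept, and Gaussian white noise (none of these numbers come from a real trip log). It simulates 100 trips and then recovers m and c with ordinary least squares; the helper name `fit_line` is only illustrative.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

SPEED = 60.0     # m: hypothetical constant speed, km/h
OFFSET = 0.0     # c: hypothetical intercept, km
NOISE_SD = 5.0   # spread of the white-noise error, km

# 100 trips; the error is drawn independently of the travel time
times = [random.uniform(0.5, 8.0) for _ in range(100)]
distances = [SPEED * t + OFFSET + random.gauss(0.0, NOISE_SD) for t in times]


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


slope, intercept = fit_line(times, distances)
print(f"estimated m = {slope:.2f}, estimated c = {intercept:.2f}")
```

Because the noise does not depend on the travel time, the fitted slope should land close to the speed that generated the data.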

In the previous example, we assumed constant velocity (or speed, if you will). On the other hand, here is an example of a non-linear relationship: from the top of a building, I throw a ball vertically downwards. Distance = u*t + (1/2)*g*t² + Error. Here u is the initial speed and g is the acceleration due to gravity. Throw in acceleration and the relationship becomes non-linear. Speed is not constant anymore. One can reasonably argue this is closer to the real world than the simple-minded case we discussed earlier. The same considerations for Error apply as described earlier. In this case, it could be blowing wind, temperature, pressure, things outside our control that we could consider as white noise.
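To see the non-linearity in numbers, here is a small sketch that assumes an initial speed of 2 m/s, g = 9.81 m/s², and readings every half second (all illustrative values). With constant speed, the distance covered in each interval would be the same; with acceleration, each interval covers more than the last.

```python
import random

random.seed(0)  # fixed seed so the noisy readings are repeatable

U = 2.0    # hypothetical initial speed, m/s
G = 9.81   # acceleration due to gravity, m/s^2
DT = 0.5   # hypothetical time between readings, s


def fall_distance(t, u=U, g=G):
    """Distance = u*t + (1/2)*g*t^2, without the error term."""
    return u * t + 0.5 * g * t ** 2


times = [DT * k for k in range(11)]
clean = [fall_distance(t) for t in times]
# what a measurement might look like with a little white noise added
noisy = [d + random.gauss(0.0, 0.05) for d in clean]

# distance covered in each successive half-second interval
steps = [later - earlier for earlier, later in zip(clean, clean[1:])]
for t, step in zip(times[1:], steps):
    print(f"up to t={t:.1f}s: {step:.3f} m")
```

The interval distances grow by a constant g*DT² each time, which is the signature of a quadratic rather than a straight line.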

When closed-form expressions like the ones above are not available, how does one tell the nature of the relationship between two variables?

What are Pearson’s R and Spearman’s Rho?

There is a good reference that explains R vs. Rho in detail:

https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8
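As a rough companion to that reference, here is a from-scratch sketch of both coefficients. It assumes the usual definitions: R is the covariance of the two variables divided by the product of their spreads, and Rho is R computed on the ranks, with tied values sharing an average rank. The function names and the five-point sample are illustrative and not taken from the linked article.

```python
def pearson_r(xs, ys):
    """Pearson's R: covariance scaled by the spread of each variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's Rho: Pearson's R applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


a = [1, 2, 3, 4, 5]
b = [x ** 2 for x in a]  # rises steadily, but not along a straight line
print(f"R   = {pearson_r(a, b):.4f}")
print(f"Rho = {spearman_rho(a, b):.4f}")
```

Since Rho only looks at the ordering, any strictly increasing B gives the same ranks as A, while R still reacts to the curvature.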

The Analytical Approach

One can look at scatter plots, but if we had too many variables, a numeric and analytical method could speed up matters.
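With many variables, one option is to loop over every pair of columns and print both coefficients side by side. The sketch below assumes four hypothetical columns (a straight line, a cubic curve, and unrelated noise built around a base column A); the column names and helper functions are illustrative only.

```python
import random
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson's R from the textbook definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def simple_ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(xs, ys):
    """Spearman's Rho as Pearson's R on the ranks."""
    return pearson_r(simple_ranks(xs), simple_ranks(ys))


random.seed(7)  # fixed seed so the noise column is repeatable
base = list(range(1, 31))
columns = {
    "A": base,
    "B": [2 * x + 1 for x in base],           # straight line in A
    "C": [x ** 3 for x in base],              # monotonic curve in A
    "D": [random.gauss(0, 1) for _ in base],  # unrelated noise
}

results = {}
for left, right in combinations(columns, 2):
    r = pearson_r(columns[left], columns[right])
    rho = spearman_rho(columns[left], columns[right])
    results[(left, right)] = (r, rho)
    print(f"{left} vs {right}: R = {r:+.3f}, Rho = {rho:+.3f}")
```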

What are we trying to understand between two variables?

1. Is there a relationship between two variables?

2. How strong is the relationship between the variables?

3. If there is a relationship, is it linear?

4. If there is a relationship, is it non-linear?
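One possible way to turn these four questions into a rule is to compare both coefficients against a cut-off and read the answer off a 2x2 grid. The sketch below is only an assumption about how such a grid could be labelled: the 0.8 cut-off, the function name `classify`, the cell descriptions, and the sample (R, Rho) pairs are all made up for illustration.

```python
def classify(r, rho, strong=0.8):
    """Place a pair of variables into one cell of a 2x2 grid.

    `strong` is a hypothetical cut-off, not a value from this article.
    The sign is ignored, so strong negative relationships count too.
    """
    high_r = abs(r) >= strong
    high_rho = abs(rho) >= strong
    if high_r and high_rho:
        return "high R, high Rho: likely linear"
    if high_rho:
        return "low R, high Rho: likely monotonic but non-linear"
    if high_r:
        return "high R, low Rho: unusual, check for outliers"
    return "low R, low Rho: no clear monotonic relationship"


# made-up (R, Rho) pairs, one per situation of interest
for r, rho in [(0.05, 0.02), (0.98, 0.97), (0.60, 0.95)]:
    print(f"R = {r:.2f}, Rho = {rho:.2f} -> {classify(r, rho)}")
```

The "low R, low Rho" cell is worded cautiously because a relationship that rises and then falls, such as a parabola, can score low on both coefficients.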

Here is a table I will fill as we go along.

I came up with some datasets of my own to examine the relationship, and the goal is to try to fill the table. I use two variables, A and B.

Case 1: A and B Are Not Correlated

Here is the dataset for variables A, B, and the corresponding ranking:

The scatter plot for the A and B columns is shown below:

Clearly, there is not much of a correlation between the two variables, and this is reflected in both R and Rho:

So, we now get our first and second entries into our 2x2 matrix:

Case 2: Both A and B Are Linearly Correlated

The scatter plot for this dataset is shown below:

As expected, we find that both Rho and R have high values.

So, we can now fill the third entry in our 2x2 matrix:

Case 3: Both A and B Are Non-linearly Correlated

I’ve included only 20 rows to be able to fit into one page.

The scatter plot shows three things:

  1. There is a correlation between A and B.
  2. B increases monotonically with A.
  3. B is not linearly correlated with A.

The scatter plot looks like the one below:

Let’s look at what Mr. Spearman and Mr. Pearson have to say:

So, we can now fill the fourth entry in our 2x2 matrix:

I say ‘low R’, but .61 is not really very low. I’ve spent a few hours trying to find a y=f(x) that has a Rho close to 1 and an R-value below .5, but I’ve had no luck.
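One way to continue that search more systematically is to try a few strictly increasing functions over different ranges of x and compare R with Rho for each. The sketch below assumes three candidate functions and two ranges chosen purely for illustration; it is not the data used in Case 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson's R from the textbook definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def simple_ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# hypothetical candidates, all strictly increasing for x >= 1
candidates = {
    "x**3": lambda x: x ** 3,
    "x**10": lambda x: x ** 10,
    "exp(x)": math.exp,
}

results = {}
for n_points in (20, 40):
    xs = list(range(1, n_points + 1))
    for name, func in candidates.items():
        ys = [func(x) for x in xs]
        r = pearson_r(xs, ys)
        rho = pearson_r(simple_ranks(xs), simple_ranks(ys))
        results[(name, n_points)] = (r, rho)
        print(f"{name:7s} x=1..{n_points}: R = {r:.3f}, Rho = {rho:.3f}")
```

Rho stays at 1 for every candidate because the ordering never changes. Steeper growth over a wider range lets a handful of the largest values dominate the straight-line fit, so the exponential over the longer range is the case to watch when looking for an R below .5.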

If Rho comes, can R be far behind? Apologies to P. B. Shelley.
