Spearman’s correlation analysis for paired data

Spearman’s correlation and significance test

Nikhil Raghute
Analytics Vidhya
6 min read · Dec 25, 2019


Photo by Gláuber Sampaio on Unsplash

Reference: SNACKS data

a) Find the Spearman correlation matrix of all the ordinal attributes.

b) Determine the coefficient of determination.

c) Interpret the results from the two tables.

d) In each case, perform the significance test with a 95% confidence level.

We’ll cover the concepts of Spearman’s correlation coefficient, the coefficient of determination, and the significance test.

Correlation measures the association between two variables, i.e. how strongly a pair of variables is related.

  • r = 0 implies there is no correlation.
  • r = +1 (perfect positive correlation).
  • r = -1 (perfect negative correlation).
  • The value of r nearer to +1 or -1 indicates a high degree of correlation between the two variables.

Charles Spearman’s coefficient of correlation:

  • It is used to find a correlation coefficient between two ordinal attributes.
  • This correlation measurement is also called the Rank correlation.
  • This technique is applicable to determine the degree of correlation between two variables in the case of ordinal data.
  • It assesses how well the relationship between two variables can be described using a monotonic function.

We can find rs as follows:

First, we’ll calculate the ranks within each column, take the difference between the ranks of each pair, and sum the squares of those differences.

After that, we can use the formula below:

rs = 1 - (6 * Σdi^2) / (n * (n^2 - 1))

where di = the difference between the ranks of the i-th pair of the two variables,

n = the number of pairs of observations,

and -1 <= rs <= 1.
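To make the procedure concrete, here is a minimal Python sketch of this formula (an illustration, not from the original article; it assumes the two columns are passed in as pandas Series, and ties receive average ranks, as in the hand calculation below):

```python
import pandas as pd

def spearman_rs(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rank correlation via the sum-of-squared-rank-differences formula."""
    rx = x.rank()                   # ranks of x (ties get average ranks)
    ry = y.rank()                   # ranks of y
    d2 = ((rx - ry) ** 2).sum()     # sum of squared rank differences, Σd^2
    n = len(x)
    return 1 - (6 * d2) / (n * (n ** 2 - 1))
```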

Coefficient of Determination:

  • It measures the proportion of the variability in the dependent variable that is explained by the independent variable.
  • It is the square of the correlation coefficient (r), and thus varies between 0 and 1.
  • An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
  • An R2 of 1 means the dependent variable can be predicted without error from the independent variable.

Significance Test:

We can carry out the significance test in 5 steps:

Step 1- Defining the hypothesis.

Step 2- Finding rs (using the ranks).

Step 3- Finding the critical rs value from Spearman’s table/graph for the given DOF and significance level.

Step 4- Checking whether the calculated rs is higher or lower than the critical rs from the table/graph.

Step 5- Rejecting H0 (if the calculated rs is higher) or failing to reject it.

Finally, we state the conclusion.
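As an aside, scipy.stats.spearmanr returns both rs and a two-sided p-value, so Steps 3-4 can equivalently be done by comparing the p-value with α. A minimal sketch, assuming the SNACKS data is already loaded into a DataFrame called data (as in the Calculations section below):

```python
from scipy import stats

# Compare the p-value from spearmanr with alpha instead of looking up the table
rs, p_value = stats.spearmanr(data["Saltiness"], data["Liking scores"])

alpha = 0.05  # 95% confidence level, two-tailed
if p_value < alpha:
    print(f"rs = {rs:.3f}, p = {p_value:.4f} -> reject H0")
else:
    print(f"rs = {rs:.3f}, p = {p_value:.4f} -> fail to reject H0")
```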

Calculations:

Pearson’s correlation is the most commonly used; we can find the (Pearson) correlation matrix in Python as follows:

data.corr() # data is DataFrame of SNACKS data
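For completeness, here is a minimal sketch of how the DataFrame might be set up (the file name is an assumption; pandas’ corr also accepts method="spearman" for the rank-based matrix):

```python
import pandas as pd

# Load the SNACKS data (file name is assumed; adjust the path to your copy)
data = pd.read_csv("snacks.csv")

pearson_matrix = data.corr()                    # Pearson correlation (default)
spearman_matrix = data.corr(method="spearman")  # Spearman's rank correlation
print(pearson_matrix)
```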

Correlation matrix

a) Spearman’s correlation coefficients-

Let’s calculate Spearman’s correlation coefficients (rs) for our “SNACKS” data.

We have our SNACKS dataset stored in the “data” DataFrame.

We can see its first 5 rows using:

data.head()

We can try plotting each feature against the target variable.

For example, in the scatterplot of Saltiness vs Liking scores the points look very scattered; from the plots alone we can’t say whether there is any correlation between them.
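One such scatterplot could be drawn like this (a minimal matplotlib sketch, assuming the column names “Saltiness” and “Liking scores”):

```python
import matplotlib.pyplot as plt

# Scatterplot of one feature against the target variable
plt.scatter(data["Saltiness"], data["Liking scores"], alpha=0.6)
plt.xlabel("Saltiness")
plt.ylabel("Liking scores")
plt.title("Saltiness vs Liking scores")
plt.show()
```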

Now, let’s calculate rs.

Using Python’s scipy.stats.spearmanr, we can calculate Spearman’s correlation matrix.

But since we’re only interested in the rs for the pairs between each feature and the target variable, and to explain the calculation in more detail, let’s compute them separately, one by one.
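As a cross-check on the hand calculations that follow, the pairwise rs values can also be obtained directly (a sketch assuming the feature column names below match the data):

```python
from scipy import stats

features = ["Saltiness", "Sweetness", "Acidity", "Crunchiness"]
target = "Liking scores"

# Spearman's rs between each feature and the target variable
for col in features:
    rs, p = stats.spearmanr(data[col], data[target])
    print(f"{col} - {target}: rs = {rs:.3f}")
```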

i) Saltiness- Liking scores

We can take the “Saltiness” and “Liking scores” columns, compute their ranks, and then follow the procedure discussed in the theory part:
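A minimal sketch of this step (building the rank table and summing the squared differences; the helper column names are illustrative):

```python
import pandas as pd

# Rank table for the Saltiness - Liking scores pair
ranks = pd.DataFrame({
    "rank_saltiness": data["Saltiness"].rank(),
    "rank_liking": data["Liking scores"].rank(),
})
ranks["d"] = ranks["rank_saltiness"] - ranks["rank_liking"]
ranks["d_squared"] = ranks["d"] ** 2

print(ranks.head())              # first 5 rows of the table
print(ranks["d_squared"].sum())  # Σd^2 over all 100 pairs
```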

Here are the first 5 rows of the table; their squared rank differences (d^2) are 2756.25, 4.00, 361.00, 4.00 and 2450.25.

Summing d^2 over all 100 pairs gives Σd^2 = 113467.0

n = 100

Therefore,

rs = 1 - (6 * 113467.0) / (100 * (100^2 - 1)) = 0.3191299 ≈ 0.319

Similarly, we can calculate for all other pairs.

ii) Sweetness- Liking scores

Σd^2 = 149718.5

n = 100

rs = 1 - (6 * 149718.5) / (100 * (100^2 - 1)) = 0.1015992 ≈ 0.102

iii) Acidity- Liking scores

Σd^2 = 161404.5

n = 100

rs = 1 - (6 * 161404.5) / (100 * (100^2 - 1)) = 0.0314761 ≈ 0.031

iv) Crunchiness- Liking scores

Σd^2 = 81737.0

n = 100

rs = 1 - (6 * 81737.0) / (100 * (100^2 - 1)) = 0.5095290 ≈ 0.509

So finally we have Spearman’s correlation coefficients (rs) for the different pairs:

  • Saltiness- Liking scores: 0.319
  • Sweetness- Liking scores: 0.102
  • Acidity- Liking scores: 0.031
  • Crunchiness- Liking scores: 0.509

b) Coefficient of Determination(R2):

The coefficient of determination explains how much of the variability in one factor is accounted for by its relationship with another factor.

Since R2 = rs*rs

Coefficient of Determination for:

  • Saltiness- Liking scores = 0.319*0.319 = 0.102
  • Sweetness- Liking scores = 0.102*0.102 = 0.010
  • Acidity- Liking scores = 0.0314761*0.0314761 ≈ 0.00099
  • Crunchiness- Liking scores = 0.509*0.509 = 0.259
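These products can be reproduced with a short loop (a minimal sketch using the rounded rs values from above, so the last digits may differ slightly from the unrounded calculation):

```python
# Rounded rs values for each feature - target pair (from section a)
rs_values = {
    "Saltiness": 0.319,
    "Sweetness": 0.102,
    "Acidity": 0.031,
    "Crunchiness": 0.509,
}

# Coefficient of determination: R2 = rs * rs
for feature, rs in rs_values.items():
    print(f"{feature} - Liking scores: R2 = {rs ** 2:.4f}")
```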

c) Interpreting the results from (a) and (b):

From (a), based on the calculated values of rs, we can say that “Saltiness” and “Crunchiness” are fairly rank-correlated (a fair monotonic relation) with “Liking scores”, while “Sweetness” and “Acidity” are only weakly correlated with it.

From (b), based on the calculated coefficients of determination (R2), the proportion of variance in the dependent variable explained by its relationship with the independent variable is higher for “Saltiness” and “Crunchiness” than for “Sweetness” and “Acidity”.

d) Significance Test:

We can use Spearman’s coefficient as a statistical method for testing a hypothesis.

Confidence level = 95%

So α = 5% = 0.05 (two-tailed test)

Hypothesis:

H0: The variables do not have a rank-order relationship in the data.

To reject H0 is to say that there is a rank-order relationship between the variables in the data.

N = 100

Degrees of Freedom (DOF) = 100 - 2 = 98

α = 0.05

From Spearman’s rank correlation coefficient graph and table, the critical value of rs for this DOF and α is 0.199.

i) Saltiness- Liking scores

rs = 0.319

and the critical rs from Spearman’s rank significance table is 0.199.

As 0.319 > 0.199, we reject the null hypothesis, i.e. at the 95% confidence level the rank-order relationship between “Saltiness” and “Liking scores” is significant (not random).

Similarly, we can test the remaining pairs:

ii) Sweetness- Liking scores( rs = 0.102)

As 0.102 < 0.199, we fail to reject the null hypothesis; the variables do not have a significant rank-order relationship in the data.

iii) Acidity- Liking scores( rs = 0.031)

As 0.031 < 0.199, we fail to reject the null hypothesis; the variables do not have a significant rank-order relationship in the data.

iv) Crunchiness- Liking scores(rs = 0.509)

As 0.509 > 0.199, we reject the null hypothesis, i.e. the relationship is significant (not random).
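The four decisions above can be summarised in a short loop (a sketch using the rs values and the critical value 0.199 quoted earlier):

```python
# Critical rs for n = 100 and alpha = 0.05 (two-tailed), from the significance table
critical_rs = 0.199

rs_values = {
    "Saltiness": 0.319,
    "Sweetness": 0.102,
    "Acidity": 0.031,
    "Crunchiness": 0.509,
}

for feature, rs in rs_values.items():
    decision = "reject H0" if abs(rs) > critical_rs else "fail to reject H0"
    print(f"{feature} - Liking scores: rs = {rs:.3f} -> {decision}")
```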

Experimental Results:

  • As the rs values for Sweetness-Liking scores and Acidity-Liking scores are very low (near 0), we can conclude that Sweetness and Liking scores are not strongly rank-correlated, and the same holds for the Acidity-Liking scores pair.
  • As the rs values for Saltiness-Liking scores and Crunchiness-Liking scores are nearer to 0.5, those attributes are fairly rank-correlated. Since the rs value for Crunchiness-Liking scores is the highest, these variables are more correlated than any other pair.
  • Similarly, the proportion of the variability in one factor explained by its relationship to another factor is in the order: Crunchiness-Liking scores > Saltiness-Liking scores > Sweetness-Liking scores > Acidity-Liking scores.

i.e. “Liking scores” can be predicted from “Crunchiness” with less error than from the other features.

  • From the significance test:

i) We reject the null hypothesis for Saltiness-Liking scores and Crunchiness-Liking scores. We can conclude that there is a significant relationship (i.e. not random) between Saltiness & Liking scores, and between Crunchiness & Liking scores.

ii) For Sweetness-Liking scores and Acidity-Liking scores, we fail to reject the null hypothesis (H0) and conclude that “Sweetness” and “Liking scores” do not have a significant rank-order relationship; the same holds for “Acidity” and “Liking scores”.

Here is the GitHub repository link to the data and code:

https://github.com/nraghute/Data-Science/tree/master/Spearman's%20rank%20correlation%20coefficient

THANK YOU!
