Spearman’s Correlation

Swapnilbobe
Published in Analytics Vidhya
Mar 7, 2021

  • Spearman’s Correlation can be used as a feature selection method.
  • Spearman’s Correlation measures the strength and direction of the monotonic relationship between two variables.

What is a Monotonic Relationship?

  1. A relationship is monotonic when, as the value of one variable increases, the value of the other variable consistently increases (or consistently decreases), though not necessarily at a constant, linear rate.
  2. Look at the image below for a better understanding.
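
To make the definition concrete, here is a minimal sketch that checks whether one variable never decreases as the other increases. The helper name `is_monotonic_increasing` and the sample values are illustrative assumptions, not part of the original article.

```python
def is_monotonic_increasing(xs, ys):
    """Return True if ys never decreases as xs increases."""
    # Sort the pairs by x, then check that consecutive y values never go down.
    pairs = sorted(zip(xs, ys))
    return all(y1 <= y2 for (_, y1), (_, y2) in zip(pairs, pairs[1:]))

# Illustrative data (assumed): y = x**3 grows faster and faster,
# so the relationship is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]
# A relationship that goes up and then comes back down is not monotonic.
y_hump = [1, 4, 9, 4, 1]

print(is_monotonic_increasing(x, y_cubic))  # True
print(is_monotonic_increasing(x, y_hump))   # False
```

The cubic example is the kind of relationship that rank-based measures handle well, because only the ordering of the values matters, not the spacing between them.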

Mathematics Behind Spearman’s Correlation

  • Spearman’s Correlation is based on the ranks of the values of each variable rather than on the raw values.
  • We need to assign a rank to every value of each variable.
  • Consider the following example.

Example:

  • We create a student dataset that contains the students' marks.
import numpy as np
import pandas as pd

english = np.array([67,89,88,90,95])
maths = np.array([77,86,98,95,87])

d = {'english':english, 'maths':maths}
  • Using the above dictionary, we create a pandas DataFrame.
data = pd.DataFrame(d)
data
  • Below is our students' DataFrame.
Dataset
  • Now, we have to rank the values of each variable in increasing order (the smallest value gets rank 1).
  • So, we have created an array of ranks for each column.
english_rank = np.array([1,3,2,4,5])
maths_rank = np.array([1,2,5,4,3])
  • Add these ranks to our data frame.
data['english_rank'] = english_rank
data['maths_rank'] = maths_rank
data
  • This is our final data frame.
Dataset with ranks
  • See the above table to understand how we have ranked the variables.
  • The dataset we have considered does not contain duplicated values, so here we can use formula 1.
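
The ranks above were typed in by hand. As a rough sketch, they can also be derived from the marks themselves. The helper `rank_values` below is a hypothetical name and assumes there are no duplicate values. With pandas, `data['english'].rank()` should give the same ranks as floats.

```python
def rank_values(values):
    """Rank values from 1 (smallest) upward; assumes no duplicates."""
    # Indices of the values, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

english = [67, 89, 88, 90, 95]
maths = [77, 86, 98, 95, 87]

print(rank_values(english))  # [1, 3, 2, 4, 5]
print(rank_values(maths))    # [1, 2, 5, 4, 3]
```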

Formula:

  • There are two formulas to calculate Spearman’s correlation.
  1. If there are no duplicates in the dataset then we use the following formula:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where ‘di’ is the difference between the two ranks of observation i and ‘n’ is the total number of observations.

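2. If there are duplicates in the dataset, tied values are usually given the average of the ranks they occupy, and we use the following formula, which is the Pearson correlation computed on the ranks:

$$\rho = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$

where ‘xi’ and ‘yi’ are the ranks of observation i in each variable, and x̄ and ȳ are the mean ranks.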
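
For data with duplicates, a common approach is to give tied values the average of the ranks they span and then compute the Pearson correlation of the ranks. The sketch below illustrates that idea in plain Python. The function names and the tied marks are hypothetical and are not taken from the article.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the run of equal values that begins at position `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start..end hold ranks start+1..end+1; use their mean.
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_with_ties(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical marks with a tie in each subject (not the article's data).
english_tied = [67, 89, 89, 90, 95]
maths_tied = [77, 86, 98, 95, 86]
print(average_ranks(english_tied))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(spearman_with_ties(english_tied, maths_tied))
```

When there are no ties, this version agrees with the simpler first formula. It is undefined if either variable is constant, because the denominator is then zero.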

Calculating the Spearman's correlation:

  • First, we need to calculate d (the difference between the two ranks) and d2 (the square of d).
data['d'] = data['english_rank'] - data['maths_rank']
data['d2'] = data['d']**2
  • We have now calculated d and d2.
data
Dataset with d and d2

Here, we calculate Spearman’s correlation using the first formula.

n = len(data.index)  # total number of observations
sc = 1 - (6 * data['d2'].sum()) / (n * (n**2 - 1))
  • The variable ‘sc’ stores the Spearman's correlation score.
# sc measures the strength of the relationship between the ranks of the two features.
sc
output: 0.30000000000000004
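
As a check on the arithmetic, the rank differences for the five students are d = (0, 1, −3, 0, 2), so the sum of the squared differences is 0 + 1 + 9 + 0 + 4 = 14 and n = 5. Substituting into the first formula:

$$\rho = 1 - \frac{6 \times 14}{5\,(5^2 - 1)} = 1 - \frac{84}{120} = 0.3$$

The trailing digits in the printed value come from floating-point rounding, not from the formula.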

Implementation using Scipy library:

  • We can also compute Spearman's correlation with the SciPy library.
  • We import spearmanr from the scipy.stats module and the SelectKBest class from sklearn.feature_selection.
# SelectKBest is used to select k best features.

from sklearn.feature_selection import SelectKBest
from scipy.stats import spearmanr
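  • SelectKBest selects the k best features on the basis of a score function (here, our score function is spearmanr, which is a statistical function, not a classifier). This works here because X has a single column. With several columns, spearmanr returns a full correlation matrix rather than one score per feature.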
skb = SelectKBest(score_func=spearmanr, k=1)
  • Split our data into X and y.
  • Here, X is the input variable and y is the output variable.
X = data[['english']]
y = data['maths']
  • Fit the selector to X and y.
skb.fit(X, y)
output: SelectKBest(k=1, score_func=<function spearmanr at 0x7f3d563c15f0>)
  • Now, here is the Spearman's correlation score.
skb.scores_
output: array(0.3)
  • We can see that the score calculated earlier with the formula matches the score returned by spearmanr, up to floating-point rounding.
  • Here is my complete notebook on Spearman’s Correlation. Click here
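
If only the score is needed, spearmanr can also be called directly on the two columns, without going through SelectKBest. The short sketch below reuses the marks from the example as plain lists. spearmanr returns both the correlation coefficient and a p-value.

```python
from scipy.stats import spearmanr

english = [67, 89, 88, 90, 95]
maths = [77, 86, 98, 95, 87]

# spearmanr returns the correlation coefficient and a p-value.
rho, p_value = spearmanr(english, maths)

print(round(rho, 4))  # 0.3
print(p_value)
```

With only five observations, the p-value should be read with caution.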

Summary:

  • We have understood what Spearman’s Correlation is.
  • We have also understood when and how to use it.

Python Developer, Data Science Enthusiast, Exploring in the field of Machine Learning and Data Science. https://www.linkedin.com/in/swapnil-bobe-b2245414a/