Spearman’s Correlation
Mar 7, 2021 · 3 min read
- Spearman's Correlation can be used as a feature selection method.
- Spearman's Correlation measures the strength and direction of the monotonic relationship between two variables.
What is a Monotonic Relationship?
- In a monotonic relationship, as the value of one variable increases, the value of the other variable either consistently increases or consistently decreases, but not necessarily at a constant (linear) rate.
- See the image below for an illustration.
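To make the idea concrete, here is a minimal sketch with made-up data (not from the original notebook). `y` always grows when `x` grows, but along an exponential curve rather than a straight line, so a rank-based measure should see a perfect relationship while Pearson's linear correlation should not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows with x, but much faster than a straight line.
x = np.arange(1, 11)
y = np.exp(x)

rho, _ = spearmanr(x, y)  # rank-based: only the ordering of values matters
r, _ = pearsonr(x, y)     # value-based: measures how well a line fits

print(f"Spearman: {rho:.3f}")
print(f"Pearson:  {r:.3f}")
```

Because the ordering of `y` matches the ordering of `x` exactly, Spearman's correlation comes out at 1, while Pearson's is noticeably lower.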
Mathematics Behind Spearman’s Correlation
- Spearman's Correlation is based on the ranks of the values, not on the values themselves.
- So we first need to assign a rank to every value of each variable.
- Consider the following example.
Example:
- We create a small student dataset that contains the students' marks.
import numpy as np
import pandas as pd
english = np.array([67, 89, 88, 90, 95])
maths = np.array([77, 86, 98, 95, 87])
d = {'english': english, 'maths': maths}
- Using the above dictionary, we create a pandas DataFrame.
data = pd.DataFrame(d)
data
- Below is our students' data frame.
- Now we assign a rank to each value within its column, in increasing order: the smallest value gets rank 1.
- So, we have created the ranks for each column.
english_rank = np.array([1, 3, 2, 4, 5])
maths_rank = np.array([1, 2, 5, 4, 3])
- Add these ranks to our data frame.
data['english_rank'] = english_rank
data['maths_rank'] = maths_rank
data
- This is our final data frame.
- See the above table to understand how we have ranked the variables.
- The dataset we have considered does not contain duplicated values, so here we can use the first formula.
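The ranks above were typed in by hand. As a small sketch, the same ranks can be derived with the pandas `rank()` method, which avoids mistakes on larger datasets (the `method='average'` choice only matters when there are ties, which this dataset does not have).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'english': np.array([67, 89, 88, 90, 95]),
    'maths': np.array([77, 86, 98, 95, 87]),
})

# rank() gives 1 to the smallest value; method='average' assigns tied
# values the mean of the ranks they span (there are no ties here).
data['english_rank'] = data['english'].rank(method='average')
data['maths_rank'] = data['maths'].rank(method='average')
print(data)
```

Note that `rank()` returns floats (1.0, 3.0, ...), which makes no difference to the calculation that follows.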
Formula:
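- There are two formulas for calculating Spearman's correlation.
1. If there are no duplicates (tied values) in the dataset, we use the following formula:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where 'd_i' is the difference between the two ranks of observation i and 'n' is the total number of observations.
2. If there are duplicates in the dataset, tied values share the average of their ranks, and we use the following formula, which is Pearson's correlation applied to the ranks:
$$\rho = \frac{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)(R(y_i) - \bar{R}_y)}{\sqrt{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)^2 \, \sum_{i=1}^{n} (R(y_i) - \bar{R}_y)^2}}$$
where R(x_i) and R(y_i) are the ranks of observation i, and \bar{R}_x and \bar{R}_y are the mean ranks.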
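The case with duplicates is easier to see on data. The sketch below uses made-up marks (not from the original notebook) in which one value repeats. It assumes the usual convention that tied values receive the average of their ranks, computes Pearson's correlation on those ranks, and compares the result with SciPy's `spearmanr` and with the rank-difference shortcut 1 - 6·Σd² / (n(n² - 1)).

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical marks with a duplicate (20 appears twice in x).
x = np.array([10, 20, 20, 30, 40])
y = np.array([1, 3, 2, 5, 4])

rx = rankdata(x)  # default method='average': tied values share a mean rank
ry = rankdata(y)

rho_ties, _ = pearsonr(rx, ry)  # Pearson's correlation of the ranks
rho_scipy, _ = spearmanr(x, y)  # SciPy's result, for comparison

# The no-duplicates shortcut, applied anyway to show that it drifts.
n = len(x)
shortcut = 1 - 6 * np.sum((rx - ry) ** 2) / (n * (n ** 2 - 1))

print(rho_ties, rho_scipy, shortcut)
```

The first two numbers agree, while the shortcut gives a slightly different value, which is why the rank-difference formula is reserved for data without duplicates.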
Calculating the Spearman's correlation:
- First, we need to calculate d (the difference between the two ranks) and d2 (its square).
data['d'] = data['english_rank'] - data['maths_rank']
data['d2'] = data['d'] ** 2
- We have now calculated d and d2.
data
Here, we are calculating spearman’s correlation using the first formula.
sc = 1 - (6*data['d2'].sum() / ( len(data.index) * ( len(data.index)**2 -1)) )
- The variable 'sc' stores the Spearman's correlation score.
# sc gives the strength of the relationship between the ranks of the two features.
sc
output: 0.30000000000000004
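The manual steps above (rank, take differences, square, plug into the first formula) can be wrapped in a small function. This is only a sketch: `spearman_no_ties` is a name made up here, and it is valid only when neither input contains duplicates.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_no_ties(x, y):
    """Spearman's correlation via the rank-difference formula (assumes no ties)."""
    rx = rankdata(x)          # ranks of x, smallest value gets rank 1
    ry = rankdata(y)          # ranks of y
    d2 = (rx - ry) ** 2       # squared rank differences
    n = len(rx)
    return 1 - 6 * d2.sum() / (n * (n ** 2 - 1))

english = np.array([67, 89, 88, 90, 95])
maths = np.array([77, 86, 98, 95, 87])

score = spearman_no_ties(english, maths)
print(score)
```

For the student data this reproduces the score of roughly 0.3 obtained above.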
Implementation using Scipy library:
- We can also use the Spearman's correlation implementation from the SciPy library.
- We import spearmanr from the scipy.stats module and the SelectKBest class from sklearn.feature_selection.
# SelectKBest is used to select the k best features.
from sklearn.feature_selection import SelectKBest
from scipy.stats import spearmanr
- SelectKBest selects the k best features on the basis of a scoring function (here our scoring function is spearmanr, which is a statistical measure, not a classifier).
skb = SelectKBest(score_func=spearmanr, k=1)
- Split our data into X and y.
- Here, X is the input variable and y is the output variable.
X = data[['english']]
y = data['maths']
- Fit the selector to X and y.
skb.fit(X, y)
output: SelectKBest(k=1, score_func=<function spearmanr at 0x7f3d563c15f0>)
- Now, here is the Spearman's correlation score.
skb.scores_
output: array(0.3)
- We can see that the score we calculated earlier using the formula matches the score from the spearmanr method; the tiny difference (0.30000000000000004 vs 0.3) is only floating-point rounding.
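Passing spearmanr straight to SelectKBest works here because X has a single column. With several feature columns, spearmanr(X, y) returns a correlation matrix rather than one score per feature, so a thin wrapper is likely needed. The sketch below is one possible approach, not the original notebook's code: `spearman_score` is a hypothetical helper, and the 'science' column is made-up data added only to have a second feature to choose from.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest

def spearman_score(X, y):
    """Hypothetical helper: per-column |Spearman| scores and p-values."""
    X = np.asarray(X)
    results = [spearmanr(X[:, j], y) for j in range(X.shape[1])]
    scores = np.array([abs(rho) for rho, _ in results])
    pvalues = np.array([p for _, p in results])
    return scores, pvalues

data = pd.DataFrame({
    'english': [67, 89, 88, 90, 95],
    'science': [70, 85, 99, 96, 88],  # made-up column, not from the article
    'maths': [77, 86, 98, 95, 87],
})
X = data[['english', 'science']]
y = data['maths']

skb = SelectKBest(score_func=spearman_score, k=1)
skb.fit(X, y)

selected = X.columns[skb.get_support()].tolist()
print(skb.scores_, selected)
```

The wrapper scores by absolute value so that a strong negative monotonic relationship counts as much as a strong positive one; that is a design choice, and dropping `abs` would favour only positively related features.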
- Here is my complete notebook on Spearman’s Correlation. Click here
Summary:
- We have understood what Spearman's Correlation is.
- We have also understood when and how to use it.