Pearson’s Correlation

Swapnilbobe
Analytics Vidhya
Published in
5 min readMar 6, 2021

--

  • Pearson’s Correlation is the Feature Selection Method.
  • It shows direction and strength between dependant and independent variables.
  • This method best suited when there is a linear relation between dependant and Independent.
  • Its value ranges from -1 to 1.
  1. -1 means there is strong -ve relation between dependant and independent.
  2. 0 means there is no relation between dependant and independent at all.
  3. 1 means there is strong +ve relation between dependant and independent.
  4. when the value of correlation is going closer to 0 then the relation becoming weaker.

Mathematics Behind Pearson’s Correlation

  • Pearson’s correlation is the ratio of covariance between x and y to the product of standard deviation of x and y.
  • first, we need to understand what is variance, standard deviation, and covariance.
  • below is the formula for Pearson's correlation.

1) Variance:

  • variance measures the variation of data from its mean.

2) Standard Deviation:

  • Standard Deviation is a square root of variance.

3) Covariance:

  • Covariance measures the linear relationship between variables.
  • Let's start the implementation of Pearson's correlation.
# importing librariesimport numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
  • following are the 3 ways to identify the relationship between the dependant and independent variables.

1) if X is increasing and y is also increasing:

  • we can say that there is strong +ve relation between dependant and independent variables.

2) if X is increasing and y is constant:

  • we can say that there is no relation at all between dependant and independent variables.

3) if X is decreasing and y is increasing:

  • we can say that there is strong -ve relation between dependant and independent variables.

Data Creation

  1. we are creating some raw data to understand the relation between the two.
  2. we are also going to plot linear graphs for better understanding.
# X is increasing and y is also increasing 
X1 = [1,2,3,4,5]
y1 = [1,2,3,4,5]
# if X is increasing and y is constant
X2 = [1,2,3,4,5]
y2 = [1,1,1,1,1]
# if X is decreasing and y is increasing
X3 = [5,4,3,2,1]
y3 = [1,2,3,4,5]

Now, visualize the above data

# plot for X is increasing and y is also increasing
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
plt.plot(X1, y1)
plt.title('Fig: 1.1 \nincreasing X, y increasing ')
plt.xlabel('+ ve correlation')
plt.ylabel('increasing value of y')
# plot for if X is increasing and y is constant
plt.subplot(1,3,2)
plt.plot(X2, y2)
plt.title('Fig: 1.2 \nincreasing X, y constant ')
plt.xlabel('no correlation')
plt.ylabel('constant value of y')
# plot for if X is decreasing and y is increasing
plt.subplot(1,3,3)
plt.plot(X3, y3)
plt.title('Fig: 1.3, \ndecreasing X, y increasing ')
plt.xlabel('-ve correlation')
plt.ylabel('increasing value of y')

plt.show()
  • In fig 1.1, we can observe that if the value of X is increasing the value of y is also increasing it means that there strong +ve correlation between these two.
  • In fig 1.2, we can observe that if the value of X is increasing the value of y is constant it means that there no correlation at all.
  • In fig 1.3, we can observe that if the value of X is decreasing the value of y is increasing it means that there a strong -ve correlation between these two.

Now, It's time for applying Pearson’s Correlation.

# import dataset
# here we are using housing dataset.
train_data = pd.read_csv('/content/drive/MyDrive/My Datasets/House Price/train.csv')train_data.shapeoutput:(1460, 81)
  • Here, we are not doing any kind of feature engineering, so we are selecting only integer columns and dropping rows that have null values for applying Pearson's correlation.
# getting only integer columns.

train_data.drop(train_data.select_dtypes(include='object').columns, inplace=True, axis=1)
# dropping columns which have missing value percentage greater than 60

train_data.drop(train_data.columns[train_data.isnull().mean()>0.60], inplace=True, axis=1)
train_data.shape
output:(1460, 38)
  • This heatmap shows how many null values are present in the dataset.
plt.figure(figsize=(20, 7))
sns.heatmap(train_data.isnull(), yticklabels=False, cbar=False)
plt.show()
heatmap with null values
# dropping rows which have missing values.train_data.dropna(inplace=True, axis=0)train_data.shapeoutput:(1121, 38)plt.figure(figsize=(20, 7))
sns.heatmap(train_data.isnull(), yticklabels=False, cbar=False)
plt.show()
heatmap with no null values
  • we have cleaned our dataset till now. Time to use Pearson's correlation.
from sklearn.feature_selection import f_regression, SelectKBest

# f_regression method used for pearson's correlation.
# SelectKBest method used to select top k best features.
  • Assigning dependent variables to variable X and.
  • Assigning an independent variable to variable y.
X = train_data.drop(['SalePrice'], axis=1)
y = train_data['SalePrice']
  • here we are creating an object of SelectKBest class with the score function f_regression with k value 10.
skb = SelectKBest(score_func=f_regression, k=10)
  • Now, time to fit our model using variables X and y.
# fit meathod used to fit model on dataset using our score function.

skb.fit(X, y)
  • above code returns below output.
SelectKBest(k=10, score_func=<function f_regression at 0x7fded4e704d0>)
  • get_support() method returns the boolean list of True and False values. we can use this list to get our columns from the dataset. when True it considers the column otherwise it does not consider a column to return.
# get_support() returns the boolean list of columns .

col = skb.get_support()
col
output:array([False, False, False, False, True, False, True, True, False,
False, False, False, True, True, False, False, True, False,
False, True, False, False, False, True, False, False, True,
True, False, False, False, False, False, False, False, False,
False])
  • get_support(indices=True) returns the list of integers which denotes the number (position) of a particular column.
# get_support(indices=True) returns the list of k columns indices which have high pearson's correlations.

col = skb.get_support(indices=True)
col
output:
array([ 4, 6, 7, 12, 13, 16, 19, 23, 26, 27])
# scores_ returns correlation values of every feature.
skb.scores_
output:array([2.49023403e+00, 8.73950826e+00, 1.50458329e+02, 1.10639690e+02,
1.96036658e+03, 1.75866077e+01, 4.26662160e+02, 4.17465247e+02,
3.51021787e+02, 2.01096191e+02, 8.79325850e-01, 5.32479989e+01,
6.82869769e+02, 6.56137887e+02, 1.16337572e+02, 2.45763523e-03,
1.10672093e+03, 6.64373624e+01, 1.49381397e+00, 5.29173583e+02,
8.69809362e+01, 3.20295557e+01, 2.25333340e+01, 4.77935018e+02,
3.03445055e+02, 3.82561158e+02, 8.05838393e+02, 6.96288859e+02,
1.43226582e+02, 1.49551920e+02, 2.74886917e+01, 1.06092031e+00,
1.38136215e+01, 9.65457023e+00, 1.45543888e+00, 2.98365208e+00,
1.57654584e-01])
# this is our final dataset after using pearson's correlationX.iloc[:, col]
  • Here is my complete notebook on Pearson's Correlation. Click here

Summary:

  • we have learned how to use Pearson’s Correlation and also how to implement using the Sklearn library.
  • we have also seen how to use the SelectKBest method to select the K feature from a dataset.

--

--

Swapnilbobe
Analytics Vidhya

Python Developer, Data Science Enthusiast, Exploring in the field of Machine Learning and Data Science. https://www.linkedin.com/in/swapnil-bobe-b2245414a/