All About the Pearson Correlation Coefficient in Data Science

This blog explains an effective way to calculate the correlation between the features of a dataset. This helps in selecting specific features for model training (reducing the curse of dimensionality), which in turn helps improve model performance.

Harshit Dawar
The Startup
4 min read · Nov 21, 2020

In every data science project, feature engineering is a crucial step in building an effective model. In particular, it is important to select the smallest set of features that are relevant to the target variable (the output).

There are various techniques for feature selection; among them, measuring correlation is one of the best known & most widely adopted. Finding the correlation between the features of a dataset is both interesting and important.

I request all readers of this blog to read my blog on Covariance (if you haven't already). It will build your fundamentals on correlation, & it will also help you understand the drawbacks of Covariance that lead us to use the Pearson Correlation. The link to that blog is mentioned below.

Pearson Correlation Coefficient!

It is just a number that ranges between -1 & 1, but it is very helpful in selecting the relevant features from a dataset.

It builds on the concept of Covariance, but it additionally uses 2 terms, the standard deviations of the two features, to express how strongly the 2 features are correlated.

It is calculated by the formula given below:

Pearson Coefficient Formula! [Image by Author]

“sigma” in the above image corresponds to the standard deviation.
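Written out as text, the standard definition of the Pearson correlation coefficient, which matches the description above, is sketched below. The symbols r, X, Y, x_i, and y_i are notation chosen here and may differ from the ones used in the image.

```latex
r_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
        = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing the covariance by the two standard deviations cancels the units of both features, which is why the result always lies between -1 & 1.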

Covariance can only tell us the direction in which the features are related, i.e. positive, negative, or zero. Its magnitude depends on the units of the features, so it says little about strength. The Pearson correlation, being normalized by the standard deviations, also tells us how strongly the 2 features are correlated.
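The difference can be seen in a minimal sketch, assuming NumPy is available. The feature values and the helper name `pearson_r` are made up for illustration: converting one feature from metres to centimetres multiplies the covariance by 100, while the Pearson coefficient stays the same.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance and population standard deviations (both use ddof=0)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / (x.std() * y.std())

# Hypothetical feature values, invented for this example
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 69.0, 75.0, 88.0])
height_cm = height_m * 100  # the same heights in another unit

cov_m = np.cov(height_m, weight_kg, ddof=0)[0, 1]
cov_cm = np.cov(height_cm, weight_kg, ddof=0)[0, 1]
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)

print(f"covariance with metres:      {cov_m:.3f}")
print(f"covariance with centimetres: {cov_cm:.3f}")
print(f"Pearson r with metres:       {r_m:.4f}")
print(f"Pearson r with centimetres:  {r_cm:.4f}")
```

The hand-written `pearson_r` should agree with NumPy's built-in `np.corrcoef(height_m, weight_kg)[0, 1]`, which is the more convenient call in practice.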

Understanding the Pearson Coefficient Value!

The value of the Pearson Correlation Coefficient varies between -1 & 1. A value of 0 means there is no linear correlation, a value towards -1 means that the features are negatively correlated, & a value towards 1 means that the features are positively correlated.

An example of each case is shown below with a diagram (the link to the code that creates the synthetic data for these graphs is present at the end of this blog).

Negative Correlation!

Pearson Correlation Coefficient = -1.0 [Image by Author!]

The above image illustrates an arrangement of feature values that results in a perfectly negative correlation, which means that as the value of one feature increases, the value of the other feature decreases along a straight line.
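A case like this can be reproduced with a small sketch, assuming NumPy. This is not the original script behind the graph; the data is hypothetical and simply places every point on a falling straight line.

```python
import numpy as np

# Hypothetical synthetic data: y falls by exactly 2 for every unit increase in x
x = np.arange(1, 11, dtype=float)
y_neg = 20.0 - 2.0 * x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(f"r = {r_neg:.4f}")
```

Because every point lies exactly on the line, the coefficient comes out at -1 up to floating-point rounding.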

Random Correlation between -1 & 1!

Pearson Correlation Coefficient = 0.59 [Image by Author!]

The image above illustrates feature values that produce a moderately positive correlation, i.e. as the value of one feature increases, the value of the other feature tends to increase, though not for every pair of points.
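An in-between value appears when a trend is mixed with noise. The sketch below assumes NumPy and uses hypothetical data with an arbitrary noise level and seed; it is not the original script and is not tuned to reproduce the exact 0.59 from the graph.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical synthetic data: an upward trend plus random noise
x = rng.uniform(0.0, 10.0, size=200)
noise = rng.normal(loc=0.0, scale=4.0, size=200)
y_noisy = x + noise

r_noisy = np.corrcoef(x, y_noisy)[0, 1]
print(f"r = {r_noisy:.2f}")  # positive, but clearly below 1
```

Increasing `scale` makes the noise dominate and pushes the coefficient towards 0, while decreasing it pushes the coefficient towards 1.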

Positive Correlation!

Pearson Correlation Coefficient = 1.0 [Image by Author!]

The image above illustrates feature values that produce a perfectly positive correlation, i.e. as the value of one feature increases, the value of the other feature increases along a straight line.
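One detail worth checking in code is that a coefficient of 1.0 says nothing about how steep the line is. In this sketch (NumPy assumed, data hypothetical), a steep line and a shallow line both give a coefficient of 1.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)

# Two hypothetical features that rise with x at very different rates
y_steep = 3.0 * x + 5.0
y_shallow = 0.1 * x

r_steep = np.corrcoef(x, y_steep)[0, 1]
r_shallow = np.corrcoef(x, y_shallow)[0, 1]
print(f"steep line:   r = {r_steep:.4f}")
print(f"shallow line: r = {r_shallow:.4f}")
```

The coefficient measures how closely the points follow a straight line, not the slope of that line.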

Advantages of the Pearson Correlation Coefficient!

  • It provides insight into how strongly the features are correlated.
  • It is highly effective in selecting features.
  • It is very easy to calculate.
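As an illustration of how little code the calculation needs, the sketch below computes the coefficient for every pair of features in one call. It assumes NumPy; the feature names (`area`, `rooms`, `age`) and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical feature matrix: 100 samples (rows) x 3 features (columns)
area = rng.uniform(50.0, 200.0, size=100)
rooms = area / 25.0 + rng.normal(0.0, 0.5, size=100)  # closely tied to area
age = rng.uniform(0.0, 40.0, size=100)                 # generated independently
X = np.column_stack([area, rooms, age])

# rowvar=False tells NumPy that each column, not each row, is a variable
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

The result is a symmetric 3x3 matrix with 1.0 on the diagonal, since every feature is perfectly correlated with itself.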

Significance of the Pearson Correlation Coefficient!

After calculating the Pearson correlation coefficient, we can decide which features to use while training the model.

For example, if two features have a very strong correlation, say a Pearson correlation coefficient of 1.0, 0.99, or otherwise close to 1.0, one of them can be dropped because the two carry almost the same information.
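One possible way to automate this is sketched below, assuming NumPy. The helper `select_uncorrelated`, the 0.95 threshold, and the feature names are all hypothetical choices made for illustration. The sketch also compares the absolute value of the coefficient, so a feature pair close to -1 is treated as redundant too, which goes slightly beyond the example above.

```python
import numpy as np

def select_uncorrelated(X, names, threshold=0.95):
    """Greedily keep a feature only if |r| with every already-kept feature is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

rng = np.random.default_rng(seed=2)

# Hypothetical features: temp_f is temp_c in another unit, humidity is independent
temp_c = rng.uniform(-10.0, 35.0, size=100)
temp_f = temp_c * 9.0 / 5.0 + 32.0
humidity = rng.uniform(20.0, 90.0, size=100)

X = np.column_stack([temp_c, temp_f, humidity])
selected = select_uncorrelated(X, ["temp_c", "temp_f", "humidity"])
print(selected)
```

Here `temp_f` is dropped because it is a linear function of `temp_c`, while `humidity` is kept. The order of the columns decides which feature of a correlated pair survives, so in a real project that choice may deserve more thought, for instance keeping the feature that is more strongly correlated with the target.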

I hope this article explains the topic along with the concepts behind it. Thank you so much for investing your time in reading my blog & boosting your knowledge. If you like my work, then I request you to give this blog a clap!

Harshit Dawar

AIOPS Engineer, have a demonstrated history of delivering large and complex projects. 14x Globally Certified. Rare & authentic content publisher.