Pearson Correlation, a Mathematical Understanding!

Vinay Singh
6 min read · Jan 18, 2019

I have always noticed that getting just the idea of a method, without understanding the mathematics behind it, makes things a lot more difficult than we expect. And picking up the concept in one place and then hunting for worked mathematical examples somewhere else is time-consuming and sometimes frustrating (believe me, it is :P).

Hence, here I’m trying to bring both aspects onto the same page, along with some examples of correlations linked at the end of the article.

So, first question first: WHAT IS CORRELATION?

It’s a measure of how well two variables are related to each other. A correlation can be positive or negative.

Positive Correlation: It refers to the extent to which the two variables increase or decrease in parallel (think of this as directly proportional*: when one increases the other increases, and when one decreases the other follows).

Negative Correlation: It refers to the extent to which one of the two variables increases as the other decreases (think of this as inversely proportional*: when one increases the other decreases, and vice versa).

Data representation in correlation cases

*The proportionality analogy above is only meant to give you a feel for the relation between the two variables.

The most common correlation in statistics is the Pearson correlation.

WHAT IS PEARSON CORRELATION?

The full name is the Pearson Product Moment Correlation (PPMC). In layman’s terms, it’s a number between -1 and +1 that represents how strongly two variables are associated. Or, to put it in simpler words, it measures the strength of the linear association between two variables.

But what does it really represent mathematically?

Basically, a Pearson Product Moment Correlation (PPMC) attempts to draw a line of best fit through the data of the two given variables, and the Pearson correlation coefficient “r” indicates how far away all these data points are from that line.

The value of “r” ranges from +1 to -1 where:

  • r = +1 or r = -1 means that all our data points lie exactly on the line of best fit, i.e. no data point shows any variation from it.
Data points with r=1
  • Hence, the stronger the association between the two variables, the closer r will be to +1/-1.
  • r = 0 means that there is no linear correlation between the two variables.
  • The values of r between +1 and -1 indicate that there is a variation of data around the line.
  • The closer the values of r to 0, the greater the variation of data points around the line of best fit.
Data points with -1 < r < +1
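
To make the cases above concrete, here is a minimal Python sketch. The helper `pearson_r` and all the data values are made up for illustration (they are not from any dataset in this article); the helper uses the mean-centred form of the coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from mean-centred deviations.

    Assumes both lists have the same length and neither is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariation / spread

x = [1, 2, 3, 4, 5, 6]                      # made-up values
on_rising_line = [2 * v + 1 for v in x]     # every point on y = 2x + 1
on_falling_line = [10 - 3 * v for v in x]   # every point on y = 10 - 3x
scattered = [3, 4, 9, 8, 14, 12]            # rising trend with some scatter

r_up = pearson_r(x, on_rising_line)
r_down = pearson_r(x, on_falling_line)
r_scattered = pearson_r(x, scattered)
print(round(r_up, 4), round(r_down, 4), round(r_scattered, 4))
```

The first two values come out as 1.0 and -1.0 (up to floating-point rounding), while the scattered data lands strictly between 0 and 1.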

WHAT TYPE OF VARIABLES CAN WE USE?

We cannot use just any type of variable to calculate the Pearson correlation. The two variables have to be measured on either an interval or a ratio scale. However, both variables do not need to be measured on the same scale (e.g., one variable can be ratio and the other interval). There is also no restriction on the units in which the two variables are measured. For example, you could correlate a person’s age with their blood sugar level. Here, the units are completely different: age is measured in years and blood sugar level is measured in mmol/L (a measure of concentration).

It is also important to realize that the value of Pearson’s coefficient, r, is not a measure of the slope of the line (i.e., the line of best fit). We can see an example in the plot above with r=1.
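
A small sketch of both points, using made-up numbers: two lines with very different slopes give the same r, and converting one variable to other units leaves r unchanged. The helper `pearson_r` is hypothetical, and the factor of 18 for converting glucose from mmol/L to mg/dL is an approximation assumed here only to show a change of units.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r (equal-length, non-constant inputs assumed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = [20, 30, 40, 50, 60]            # made-up ages in years

# Same r, very different slopes.
shallow = [0.1 * a for a in ages]      # slope 0.1
steep = [10 * a for a in ages]         # slope 10
r_shallow = pearson_r(ages, shallow)
r_steep = pearson_r(ages, steep)

# Same r, different units for the second variable.
glucose_mmol = [4.8, 5.6, 5.1, 6.0, 5.9]        # made-up readings, mmol/L
glucose_mgdl = [g * 18 for g in glucose_mmol]   # approximate mg/dL (assumed factor)
r_mmol = pearson_r(ages, glucose_mmol)
r_mgdl = pearson_r(ages, glucose_mgdl)

print(round(r_shallow, 4), round(r_steep, 4))
print(round(r_mmol, 4), round(r_mgdl, 4))
```

Rescaling a variable by a positive constant stretches the plot but does not move the points relative to their own line of best fit, which is why r stays the same.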

Formula of Pearson Correlation coefficient:
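
The formula itself did not survive as text here. Judging from the sums that the worked example asks for, it is presumably the standard computational form of Pearson's r, where n is the sample size and X, Y are the paired observations:

```latex
r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}
         {\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]
                \left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```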

An example with calculating Pearson Coefficient:

Find the value of the correlation coefficient from the following table:

Age and Glucose levels of 6 subjects

We’ll calculate the value of r using the formula mentioned above. To use that formula, we need to compute Σ(X*Y), Σ(X), Σ(Y), Σ(X²) and Σ(Y²).

The table below shows the computed values of all the summations mentioned above.

From our table we get:

  • Σ(X) = 247
  • Σ(Y) = 486
  • Σ(X*Y) = 20,485
  • Σ(X²) = 11,409
  • Σ(Y²) = 40,022
  • n is the sample size, in our case = 6

r = [6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}

r = 2,868 / √(7,445 × 3,936) = 2,868 / √29,303,520 ≈ 2,868 / 5,413.27

r = 0.5298.

The correlation coefficient ranges from -1 to +1. Our result is 0.5298, which means the variables have a moderate positive correlation.
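
As a quick check of the arithmetic, the calculation can be reproduced in a few lines of Python. This sketch plugs in only the sums listed above (the variable names are my own) and assumes the usual computational form of the formula.

```python
import math

# Sums taken from the worked example above.
n = 6
sum_x, sum_y = 247, 486
sum_xy = 20_485
sum_x2, sum_y2 = 11_409, 40_022

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(numerator)     # 2868
print(round(r, 4))   # 0.5298
```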

Potential problems with Pearson correlation

The Pearson product-moment correlation does not take into consideration whether a variable has been classified as a dependent or independent variable. It treats all variables equally.

Example-1 If we are trying to find the correlation between a high-calorie diet and diabetes, we might find a high correlation of .8. However, we would get exactly the same result with the variables switched around. In other words, the number alone would equally allow us to say that diabetes causes a high-calorie diet.

Example-2 We might want to find out whether basketball performance is correlated with a person’s height. We might, therefore, plot a graph of performance against height and calculate the Pearson correlation coefficient. Let’s say, for example, that r = .67. That is, as height increases, so does basketball performance. This makes sense. However, if we plotted the variables the other way around and wanted to determine whether a person’s height was determined by their basketball performance (which makes no sense), we would still get r = .67. This is because the Pearson correlation coefficient takes no account of any theory behind why you chose the two variables to compare. This is illustrated below:

Therefore, as researchers, we have to be aware of the data we are plugging in.
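
The symmetry is easy to see in code. In this sketch the helper `pearson_r` is hypothetical and the heights and scores are invented purely for illustration (they will not reproduce the r = .67 quoted above); swapping the arguments changes nothing.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r (equal-length, non-constant inputs assumed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights_cm = [170, 175, 180, 185, 190, 195]   # invented heights
points_per_game = [8, 12, 9, 15, 14, 18]      # invented performance scores

r_height_vs_points = pearson_r(heights_cm, points_per_game)
r_points_vs_height = pearson_r(points_per_game, heights_cm)

# The coefficient has no notion of "cause" and "effect".
print(math.isclose(r_height_vs_points, r_points_vs_height))
```

Nothing in the formula distinguishes X from Y, so deciding which variable is the plausible cause is left entirely to the researcher.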

Real Life Example

Pearson correlation is used in thousands of real-life situations. For example, scientists in China wanted to know whether there was a relationship in how weedy rice populations differ genetically. The goal was to find out the evolutionary potential of the rice. Pearson’s correlation between the two groups was analyzed. It showed a positive Pearson Product Moment correlation of between 0.783 and 0.895 for weedy rice populations. This figure is quite high, which suggested a fairly strong relationship.

Some more examples of positive and negative correlation can be found below:

Images/Graphs Source: Google Images

Vinay Singh

Software Engineer, Goldman Sachs, India. MS by Research in Computer Science from IIIT-Hyderabad.