Pearson’s Correlation

Detailed information and calculation of Pearson’s Correlation using Excel, Python, R and SPSS

Suresha HP
Analytics Vidhya
8 min read · Feb 6, 2021



What is Pearson Correlation?

Pearson Correlation, also called the Pearson Product-Moment Correlation (PPMC) or bivariate correlation, is the standard measure of correlation in statistics. It measures the linear relationship between two sets of data. In simple terms, it answers the question: can I draw a line graph to represent the data?

The Pearson correlation is expressed by two symbols: the Greek letter rho (ρ) for a population, and the letter “r” for a sample.

To find the relationship between variables in the data, correlation coefficient formulas are used. The formulas return a value ranging from -1 to 1, where:

  • 1 indicates a perfect positive relationship.
  • -1 indicates a perfect negative relationship.
  • 0 indicates no relationship at all.

Figure: examples of different correlations.

A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other variable.

A coefficient of zero means that a rise in one variable is accompanied by neither a positive nor a negative change in the other; the two are not linearly related.

A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other variable. The absolute value of the correlation coefficient gives us the strength of the relationship: the bigger the number, the stronger the relationship. For instance, |-.75| = .75, which indicates a stronger relationship than .65.

Correlation doesn't Imply Causation!


Correlation is the degree to which two variables are linearly related. In bi-variate data analytics, this is an important step. In its broadest sense, correlation is any statistical association, causal or not, between two random variables in bivariate data.
An important rule to note is that correlation does not imply causation.
Let’s understand what it really means with two examples.
Ice-cream consumption increases during the summer months, so there is a close correlation between temperature and the sales of ice-cream units. In this particular case we see a causal relationship, as the intense summers push up the sale of ice creams.
Sales of ice-creams also have a clear correlation with attacks by sharks.
As we can see clearly here, the shark attacks are most definitely not caused by ice-creams; both simply rise during summer. So, there is no causation here.
Hence, we can understand that correlation doesn’t ALWAYS imply causation!

Inference of Pearson’s correlation

Only a linear relationship between two continuous variables can be tested by the Pearson correlation (a relationship is linear only when a change in one variable is associated with a proportional change in the other variable).
For example, the Pearson correlation may be used to determine whether an increase in age is associated with an increase in blood pressure.
The Pearson coefficient of correlation (r) varies with the intensity and the direction of the relationship between the two variables. Note that the Pearson coefficient yields a value of zero when no linear relationship can be formed, even if the variables are related in a non-linear way.

What is the correlation coefficient?

The coefficient of correlation is a statistical measure of the strength of the relationship between the relative movements of two variables. Its values range between -1.0 and 1.0. A correlation of -1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation. A correlation of 0.0 indicates no linear relationship between the movements of the two variables.


Scatterplots

To take a first look at our dataset, a good way to start is to plot pairs of continuous variables, one on each axis. Each point on the graph corresponds to a row of the dataset.

Scatterplots give us a sense of the overall relationship between two variables:

  • Direction: is the relationship positive or negative? When one variable increases, does the second one increase or decrease?
  • Strength: how much does one variable increase when the second one increases?
  • Shape: is the relationship linear, quadratic, exponential…?

Scatterplots are also a fast technique for detecting outliers: if a value is widely separated from the rest, checking the values for that individual observation will be useful.

We will go with the most used dataset when studying machine learning, Iris, which contains measurements of iris plant flowers; the objective is to classify the flowers into three groups (setosa, versicolor, virginica).

Scatter plot of two iris dataset variables
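As a minimal sketch, a plot like this can be reproduced in Python with seaborn, assuming its bundled copy of the iris dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset bundled with seaborn
iris = sns.load_dataset("iris")

# Petal length vs. petal width, one color per species
sns.scatterplot(data=iris, x="petal_length", y="petal_width", hue="species")
plt.show()
```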

The objective of the iris dataset is to classify the distinct types of iris from the data we have. To deliver the best approach to this problem, we want to analyze all the variables available and their relations.

In the last plot we have the petal length and width variables, with the distinct classes of iris separated by color. What we can extract from this plot is:

  • There’s a positive linear relationship between both variables.
  • Petal length increases approximately 3 times faster than the petal width.
  • Using these 2 variables the groups are visually differentiable.

Scatter Plot Matrix

To plot all relations at the same time and on the same graph, the best approach is a pair plot: a matrix of all variables containing all the possible scatterplots.

As you can see, the plot of the last section is in the last row and third column of this matrix.

Pair plot of the iris dataset variables

In this matrix, the diagonal can show distinct plots; in this case, we used the distributions of each one of the iris classes.

Being a matrix, we have two plots for each combination of variables: the plot at (row, column) always has a mirrored counterpart at (column, row), on the other side of the diagonal.

Using this matrix we can easily obtain all the information about all the continuous variables in the dataset.
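A minimal sketch of such a pair plot with seaborn; the choice of KDE plots on the diagonal is an assumption matching the description above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# Scatterplots for every pair of variables; class distributions on the diagonal
sns.pairplot(iris, hue="species", diag_kind="kde")
plt.show()
```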

Pearson Correlation Coefficient

Scatter plots are an important tool for analyzing relations, but we need to check whether the relation between variables is significant. To check the linear correlation between variables we can use Pearson’s r, the Pearson correlation coefficient.

The range of the possible values of this coefficient is [-1, 1], where:

  • 0 indicates no correlation.
  • 1 indicates a perfect positive correlation.
  • -1 indicates a perfect negative correlation.

To calculate this statistic we use the following formula:

Pearson’s correlation formula:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
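As a sketch, the formula can be implemented directly with NumPy; this hand-rolled pearson_r is for illustration only, since in practice a library routine such as scipy.stats.pearsonr does the same job:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))  # ~0.85
```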

Testing the significance of the correlation coefficient

We need to check whether the correlation is significant for our data. As we have already talked about hypothesis testing, in this case:

  • H0 = the variables are unrelated, ρ = 0
  • Ha = the variables are related, ρ ≠ 0

This statistic follows a Student’s t distribution with (n - 2) degrees of freedom, where n is the number of observations.

The formula for the t value is the following; we compare the result with the critical value from the Student’s t table.

Pearson’s correlation t statistic:

t = r\sqrt{\frac{n-2}{1-r^2}}

If our result is larger in absolute value than the table value, we reject the null hypothesis and say that the variables are related.
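A minimal sketch of this test in Python: scipy.stats.pearsonr already returns the two-sided p-value, and the t statistic below is computed from the formula above (the sample values are made up for illustration):

```python
import math
from scipy import stats

# Hypothetical sample values, for illustration only
x = [1.4, 1.7, 2.3, 2.9, 3.5, 4.1]
y = [0.2, 0.3, 0.7, 1.0, 1.2, 1.4]

r, p = stats.pearsonr(x, y)

# t statistic with n - 2 degrees of freedom, from the formula above
n = len(x)
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(f"r = {r:.3f}, t = {t:.2f}, p-value = {p:.4f}")
```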

Coefficient of determination

To calculate how much of the variation in one variable is explained by the variation in the other, we can use the coefficient of determination, calculated as the square of the correlation coefficient, r². This measure will be very important in regression models.
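For example, continuing the sketch with the iris variables from the scatterplot section:

```python
from scipy import stats
import seaborn as sns

iris = sns.load_dataset("iris")
r, _ = stats.pearsonr(iris["petal_length"], iris["petal_width"])

# Share of the variance in one variable explained by the other
print(f"r^2 = {r ** 2:.3f}")
```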

How to perform Pearson’s correlation in Excel:
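In Excel, Pearson’s r is available through the built-in CORREL and PEARSON worksheet functions, for example =CORREL(A2:A151, B2:B151) for two columns of data (the cell ranges here are only illustrative). For several columns at once, the Correlation tool in the Data Analysis ToolPak add-in produces a full correlation matrix.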

How to perform Pearson’s correlation in Python:
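A minimal sketch with pandas, which computes Pearson correlations by default, again assuming the seaborn copy of the iris dataset; for a single pair of variables together with a p-value, scipy.stats.pearsonr (shown in the significance-testing section above) is the usual choice:

```python
import seaborn as sns

iris = sns.load_dataset("iris")

# Pairwise Pearson correlation matrix of all numeric columns
print(iris.corr(numeric_only=True))
```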

How to perform Pearson’s correlation in R:
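In R, cor(x, y) returns Pearson’s r by default, and cor.test(x, y, method = "pearson") additionally reports the t statistic, degrees of freedom, p-value, and a confidence interval.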

How to perform Pearson’s correlation in SPSS:
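In SPSS, choose Analyze > Correlate > Bivariate, move the two variables of interest into the Variables box, and make sure the Pearson checkbox is ticked; the output table reports r, the two-tailed significance, and N for each pair of variables.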

Advantages and Disadvantages of Pearson’s correlation:

Advantages:

  1. This method indicates the presence or absence of correlation between any two variables and determines the exact extent or degree to which they are correlated.
  2. Under this method, we can also ascertain the direction of the correlation, i.e., whether the correlation between the two variables is positive or negative.
  3. This method enables us to estimate the value of a dependent variable for a particular value of an independent variable through regression equations.
  4. This method has many algebraic properties, which make it easy to calculate the coefficient of correlation and related measures such as the coefficient of determination.

Disadvantages:

  1. It is comparatively difficult to calculate, as its computation involves intricate algebraic methods.
  2. It is very much affected by extreme values (outliers).
  3. It is based on a large number of assumptions, viz. a linear relationship, a cause-and-effect relationship, etc., which may not always hold good.
  4. It is very likely to be misinterpreted, particularly in the case of homogeneous data.
  5. In comparison to other methods, it takes much time to arrive at the results.
  6. It is subject to probable error, which its propounder himself admits, and therefore it is always advisable to compute its probable error while interpreting its results.

Conclusion:

In this article I tried to collect all the information about Pearson’s correlation: its uses, theory, and application using different tools.

Connect with me on LinkedIn and Medium for new articles and blogs.

— — — *— — — * — — — * — — — * — — — *— — — *— — — * —

“Develop a passion for learning. If you do, you will never cease to grow” Anthony J. D’Angelo

— — — * — — — * — — — * — — — * — — — * — — — * — — — * —
