Correlation in Statistics.

Swapnil Bandgar
Published in Analytics Vidhya
May 29, 2021

Correlation is a measure of the strength of the linear relationship between two quantitative variables (e.g., height and weight).

Sometimes two or more events are interrelated, i.e., a change in one event tends to accompany a change in the others. If such changes are expressed as numerical data and appear to be interdependent, the variables are said to be correlated. For example, the weight of the human body increases with height and age. Here age and body weight are two separate characteristics, but they vary together, so they are correlated. In general, if two or more variables are so related that an increase in one is accompanied by a consistent change in the others, they are said to be correlated. Correlation alone does not show that one variable causes the change in the other.

Correlation Coefficient

The correlation coefficient, r, is a summary measure that describes the extent of the statistical relationship between two interval or ratio level variables. The correlation coefficient is scaled so that it always lies between -1 and +1. When r is close to 0, there is little linear relationship between the variables; the farther r is from 0, in either the positive or negative direction, the stronger the relationship between the two variables.

Types of Correlation:

Depending upon the direction and proportion of changes in the variables and the number of data series, the correlation may be of the following types:

1) Positive and negative correlations.

2) Linear and non-linear correlations.

3) Simple, multiple, and partial correlations.

Methods of Determining Correlation:

1) Graphic method:

When the values of the independent series are plotted on the O-X axis and those of the dependent series on the O-Y axis of graph paper, a linear or non-linear graph is obtained, which indicates the direction of the correlation.

If the graph lines of the two series both move upward from left to right, the correlation is positive. If the graph line of one series moves upward from left to right while that of the other moves downward from left to right, the correlation is negative.

2) Scatter diagram or Dotogram method:

This is a graphical method in which the values of the independent series are taken along the O-X axis and those of the dependent series along the O-Y axis, and each pair of values is plotted as a single dot on the graph paper.

In this way, a graph of dots is obtained, with one dot per pair of observations. Because the dots are scattered in different forms, such graphs are called scatter diagrams.

The more widely the plotted points are scattered over the chart, the lower the degree of correlation between the variables. The more closely the points cluster around a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.

The following types of scatter graphs illustrate the degree of correlation between variable X and variable Y.

1) Perfect Positive Correlation (r=+1):

Correlation is perfectly positive when all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.

2) Perfect Negative Correlation (r=-1):

Correlation is perfectly negative when all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner.

3) High Degree of Positive Correlation (r= +High):

The degree of correlation is high when the plotted points fall within a narrow band, and it is positive when they show a rising tendency from the lower left-hand corner to the upper right-hand corner.

4) High Degree of Negative Correlation (r= -High):

The degree of negative correlation is high when the plotted points fall within a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner.

5) Low Degree of Positive Correlation (r= +Low):

The correlation between the variables is said to be low but positive when the points are widely scattered over the graph and show a rising tendency from the lower left-hand corner to the upper right-hand corner.

6) Low Degree of Negative Correlation (r= -Low):

The degree of correlation is low and negative when the points are widely scattered over the graph and show a falling tendency from the upper left-hand corner to the lower right-hand corner.

7) No Correlation (r=0):

When the points are haphazardly scattered over the graph and do not show any specific pattern, the variables have no linear relationship with each other. Correlation is absent, and hence r = 0. The variables x and y are then said to be uncorrelated; this rules out a linear relationship, not every form of dependence.
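The extreme cases above can be checked numerically. The following is a minimal sketch in plain Python, using made-up data and a helper name (`pearson_r`) chosen only for illustration. It computes r for points on a rising line, points on a falling line, and points on a curve where y depends entirely on x but not linearly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]

rising = [2 * v + 1 for v in x]    # every point on a rising straight line
falling = [-2 * v + 1 for v in x]  # every point on a falling straight line
curved = [v ** 2 for v in x]       # y is determined by x, but not linearly

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_curved = pearson_r(x, curved)

print(r_rising, r_falling, r_curved)
```

With this data the first two values come out as +1 and -1, and the third as 0 even though y is a function of x. That is why r = 0 is best read as "no linear relationship" and not as proof that the variables are unrelated.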

Spearman’s ranking method:

Professor Charles Spearman worked out a method for determining correlation in which the values of each series are assigned ranks in decreasing or increasing (ascending) order. When ranking in decreasing order, the highest value is given rank 1, the next highest value rank 2, and so on. In some series two or more values are equal; such tied values are each given the average of the ranks they would otherwise occupy.

The difference between the ranks (D) of the corresponding values of the two series is obtained (D = R1 - R2); it may be positive or negative. The values of D² and their sum (∑D²) are then determined, and the rank correlation coefficient is computed as 1 - 6∑D² / (N(N² - 1)), where N is the number of pairs of observations.
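The ranking steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names and the marks data are hypothetical, the highest value is given rank 1, tied values share the average rank, and the coefficient uses the common shortcut formula 1 - 6∑D² / (N(N² - 1)), which is exact only when there are no ties.

```python
def ranks_descending(values):
    """Rank values so the highest gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the run of equal values starting at this sorted position.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + 1 + end + 1) / 2  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks

def spearman_rho(series_1, series_2):
    """Rank correlation via 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(series_1)
    r1 = ranks_descending(series_1)
    r2 = ranks_descending(series_2)
    d = [a - b for a, b in zip(r1, r2)]   # D = R1 - R2
    sum_d2 = sum(v * v for v in d)        # sum of D squared
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical marks of five students in two subjects.
marks_maths = [85, 60, 73, 40, 90]
marks_physics = [93, 75, 65, 50, 80]

rho = spearman_rho(marks_maths, marks_physics)
print(ranks_descending(marks_maths), ranks_descending(marks_physics), rho)
```

For this data the ranks are 2, 4, 3, 5, 1 and 1, 3, 4, 5, 2, so ∑D² = 4 and the coefficient works out to 0.8.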

Karl Pearson’s method:

Karl Pearson’s coefficient of correlation is the most widely used measure for expressing the relationship between two variables. It gives both the degree and the direction of the relationship. However, it has the following limitations:

(1) It is based on the assumption of a linear relationship between the variables.

(2) Its computation is more laborious than that of other methods.

(3) The correlation coefficient is highly influenced by extreme pairs of observations.

(4) The correlation coefficient is often difficult to interpret correctly.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by “r”.
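As a rough illustration of the computation and of limitation (3), the sketch below calculates r from raw sums using the standard computational form r = (N∑xy - ∑x∑y) / √((N∑x² - (∑x)²)(N∑y² - (∑y)²)). The height and weight figures are invented for the example, and the function name is an assumption, not something from a particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (computational form of the formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Invented data: heights (cm) and weights (kg) of six people.
heights_cm = [150, 155, 160, 165, 170, 175]
weights_kg = [52, 56, 59, 64, 68, 71]
r_clean = pearson_r(heights_cm, weights_kg)

# Add a single extreme pair (also invented) and recompute.
r_with_outlier = pearson_r(heights_cm + [180], weights_kg + [40])

print(r_clean, r_with_outlier)
```

On this data r is close to +1 for the six original pairs and falls to nearly 0 once the single extreme pair is added, which shows how strongly one unusual observation can pull the coefficient.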

Coefficient of correlation by concurrent deviation:

This method is used to indicate whether the correlation is positive or negative, especially in data series characterized by short-term fluctuations.

· The direction of deviation [positive (+) or negative (-)] of each observation relative to the preceding observation is marked for each series in a separate column.

· The deviation signs of the corresponding observations of the two series are multiplied (+ × + = +, + × - = - and - × - = +) and the products are recorded in a separate column.

· The total number of positive signs in the column for products of deviation signs is recorded; this is the number of concurrent deviations (C).

· The coefficient of correlation (RC) by concurrent deviation is determined by the following formula:

RC = ± √( ± (2C - N) / N )

Where C = total number of + signs in the column for products of two deviations and N = number of pairs of deviations compared, which is one less than the number of observations in a series. The sign of (2C - N) is used both inside and outside the square root, so the quantity under the root is never negative.
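The bullet steps can be sketched as follows. This is a minimal sketch that assumes the usual textbook form of the formula, ± √( ± (2C - n) / n ), with n taken as the number of deviation pairs (one fewer than the number of observations). The function names and data are hypothetical, and an unchanged value is treated as a zero deviation that never counts as concurrent, which is a choice made for this example.

```python
import math

def deviation_signs(series):
    """Sign (+1, -1 or 0) of each observation's change from the preceding one."""
    signs = []
    for prev, cur in zip(series, series[1:]):
        signs.append((cur > prev) - (cur < prev))
    return signs

def concurrent_deviation_coefficient(series_1, series_2):
    s1 = deviation_signs(series_1)
    s2 = deviation_signs(series_2)
    n = len(s1)  # number of deviation pairs = observations - 1 (assumed convention)
    # C = number of pairs whose sign product is positive (concurrent deviations).
    c = sum(1 for a, b in zip(s1, s2) if a * b > 0)
    ratio = (2 * c - n) / n
    # The same sign is used inside and outside the root, so the radicand is non-negative.
    return math.copysign(math.sqrt(abs(ratio)), ratio)

# Hypothetical series of seven observations each.
x = [10, 12, 11, 15, 18, 17, 20]   # changes: +, -, +, +, -, +
y = [5, 7, 8, 9, 12, 11, 10]       # changes: +, +, +, +, -, -

r_c = concurrent_deviation_coefficient(x, y)
print(r_c)
```

Here four of the six deviation pairs move in the same direction, so C = 4, n = 6, and the coefficient is √(2/6), roughly 0.58.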

Methods for Regression Line:

Regression lines are useful for prediction. Their purpose is to describe the relationship of the dependent variable with one or more independent variables.

Using the equation obtained from the regression line, an analyst can predict future values of the dependent variable by plugging in different values for the independent ones.

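For a regression line y = mx + b, we can calculate the slope (m) and the constant (b) using the least-squares formulas below:

m = (N ∑xy - ∑x ∑y) / (N ∑x² - (∑x)²)

b = (∑y - m ∑x) / N

Where N = number of (x, y) pairs.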
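To make the calculation concrete, here is a small sketch that fits a least-squares line y = mx + b using m = (N∑xy - ∑x∑y) / (N∑x² - (∑x)²) and b = (∑y - m∑x) / N, then uses the fitted line for a prediction. The data values and the names `fit_line`, `hours` and `sales` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

# Illustrative data: independent variable (hours) and dependent variable (sales).
hours = [2, 3, 5, 7, 9]
sales = [4, 5, 7, 10, 15]

slope, intercept = fit_line(hours, sales)

# Predict the dependent variable for a new value of the independent variable.
predicted = slope * 8 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 2))
```

For this data the slope is 249/164 (about 1.518) and the intercept about 0.305, so the prediction at 8 hours is roughly 12.45.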

Reference: Scribbr, Mathsisfun
