Correlation

Plutobot
Published in Analytics Vidhya
6 min read · Mar 19, 2020

Correlation describes how two variables are related to each other. It is an important statistical tool for bivariate analysis in data science. Correlation does not imply causation: a relationship between two variables does not mean that one causes the other, because a third factor may influence both. For instance (an example from Wikipedia), ice cream sales rise during the summer, and so do shark attacks. This does not mean that eating ice cream causes sharks to attack people. Both increase because warm weather brings more people to the beach.

Correlation is better explained using a scatter plot.

Source

In the above diagram, the first scatter plot shows a perfect positive correlation. Say there are 2 variables plotted on the X and Y axes respectively. If Y increases as X increases, they are positively correlated. The second scatter plot shows a negative correlation, meaning that when one variable increases, the other decreases. The third plot indicates no correlation because the points follow no consistent direction. The scatter plot therefore shows the direction of the relationship between two variables. In data science projects, this is useful for understanding how 2 variables are related.

Correlation coefficient: While the scatter plot shows the direction of the relationship, we also need to describe its strength. This is done with a number, the correlation coefficient, which always lies between -1 and 1. A value of 1 is a perfect positive correlation and -1 is a perfect negative correlation. Values around 0.5 or -0.5 indicate weaker positive or negative correlations. Lastly, a value close to zero means there is no linear correlation between the 2 variables in question.
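As a rough illustration of these ranges, the sketch below computes the coefficient for three tiny made-up datasets. The helper name `pearson_r` and the sample values are assumptions chosen for this example, not taken from the article's figures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_none = pearson_r(x, [1, 3, 2, 3, 1])       # no consistent direction

print(r_positive, r_negative, r_none)
```

The third dataset rises and then falls, so its upward and downward movements cancel out and the coefficient comes out at zero even though the points are not random.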

Source

Correlation computation: The most common methods to compute the correlation coefficient (denoted by r) are Pearson’s correlation coefficient, Spearman’s correlation coefficient, and Kendall’s rank correlation.

Important Side notes:

Before explaining how the correlation coefficient is computed, we need to understand the following concepts:

Probability Distribution: A probability distribution is a mathematical function that gives the probability of each possible outcome of an experiment. It covers all possible values of a random variable. A distribution has the following properties: a) every probability is between 0 and 1, b) the outcomes are mutually exclusive, and c) the outcomes are collectively exhaustive, so the probabilities sum to 1.

For example, if a fair die is rolled, the probability of getting any particular number from 1 to 6 is 1/6 ≈ 0.1667. The outcomes 1 to 6 are mutually exclusive because a single roll produces exactly one of them.

Sum of all the outcomes: the probability of rolling a 1 is 1/6, and the same holds for rolling a 2, 3, 4, 5, or 6. Adding the probabilities of all outcomes gives 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 6/6 = 1.
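The die example can be checked in a few lines of Python. This is a minimal sketch that assumes a fair six-sided die; the variable names are illustrative, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Assumed fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# a) every probability lies between 0 and 1
all_valid = all(0 <= p <= 1 for p in die.values())

# c) the outcomes are collectively exhaustive: the probabilities sum to 1
total = sum(die.values())

print(all_valid, total, float(die[1]))
```

Property b) is reflected in the dictionary itself: each roll maps to exactly one key, so no two outcomes can occur together.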

Source

There are 2 types of probability distributions: discrete and continuous. A discrete probability distribution applies when the variable is discrete, as in the die example above, or when tossing a coin.


A continuous probability distribution applies when the variable in question is a continuous random variable. The probability of any single value is zero, so we cannot tabulate probabilities as we did for the discrete probability distribution. Instead, the distribution is expressed as a density function, and the probability that the variable falls between 2 points is the area under the curve between them.

Source: For example, consider the probability density function shown in the graph below.

Suppose we want to know the probability that the random variable X is less than or equal to a. This probability equals the area under the curve from minus infinity to a, as indicated by the shaded area. It is a cumulative probability. However, the probability that X is exactly equal to a is zero: a continuous random variable can take on an infinite number of values, so the probability that it equals any specific value (such as a) is always zero.
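To make the cumulative probability concrete, here is a small sketch using `statistics.NormalDist` from the Python standard library. The choice of a standard normal density and the points `a` and `b` are assumptions for illustration, since the original graph does not name a specific distribution.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)  # assumed example density (standard normal)
a, b = 0.5, 1.5                # hypothetical points on the x-axis

p_at_most_a = X.cdf(a)             # P(X <= a): area from minus infinity to a
p_between = X.cdf(b) - X.cdf(a)    # P(a < X <= b): area between the 2 points
p_exactly_a = X.cdf(a) - X.cdf(a)  # an interval of zero width has zero area

print(round(p_at_most_a, 4), round(p_between, 4), p_exactly_a)
```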

Normal Distribution (Source): This is a type of continuous probability distribution. The shape of the normal curve depends on the mean and the standard deviation. A normal curve has the following properties:

  • The total area under the normal curve is equal to 1.
  • The probability that a normal random variable X equals any particular value is 0.
  • The probability that X is greater than a equals the area under the normal curve bounded by a and plus infinity (as indicated by the non-shaded area in the figure below).
  • The probability that X is less than a equals the area under the normal curve bounded by a and minus infinity (as indicated by the shaded area in the figure below).

Additionally, every normal curve (regardless of its mean or standard deviation) conforms to the following “rule”.

  • About 68% of the area under the curve falls within 1 standard deviation of the mean.
  • About 95% of the area under the curve falls within 2 standard deviations of the mean.
  • About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
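The rule can be verified numerically. The sketch below again relies on `statistics.NormalDist`; the helper `area_within` and the second distribution (mean 100, standard deviation 15) are hypothetical, included only to show that the percentages do not depend on the mean or standard deviation.

```python
from statistics import NormalDist

def area_within(dist, k):
    """Area under the normal curve within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

standard = NormalDist(mu=0, sigma=1)
other = NormalDist(mu=100, sigma=15)  # hypothetical second distribution

for k in (1, 2, 3):
    print(k, round(area_within(standard, k), 4), round(area_within(other, k), 4))
```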

Pearson’s correlation: Pearson’s correlation coefficient can be computed if the dataset meets the following assumptions:

a) Normality: both variables are normally distributed

b) Linearity: the two variables are linearly related

c) Homoscedasticity: the spread of the data points around the regression line is similar across the dataset

Source
  • r = Pearson correlation coefficient
  • n = number of pairs of scores
  • ∑xy = sum of the products of the paired scores
  • ∑x = sum of the x scores
  • ∑y = sum of the y scores
  • ∑x² = sum of the squared x scores
  • ∑y² = sum of the squared y scores
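These symbols match the usual computational form of Pearson's coefficient, r = (n∑xy − ∑x∑y) / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]. This form is assumed here from the symbol list, since the formula itself is not reproduced in the text. A minimal sketch with made-up scores:

```python
from math import sqrt

def pearson_from_sums(xs, ys):
    """Pearson's r using the raw-sum form of the formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical paired scores, chosen only to illustrate the arithmetic.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r = pearson_from_sums(x_scores, y_scores)
print(round(r, 4))
```

For this data, n = 5, ∑x = 15, ∑y = 20, ∑xy = 66, ∑x² = 55 and ∑y² = 86, so r = 30 / √1500 ≈ 0.7746.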

Spearman’s correlation: This correlation can be used when the data does not follow a normal distribution, or when the variables are ordinal in nature. It is therefore a non-parametric test, and it measures the degree of association between 2 variables using the ranks of their values. The following is the formula:

Source
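The commonly used form of Spearman's coefficient is ρ = 1 − 6∑d² / (n(n² − 1)), where d is the difference between the two ranks of each observation and n is the number of observations. That is the form assumed in the sketch below, and it is exact only when there are no tied values. The helper names and the data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties assumed)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: y mostly rises with x, with two swapped neighbours.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rho = spearman_rho(x, y)
print(rho)
```

Here the rank differences are 0, -1, 1, -1, 1, so ∑d² = 4 and ρ = 1 − 24/120 = 0.8.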

Kendall’s rank correlation (Source): Like Spearman’s correlation, Kendall’s correlation is a non-parametric test and measures the strength of dependence between 2 variables. If we consider two samples, a and b, each of size n, the total number of pairings of observations is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:

Concordant: a pair of observations ordered the same way in both samples. Discordant: a pair ordered differently in the two samples.
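With these definitions, the usual form of the coefficient is τ = (concordant pairs − discordant pairs) / (n(n-1)/2). The sketch below assumes that form (often called tau-a) and data without ties; the function name and sample values are hypothetical.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau as (concordant - discordant) / (n(n-1)/2); no ties assumed."""
    n = len(a)
    concordant = discordant = 0
    # Visit every pair of observations exactly once.
    for (a_i, b_i), (a_j, b_j) in combinations(zip(a, b), 2):
        direction = (a_i - a_j) * (b_i - b_j)
        if direction > 0:
            concordant += 1  # ordered the same way in both samples
        elif direction < 0:
            discordant += 1  # ordered differently
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical samples of size n = 5, giving 5 * 4 / 2 = 10 pairs.
a = [10, 20, 30, 40, 50]
b = [1, 3, 2, 5, 4]

tau = kendall_tau(a, b)
print(tau)
```

Of the 10 pairs, 8 are concordant and 2 are discordant, so τ = (8 − 2) / 10 = 0.6.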
