Covariance and Correlation

Naman Grover
6 min read · May 31, 2019

Covariance and Correlation are two very important terms in the field of statistics and probability. Despite their many uses and applications, people often find these two topics confusing. Let us try to simplify and understand them. The main aim of this article is to define the terms Covariance and Correlation, differentiate between them, and discuss their applications.

Index:

Definition

Mathematical Explanation

Python Implementation

Applications

Conclusion

Definition

Both terms are used to determine the relationship and the dependency between two variables. What sets them apart is that in correlation the values are standardized, whereas in covariance they are not. Now, let us study them in a little more detail.

Covariance

It is a measure of the joint variability of two random variables. It describes both how far the variables are spread out and the nature of their relationship.

Types of covariance:

A positive covariance indicates that when one variable increases, the other also increases, and when one variable decreases, the other also decreases.

On the other hand, a negative covariance indicates that when one variable increases, the other decreases, and vice versa.

When the data is randomly spread out and does not seem to follow any specific trend, the covariance is nearly equal to zero.

Note: This measure is scale dependent because it is not standardized.
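Because covariance carries the product of the two variables' units, a change of units rescales it directly. A minimal NumPy sketch (with made-up height/weight numbers) illustrates this:

```python
import numpy as np

# Hypothetical paired measurements: heights and weights of five people
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])   # metres
weight_kg = np.array([55.0, 62.0, 70.0, 72.0, 85.0])  # kilograms

cov_metres = np.cov(height_m, weight_kg)[0, 1]
cov_centimetres = np.cov(height_m * 100, weight_kg)[0, 1]

# Converting metres to centimetres multiplies the covariance by 100,
# even though the underlying relationship is unchanged.
print(cov_metres, cov_centimetres)
```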

Correlation

Correlation measures both the strength and the direction of the linear relationship between the two variables. Correlation can be defined as the standardized form of covariance, which means it is safe to state that correlation is a function of covariance.

Because covariance is not standardized, it cannot be used to measure the strength of the linear relationship. So, in order to measure it on a standard scale of -1 to +1, we use the correlation coefficient.

The correlation coefficient, or Pearson coefficient of correlation, is a statistical measure of the degree to which changes in the value of one variable predict changes in the value of another. It is denoted by the Greek letter ρ ('rho').

Types of correlation:

A positive correlation indicates that when the value of one variable increases, the other also increases, and when it decreases, the other also decreases. In a positive correlation, the value of the correlation coefficient lies between 0 and +1.

A negative correlation indicates that when the value of one variable increases, the other decreases, and vice versa. In a negative correlation, the value of the correlation coefficient lies between -1 and 0.

No correlation is when the two variables are not dependent on each other and the value of the correlation coefficient is equal to 0.

A perfect correlation is when the value of the correlation coefficient is equal to +1 or -1.

Mathematical Explanation

Covariance

The covariance of two variables 'X' and 'Y' can be represented as cov(X,Y), which is defined as:

cov(X,Y) = E[(X - E[X])(Y - E[Y])]

Here, E[X] is the expected value or the mean value of the sample 'X' and E[Y] is the expected value or the mean value of the sample 'Y'. Expanding the product gives the generalized form of covariance: cov(X,Y) = E[XY] - E[X]E[Y].
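This identity can be checked numerically; replacing the expectations with sample means corresponds to NumPy's population covariance (ddof=0). A small sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# cov(X,Y) = E[XY] - E[X]E[Y], with expectations replaced by sample means
cov_identity = np.mean(x * y) - np.mean(x) * np.mean(y)

# np.cov with ddof=0 divides by n, matching the identity above
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

print(cov_identity, cov_numpy)  # both give the same value
```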

Some of the properties of covariance are as follows:

  1. The covariance of a variable with itself is its variance. For example, if we take Y = X, then the covariance can be written as

Cov(X,X) = E[X·X] - E[X]·E[X] = E[X²] - (E[X])² = Var(X)

2. The variance of the sum of two variables X and Y is given as

Var(X+Y) = Var(X) + Var(Y) + 2·Cov(X,Y)

3. The covariance of a variable and a constant is always Zero.

Cov(X,C) = 0

where ‘C’ is any constant

4. The covariance of a variable X and a sum of the other two variables Y+Z is given as

Cov(X,Y+Z) = Cov(X,Y) + Cov(X,Z)
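All four properties can be verified numerically. Here is a small sketch using the population covariance (division by n, so the identities hold exactly up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
z = rng.normal(size=100)

def cov(a, b):
    # population covariance: mean of (A - E[A]) * (B - E[B])
    return np.mean((a - a.mean()) * (b - b.mean()))

# 1. Cov(X,X) = Var(X)
assert np.isclose(cov(x, x), np.var(x))
# 2. Var(X+Y) = Var(X) + Var(Y) + 2*Cov(X,Y)
assert np.isclose(np.var(x + y), np.var(x) + np.var(y) + 2 * cov(x, y))
# 3. Cov(X,C) = 0 for a constant C
assert np.isclose(cov(x, np.full(100, 7.0)), 0.0)
# 4. Cov(X, Y+Z) = Cov(X,Y) + Cov(X,Z)
assert np.isclose(cov(x, y + z), cov(x, y) + cov(x, z))
print("all four properties hold")
```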

Another formula for the covariance, computed from a sample of n observations, is:

cov(X,Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)

where the sum runs over the n sample pairs (xi, yi). Here, 'n' is the number of samples in the dataset and (n - 1) in the denominator indicates the degrees of freedom.

(Degrees of freedom is the number of observations or independent pieces of information in the data that are free to vary when estimating the parameters.)
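The sample formula can be checked against NumPy's built-in np.cov, which also divides by (n - 1) by default. A quick sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])
n = len(x)

# sum of (x_i - x_bar)(y_i - y_bar), divided by the degrees of freedom (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov divides by (n - 1) by default, so the two values agree
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)
```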

Correlation

As discussed above, correlation is the standardized form of covariance and is therefore defined as

ρ(X,Y) = cov(X,Y) / (σX · σY)

here, ρ(X,Y) is the correlation coefficient, and σX and σY are the standard deviations of sample 'X' and sample 'Y' respectively.

The value of the correlation coefficient can range from -1 to +1. The closer the value is to -1 or +1, the more closely the two variables are related, with the + and - signs indicating positive and negative correlation respectively.

From the above explanation, we can see that covariance is measured in units (the product of the units of the two variables), whereas correlation is dimensionless: it is a unit-free measure of the relationship between variables.
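This unit-freeness is easy to demonstrate: rescaling one variable (e.g. a change of units) changes the covariance but leaves the correlation untouched. A minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

x_scaled = 1000.0 * x  # e.g. converting kilometres to metres

cov_before = np.cov(x, y)[0, 1]
cov_after = np.cov(x_scaled, y)[0, 1]         # 1000 times larger
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x_scaled, y)[0, 1]   # unchanged

print(cov_before, cov_after)
print(corr_before, corr_after)
```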

Python Implementation

In Python, we can implement them as follows:

  1. Importing the required libraries

from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

2. Generating a random two-feature sample using make_blobs from sklearn (the second return value, Y, holds the cluster labels and is not used below)

X, Y = make_blobs(n_samples=20, n_features=2, centers=1, random_state=5)

3. Plotting the generated sample using matplotlib.pyplot

plt.scatter(X[:, 0], X[:, 1], c='blue')
plt.show()

4. Defining the covariance and correlation functions using the mathematical formulas discussed above

def covar(x, y):
    n = len(x)
    return np.sum((x - np.mean(x)) * (y - np.mean(y))) / (n - 1)

def correl(x, y):
    # ddof=1 so the standard deviations use (n - 1), matching covar
    return covar(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

NOTE: We could directly use np.cov and np.corrcoef to find the covariance and the correlation coefficient, but I've defined these functions according to the formulas above in order to give a better understanding of how they work.

5. Printing the final values by calling the covar and correl functions defined above

print("COVARIANCE:", covar(X[:, 0], X[:, 1]), " CORRELATION:", correl(X[:, 0], X[:, 1]))

This prints the sample covariance and the correlation coefficient between the two features of the generated sample.

Applications

Now, let us discuss some of the main features of both Covariance and Correlation.

The main application of both is to check the extent of the linear relationship between two variables. Correlation analysis, as many analysts know, is a vital tool for feature selection and multivariate analysis in data preprocessing and exploration.

They are widely used for data embedding, dimensionality reduction, and feature extraction. A key example of this is Principal Component Analysis (PCA).

The Covariance between the variables in a dataset can help reveal a lower dimension space that can still capture the majority of the variance in the data i.e., it may be possible to combine the variables that are highly correlated (have high covariance) without losing too much information.

Correlation analysis plays a vital role in locating the important variables on which the other variables depend. It is also used as the foundation for various modeling techniques. Proper correlation analysis leads to a better understanding of data.
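As a concrete (toy) illustration of the PCA idea, the covariance matrix of a dataset with two highly correlated features can be eigendecomposed, and most of the variance lands on a single principal component. This is a minimal NumPy sketch, not a full PCA implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=200)
# Feature 2 is almost a copy of feature 1; feature 3 is independent noise.
data = np.column_stack([a,
                        a + 0.05 * rng.normal(size=200),
                        rng.normal(size=200)])

# PCA via eigendecomposition of the covariance matrix of the centred data
centred = data - data.mean(axis=0)
cov_matrix = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_matrix)  # eigenvalues in ascending order

# Fraction of total variance explained by each component (largest first)
explained = eigvals[::-1] / eigvals.sum()
print(explained)  # the first component carries most of the variance

# Project onto the top principal component: 3 features reduced to 1
reduced = centred @ eigvecs[:, -1]
```

Scikit-learn's sklearn.decomposition.PCA performs the equivalent computation (via SVD) for production use.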

(Figure: Feature Extraction using PCA)

Conclusion

Correlation and covariance are very closely related to each other, and yet they differ a lot. Covariance defines the type of interaction between two variables, while correlation defines not only the type but also the strength of the relationship. For this reason, correlation is often described as a special case of covariance. If one must choose between the two, most analysts prefer correlation, as it remains unaffected by changes in dimension, location, and scale. Also, since it is limited to the range -1 to +1, it is useful for drawing comparisons between variables across domains. However, an important limitation is that both these concepts measure only the linear relationship between variables.
