All about Covariance in Data Science

This blog explains covariance, a very important topic in Feature Engineering in Data Science, and also covers its use-cases, advantages & disadvantages.

Harshit Dawar
The Startup
6 min read · Oct 31, 2020


Data Science is a very hot topic at present. Many students are choosing this field as their profession, and many working professionals are also shifting towards it after seeing its scope.

Since Data Science is so famous & such a hot topic, it attracts a lot of people, which is an amazing thing. On the other hand, when most people start learning in this field, they want to learn it as quickly as possible. But Data Science is a vast field; it can be considered an ocean of different concepts that have to be understood clearly in order to excel in it.

The major problem is that many people do not try to understand the concepts. As a consequence, in a field like Data Science many concepts are missed, and even the concepts that are learned are not properly understood.

Because of this learning approach, most people remember to find the correlation while doing Feature Selection, but they do not even know where Feature Selection lies in the Data Science pipeline. In addition, while finding the correlation, the various techniques are a must-learn, so that one understands why a particular technique is applied in a given situation, why not the others, and what the drawbacks of the other techniques are.

This blog will cover everything related to covariance, from its internal working to its significance and use-cases.

An In-Depth Introduction to Covariance!

Covariance is an integral technique for finding the correlation between the features of a dataset. It is used in Feature Selection, which in turn is a part of Feature Engineering.

Correlation

Correlation signifies the similarity between features, or it can be understood as the dependency between features. For example, consider 2 features, say employee salary & experience. In general, as experience increases, salary increases. This is the correlation between these 2 features, i.e. with an increase in one feature, the second feature also increases.

There can be 3 types of correlations between two features:

  1. With the increase in one feature value, the other feature value also increases (positive correlation).
  2. With the increase in one feature value, the other feature value decreases (negative correlation).
  3. With the increase in the value of one feature, there is no change in the value of the second feature (zero correlation).

All of the above-listed correlations can be found using the covariance technique. Now, let's understand how it works.

Covariance

Covariance uses the concept of variance to find the relationship between the features. It is calculated by the equation given below:

Covariance Equation for 2 features “x” & “y” [Image by Author!]
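Since the equation above is shown as an image, here is the sample covariance formula the caption refers to, written out for completeness (some texts divide by n instead of n − 1):

    \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Here x̄ and ȳ are the means of the two features and n is the number of samples.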

The above equation is very similar to the equation of variance because the variance equation is:

Variance Equation[Image by Author!]

The above variance equation can also be represented as:

Variance Equation Expanded [Image by Author!]
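Written out, the two forms of the variance equation referred to above are (again assuming the sample version with n − 1):

    \mathrm{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
                    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n - 1}

The expanded form on the right is what makes the connection to covariance easy to see.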

It can be clearly observed that the variance & covariance equations are exactly the same when the covariance of a variable is calculated with itself.
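A minimal sketch of this identity in Python using NumPy (np.cov uses the n − 1 denominator by default, so ddof=1 is passed to np.var to match):

    import numpy as np

    # Two small example features (values made up purely for illustration).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

    # np.cov returns the covariance matrix; the off-diagonal entry is cov(x, y).
    print("cov(x, y):", np.cov(x, y)[0, 1])

    # Covariance of a variable with itself is just its variance.
    print("cov(x, x):", np.cov(x, x)[0, 1])
    print("var(x):   ", np.var(x, ddof=1))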

How is correlation decided using Covariance?

Using the covariance equation shown above, the signs of the individual deviation terms for "x" & "y" are obtained, and then their product is taken, which tells whether the correlation is positive or negative.

Example of Positive Correlation:

Positive Correlation [Image by Author!]

In the above image, it is clear that the means of both axes, i.e. "x" & "y", lie somewhere within the axis ranges. Now, for any point plotted in the image, if the x coordinate of that point lies above the mean of x, then the y coordinate of that point will also lie above the mean of y, & vice versa. This makes both factors in the covariance equation positive, which produces a positive output (correlation).

Example of Negative Correlation:

Negative Correlation [Image by Author!]

In the above image, it is clear that the means of both axes, i.e. "x" & "y", lie somewhere within the axis ranges. Now, for any point plotted in the image, if the x coordinate of that point lies above the mean of x, then the y coordinate of that point will lie below the mean of y, & vice versa. This makes one of the factors in the covariance equation positive & the other negative; as a consequence, when they are multiplied, a negative output is produced.

Example of Zero Correlation:

Zero Correlation [Image by Author!]

In the above image, it is clear that the means of both axes, i.e. "x" & "y", lie somewhere within the axis ranges. Now, for any point plotted in the image, the x coordinate deviates from the mean of x by some amount, but the y coordinate shows no deviation from the mean of y, which makes one of the factors in the covariance equation 0. There is no need to look at the other factor, because the two are multiplied; therefore, the final result is 0.
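As a rough illustration of these three cases, here is a small sketch with synthetic data (the numbers are invented; with random noise the unrelated case will only be approximately zero):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1_000)
    noise = rng.normal(size=1_000)

    y_pos = x + 0.1 * noise           # moves with x    -> positive covariance
    y_neg = -x + 0.1 * noise          # moves against x -> negative covariance
    y_none = rng.normal(size=1_000)   # unrelated to x  -> covariance near zero

    print("positive: ", np.cov(x, y_pos)[0, 1])
    print("negative: ", np.cov(x, y_neg)[0, 1])
    print("near zero:", np.cov(x, y_none)[0, 1])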

Significance/Use-Case/Advantages of the Covariance!

Covariance signifies the correlation between features, which helps in Feature Selection. For example, if there are 5 features and 2 of them have a positive correlation, then one of those two features can be dropped, & fewer features will be used to train the model, which helps in faster training.

On the other hand, if 2 features are negatively correlated, then it becomes necessary to use both of them in the training because they are very highly dependent on each other.

Finally, if two features have zero correlation, then experimentation can be done to select features based on their importance in model training & evaluation.
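As a sketch of how this inspection could look in practice with pandas (the DataFrame and its column names are hypothetical, invented only for illustration):

    import pandas as pd

    # Hypothetical dataset with a few features.
    df = pd.DataFrame({
        "experience_years": [1, 3, 5, 7, 9],
        "salary":           [30, 45, 60, 80, 95],
        "commute_km":       [12, 5, 20, 8, 15],
    })

    # Pairwise covariance between every pair of features.
    print(df.cov())

    # If two features turn out to be strongly related, one of them could be
    # dropped before training, for example:
    reduced = df.drop(columns=["experience_years"])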

Disadvantages of Covariance!

There is only 1 disadvantage of covariance: it does not tell the strength with which the features are correlated. Covariance only tells the direction of the correlation between the features.
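One way to see this limitation: simply rescaling a feature (for example, expressing salary in rupees instead of thousands of rupees) changes the magnitude of the covariance even though the underlying relationship is unchanged. A small sketch, which also shows that the normalized version (the Pearson correlation coefficient) does not have this problem:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1_000)
    y = 2 * x + rng.normal(size=1_000)

    # Rescaling y inflates the covariance, even though the relationship is unchanged.
    print("cov(x, y):       ", np.cov(x, y)[0, 1])
    print("cov(x, 1000 * y):", np.cov(x, 1000 * y)[0, 1])

    # The Pearson correlation coefficient is scale-invariant, which is why it is
    # usually used to judge the strength of the relationship.
    print("corr(x, y):       ", np.corrcoef(x, y)[0, 1])
    print("corr(x, 1000 * y):", np.corrcoef(x, 1000 * y)[0, 1])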

I hope this article explains everything related to the topic with all the deep concepts and explanations. Thank you so much for investing your time in reading my blog & boosting your knowledge. If you like my work, please applaud this blog!

