Covariance and Correlation — Part-1, First Date

Suraj Regmi
Published in The Startup · May 8, 2020

Covariance and correlation are household terms for people working in statistics, data science, economics, and other quantitative fields. Correlation, which more people have heard of, is more popular and intuitive than covariance, thanks to its etymology and its interpretation-friendly mathematical structure. However, correlation itself comes from covariance. Let's go on our first date with these twins: covariance, the elder one, and correlation, the younger.

The First Date

You, my friend who is reading this, and I are at the same table. A square, four-legged table smiles in awe at the double date we are having: a date with covariance and correlation. You, a big fan of the game call break, proposed that we play it there. I, an ardent fan of call break myself, echoed your proposal before your voice could die out. We played a few rounds: enthusiastic waits for the cards to be fully dealt, brave calls to make the game more competitive, and cute attempts to bring luck to one's side. The game was filled with fun, excitement, nervousness, and energy.

Photo by Davids Kokainis on Unsplash

While thinking about the calls I should be making, one thing occurred to me: the relation between the number of spades I hold and the number of calls I make. Both are random variables, and the number of calls may well depend on the number of spades. Let's define a random variable X as the number of spades and Y as the number of calls I make. Now, I am interested in the covariance and correlation between these random variables, to see whether they are dependent.

But what is covariance?

Covariance between X and Y is the degree of joint variability of the two random variables.

Mathematically, it is written as:

Cov(X, Y) = E[(X - E(X))(Y - E(Y))]   (1)

We can get an intuition about covariance from this expression. Since covariance is the expectation of a product of differences, its value depends on the magnitudes of the differences and on their signs. If the differences (X - E(X)) and (Y - E(Y)) have the same sign, the product is positive; if their signs differ, the product is negative. So covariance is large and positive when X and Y both tend to be above (or both below) E(X) and E(Y) at the same time, and negative when one tends to be above its mean while the other is below. In this way, covariance can be understood as the degree to which the variables vary together.

(1) can be expanded to give another expression for covariance.

Cov(X, Y) = E[XY] - E[X]E[Y]
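To see this, expand the product inside the expectation and use linearity, remembering that E(X) and E(Y) are just constants:

Cov(X, Y) = E[XY - X E(Y) - Y E(X) + E(X)E(Y)]
= E[XY] - E(X)E(Y) - E(Y)E(X) + E(X)E(Y)
= E[XY] - E[X]E[Y]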

Covariance calculation

So, I noted the number of spades I had and the number of calls I made in a table. A table like this can be simulated with a few lines of Python.
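Here is a minimal sketch of such a simulation. The hypergeometric draw models 13 cards dealt from a 52-card deck containing 13 spades; the call rule is just a rough, made-up heuristic, not real call break strategy:

```python
import numpy as np

rng = np.random.default_rng(42)

def play_round():
    # Each call break player is dealt 13 of the 52 cards, so the
    # number of spades in a hand is hypergeometric: 13 "good" cards
    # (spades), 39 "bad" ones, 13 drawn.
    spades = rng.hypergeometric(ngood=13, nbad=39, nsample=13)
    # Rough heuristic: the call tracks the spade count, with some
    # noise standing in for the strength of the other suits.
    call = max(1, int(spades) + int(rng.integers(-2, 2)))
    return [int(spades), call]

data = [play_round() for _ in range(20)]
print(data)
```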

So, I had a dataset of 20 points, each of the form [number of spades, number of calls]:

[5, 4], [2, 2], [5, 3], [4, 2], [4, 4], [3, 2], [2, 1], [3, 2], [2, 1], [3, 1], [2, 1], [2, 1], [2, 2], [1, 1], [4, 2], [5, 4], [2, 1], [2, 4], [3, 4], [4, 4]

Now, I could calculate the covariance between X and Y using the dataset.

Cov(X, Y) = Σ_i (x_i - E[X])(y_i - E[Y]) / (N - 1), where E[X] and E[Y] are the expected values (means) of X and Y respectively, and N is the number of data points.

Calculating that, we got Cov(X, Y) = 1.
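We can verify this with a few lines of NumPy; note that np.cov uses the same N - 1 denominator by default:

```python
import numpy as np

x = np.array([5, 2, 5, 4, 4, 3, 2, 3, 2, 3, 2, 2, 2, 1, 4, 5, 2, 2, 3, 4])
y = np.array([4, 2, 3, 2, 4, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 4, 1, 4, 4, 4])

# Sample covariance with the N - 1 denominator, as in the formula above.
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(cov)                 # 1.0
print(np.cov(x, y)[0, 1])  # NumPy's covariance matrix gives the same value
```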

The positive value denotes that they are positively related: the directions of change tend to be the same, but the number itself tells us nothing about the strength of the relationship. This difficulty in interpretation, and the fact that covariance is not a standardized value, gives rise to another concept called correlation. If we multiplied the whole dataset by 10, the covariance would become 100. Although the covariances differ in these two cases, the relation between the variables has not changed. This is exactly where correlation becomes significant.

Standardization and Correlation

The scale that comes from the differences (X - E(X)) and (Y - E(Y)) is what makes covariance non-standard. We can standardize these differences by dividing them by the corresponding standard deviations.

So, the new expression is: E[((X - E(X)) / SD(X))((Y - E(Y)) / SD(Y))]

This is called correlation. Pulling the constants SD(X) and SD(Y) out of the expectation, we get:

Corr(X, Y) = E[(X - E(X)) (Y - E(Y))] / (SD(X)SD(Y))

Corr(X, Y) = Cov(X, Y) / (SD(X)SD(Y))   (2)

Now, putting the standard deviation values SD(X) = 1.21 and SD(Y) = 1.26 into (2):

Corr(X, Y) = 1 / (1.21 * 1.26) ≈ 0.65

So, this is the standard indicator. The correlation coefficient is 0.65, so X and Y are moderately correlated. As call break players would guess, the relationship is plausible, and our mathematics supports it.

If the dataset is scaled by a factor of 10, the standard deviations also get scaled by a factor of 10 (and the covariance by 100, as noted above). The scales cancel each other, and the same value of correlation is obtained.
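We can check both the correlation value and this scale invariance in NumPy:

```python
import numpy as np

x = np.array([5, 2, 5, 4, 4, 3, 2, 3, 2, 3, 2, 2, 2, 1, 4, 5, 2, 2, 3, 4])
y = np.array([4, 2, 3, 2, 4, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 4, 1, 4, 4, 4])

# Correlation = covariance / (SD(X) * SD(Y)), with the N - 1 convention.
cov = np.cov(x, y)[0, 1]
corr = cov / (x.std(ddof=1) * y.std(ddof=1))
print(corr)                               # ≈ 0.653
print(np.corrcoef(x, y)[0, 1])            # same value from NumPy directly
print(np.corrcoef(10 * x, 10 * y)[0, 1])  # scaling by 10 changes nothing
```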

The value of correlation varies between -1 and 1: values near -1 indicate strong negative correlation, values near 1 indicate strong positive correlation, and values near 0 indicate very weak to no correlation.

Why is correlation bounded above by 1 and below by -1?

Let's look at the variances of X + Y and X - Y.

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Var(X - Y) = Var(X) + Var(Y) - 2Cov(X, Y)

Note that the variance of -Y is the same as that of Y, while Cov(X, -Y) = -Cov(X, Y), which is where the minus sign in the second identity comes from.

We are dealing with correlation, so we can take X and Y to be standardized random variables (mean 0 and variance 1). The covariance of standardized random variables is exactly their correlation.

As the random variables are standardized, the variances of X and Y are both 1. Also, variance is always greater than or equal to 0.

So, 0 ≤ Var(X + Y) = 2 + 2Cov(X, Y)
0 ≤ Var(X - Y) = 2 - 2Cov(X, Y)

The first inequality gives Cov(X, Y) ≥ -1 and the second gives Cov(X, Y) ≤ 1, so -1 ≤ Cov(X, Y) ≤ 1.

Therefore, correlation is bounded above by 1 and below by -1.
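As a quick numerical sanity check of the proof, we can standardize an arbitrary correlated pair and confirm the two variance identities:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # an arbitrary correlated pair

# Standardize both variables to mean 0 and variance 1.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

c = np.mean(zx * zy)  # covariance of the standardized pair = correlation
print(np.var(zx + zy), 2 + 2 * c)  # equal, and always >= 0
print(np.var(zx - zy), 2 - 2 * c)  # equal, and always >= 0
```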

Conclusion

As the correlation is nonzero and in the medium range, the random variables X and Y are dependent on each other. So, I would wish for as many spades as possible to justify making higher calls.

Correlation (and of course covariance) is a very important metric for seeing the relation between variables. During data cleaning, redundant variables are often dropped with the help of the correlations between them. However, correlation does not imply causation: two variables being correlated might mean that one causes the other, or that both are caused by some third variable.

That was our first date with covariance and correlation: we introduced the terms, saw their mathematical definitions, applied them to a call break example, and proved why correlation must lie between -1 and 1. On the second date (in Part 2), we will extend this idea to vectors, introduce the covariance matrix, and more. Stay tuned!


Suraj Regmi

Data Scientist at Blue Cross and Blue Shield, MS CS from UAH — the views and the content here represent my own and not of my employers.