Covariance and Correlation

Adam Zerner
Nov 1, 2016


  • As you increase a person’s height, their weight tends to increase as well.
  • As you increase a person’s age, their net worth tends to increase as well.
  • As you increase a person’s age, how fast they can sprint 40 yards tends to decrease (once you’re older than, say, 30).
  • As you spend more time in school, your income tends to increase as well.

There are two questions:

  1. When X gets bigger, does Y get bigger, or does it get smaller? (direction)
  2. Does Y get a lot bigger/smaller, or just a little bit? (strength)

Big picture:

  • Covariance answers the first question. Using it to answer the second is very iffy.
  • Correlation answers both questions. To do so, it starts off by calculating the covariance, and then makes adjustments.

Covariance

Calculating

Let’s look at the height and weight of the players on the San Antonio Spurs:

  1. Find the averages:

  2. Let’s see how far off each player is from the average height/weight:

  3. Multiply those differences:

  4. Add them up:

  5. Divide by the number of players:

1175 / 15 = 78.3

The covariance is 78.3.
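The five steps above can be sketched in a few lines of Python. The heights and weights below are made-up illustrative numbers, not the Spurs roster used in this post, and `covariance` is just a name chosen for this sketch.

```python
def covariance(xs, ys):
    """Population covariance: the average product of deviations from the mean."""
    n = len(xs)
    # Step 1: find the averages.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: how far off each value is from its average.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Step 3: multiply those differences pairwise.
    products = [a * b for a, b in zip(dx, dy)]
    # Steps 4 and 5: add them up and divide by the number of points.
    return sum(products) / n

# Hypothetical data (inches, pounds), not the actual roster.
heights = [72, 75, 78, 81, 84]
weights = [185, 200, 220, 240, 255]
print(covariance(heights, weights))  # 108.0
```

Note that this divides by n, as the steps above do. Library routines such as `numpy.cov` divide by n - 1 by default, so they will give a slightly larger number on the same data.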

Formula

Remember what this is doing:

  1. Find the averages.
  2. Calculate the difference between each x and the average of x, and each y and the average of y.
  3. Multiply those together.
  4. Add them up.
  5. Divide by n.

Alternatively, this can be thought of as an expected value, i.e., “how big do we expect the product of x’s deviation from its mean and y’s deviation from its mean to be?”.
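Written out in symbols, the steps above correspond to the following. This is a reconstruction from the listed steps, using \bar{x} and \bar{y} for the averages and n for the number of points:

```latex
\operatorname{cov}(X, Y)
  = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  = E\big[(X - E[X])\,(Y - E[Y])\big]
```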

Interpretation: direction

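A positive covariance means that Y tends to get bigger as X gets bigger; a negative covariance means that Y tends to get smaller. Let’s think about why this is true. We’ll start with the positive case: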

In the above diagram, “small negative” might mean “the difference between this x value and the mean x value is a small negative number”. More generally, the text in the above diagrams refers to the difference between the observed x value and the mean x value (or the observed y value and the mean y value).

Note that they all result in positive values. In my examples, the product is always “really big positive” or “medium positive”. So points on an upward sloping line provide a positive covariance.

Similarly, in the negative case:

Note that they all result in negative values. So points on a downward sloping line provide a negative covariance.

With this, we can say that a positive covariance indicates a direct relationship between the variables (as x increases, y tends to increase), and a negative covariance indicates an inverse relationship between the variables (as x increases, y tends to decrease).

(Also note that points far from the middle have much higher weights than points closer to the middle.)

Another way of thinking about this:

In the picture above, the middle point is (0,0). Instead, imagine that it’s (x average, y average). Also, instead of coordinates meaning “x value” and “y value”, imagine that they mean “x value - x average” and “y value - y average”.

With covariance, we multiply both values. So quadrants I and III give us positive values, and quadrants II and IV give us negative values.
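Here is a small sketch of the quadrant idea, using made-up points. After subtracting the averages, each point's product is positive in quadrants I and III and negative in II and IV. The helper name `deviation_products` is hypothetical.

```python
def deviation_products(points):
    """Return (x - mean_x) * (y - mean_y) for each (x, y) point."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return [(x - mean_x) * (y - mean_y) for x, y in points]

# Made-up points on an upward-sloping line and a downward-sloping line.
upward = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
downward = [(1, 10), (2, 8), (3, 6), (4, 4), (5, 2)]

print(deviation_products(upward))    # every product is >= 0
print(deviation_products(downward))  # every product is <= 0
```

The points at the ends of each line produce products of magnitude 8, while the points next to the middle produce only 2, which is the "far points weigh more" effect mentioned above.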

Interpretation: strength

It’s iffy to say “the covariance is large, so there is a strong relationship between the numbers” or “the covariance is small, so there is a weak relationship between the numbers”.

Why? Units. How can we say what “large” is?

Let’s say that we were measuring player heights in miles, and player weights in tons. We’d be dealing with really small numbers. Patty Mills is 0.00113636 miles tall, and weighs 0.0925 tons.

Covariance = 0.00000914 / 15 = .00000061

As you can see above, dealing with such small numbers gives us a really small covariance. Does this mean that the relationship is weak? Not necessarily.

In contrast, let’s say that we were measuring player heights in millimeters, and player weights in grams. We’d be dealing with really big numbers. Patty Mills is 1,828.8 millimeters tall, and weighs 83,914.6 grams.

Covariance = 13,537,388.70 / 15 = 902,492.58

As you can see above, dealing with such big numbers gives us a really big covariance. Does this mean that the relationship is strong? Not necessarily.

What we want to do is somehow control for the units. The relationship between player heights and weights should be the same regardless of what units we use to measure them. Which brings us to…
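To see the unit problem concretely, here is a rough sketch with made-up heights and weights (not the roster data). Converting inches to millimeters and pounds to grams multiplies the covariance by the product of the two conversion factors, even though the players themselves have not changed.

```python
def covariance(xs, ys):
    """Population covariance (divide by n), as in the steps earlier in the post."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data in inches and pounds.
heights_in = [72, 75, 78, 81, 84]
weights_lb = [185, 200, 220, 240, 255]

MM_PER_INCH = 25.4
GRAMS_PER_POUND = 453.592

heights_mm = [h * MM_PER_INCH for h in heights_in]
weights_g = [w * GRAMS_PER_POUND for w in weights_lb]

cov_in_lb = covariance(heights_in, weights_lb)
cov_mm_g = covariance(heights_mm, weights_g)

print(cov_in_lb)             # covariance in inch-pounds
print(cov_mm_g)              # same data, vastly larger number
print(cov_mm_g / cov_in_lb)  # roughly 25.4 * 453.592
```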

Correlation

Think back to z-scores. If I say “my value is 6.8 less than the mean”, what would that mean to you? Is it far from the mean? Close to it? You can’t know!

What if I told you that the mean is 78.8, and the numbers are being measured in inches? Ok, that’s helpful — it’s 6.8 inches shorter than the average of 78.8. But we still don’t know how far away it is relative to the other values.

What if I told you that the standard deviation is 3.54? Now we’re talking! Now we can say that the value is 1.92 standard deviations less than the mean.

z-scores allow us to standardize. We don’t have to be using units to interpret the data. We don’t have to be thinking about the other values and how close our value is to the mean relative to how close the other values are to the mean. We just have one unitless number that is easy to interpret.
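The arithmetic behind that 1.92, as a tiny sketch. The numbers come from the example above; the function name `z_score` is just a label for this illustration.

```python
def z_score(value, mean, std_dev):
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev

mean_height = 78.8         # inches, from the example above
std_dev = 3.54             # inches, from the example above
value = mean_height - 6.8  # "6.8 less than the mean"

print(round(z_score(value, mean_height, std_dev), 2))  # -1.92
```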

What if we did something similar to our covariance calculations?

In columns 4 and 5 (Actual height - average height and Actual weight - average weight) we said “the given value is this far from the mean”. What if we standardized it? What if we said “the given value is ___ standard deviations away from the mean”?

Ah, that’s better! Immediately, I’m now starting to understand the data much better. I have a much better sense of how much variability there is. And I could see that there’s a bit more variability in height than there is in weight. I wouldn’t be able to see this stuff without a standardized metric like z-scores.

I wonder what would happen if we continued the covariance calculations starting with these normalized values instead of the actual values…

12.35 / 15 = .82

This “standardized covariance” calculation is actually called correlation!
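As a sketch of this "standardized covariance" idea, the code below converts each value to a z-score and then averages the products. The data is made up (not the roster), the function names are chosen for this illustration, and the standard deviation divides by n to match the covariance steps earlier in the post.

```python
import math

def mean(values):
    return sum(values) / len(values)

def population_std(values):
    """Standard deviation dividing by n, to match the covariance steps."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def z_scores(values):
    m, s = mean(values), population_std(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Average product of z-scores: the 'standardized covariance'."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical data, first in inches/pounds, then in millimeters/grams.
heights_in = [72, 75, 78, 81, 84]
weights_lb = [185, 200, 220, 240, 255]
heights_mm = [h * 25.4 for h in heights_in]
weights_g = [w * 453.592 for w in weights_lb]

print(correlation(heights_in, weights_lb))
print(correlation(heights_mm, weights_g))  # same value, up to rounding
```

Unlike the covariance, the result does not change when the units change, because the conversion factors cancel out inside each z-score.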

Formula

Intuitive formula: “Take the covariance using the normalized X and Y values instead of the real ones”

Rearranging with algebra… we arrive at the formula for correlation that most people have seen and use:

This formula is easier to plug numbers into… but the first one is way more intuitive! I think it’s important to differentiate between the intuitive form, and the form that’s easier to work with.
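In standard notation, the two forms look like this, assuming population standard deviations \sigma_x and \sigma_y (dividing by n, as in the covariance steps above). The first expression is the intuitive form and the second is the rearranged one:

```latex
r = \frac{1}{n} \sum_{i=1}^{n}
      \left(\frac{x_i - \bar{x}}{\sigma_x}\right)
      \left(\frac{y_i - \bar{y}}{\sigma_y}\right)
  = \frac{\operatorname{cov}(X, Y)}{\sigma_x \, \sigma_y}
```

The rearrangement works because \sigma_x and \sigma_y are constants, so they can be pulled out of the sum.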

It’s between -1 and 1

Huh? How? z-scores don’t have to be between -1 and 1. I mean, they usually are, but they don’t have to be. I know they’re being multiplied, and it’s unlikely for both z-scores to be so far from 1/-1… but it’s still possible! What if all of our z-scores were in the hundreds?!

What if we try to make our values really far from the mean, so that we get big z-scores:

If we do that, we get correspondingly huge standard deviations. Think about it: standard deviation measures (roughly) how far values typically are from their average, and here each value is very, very far from its average.

The standard deviation of this data set is 500,000,000 (for both x and y), so the z-scores are all either 1 or -1.

What if we add a bunch of intermediate values to decrease the standard deviation?

50 data points of (5 billion, 5 billion), 1 of (0,0) and 1 of (10 billion, 10 billion)
  • Mean of x = 5 billion; mean of y = 5 billion.
  • Standard deviation of x = 1 billion; standard deviation of y = 1 billion (about).

Here we’ve successfully decreased the standard deviation such that (0,0) and (10 billion, 10 billion) have large z-scores (-5 and 5). However, we end up dividing by a big number because we have all those intermediate values. And unfortunately, those intermediate values aren’t doing their part in increasing the sum.

Correlation ≈ 50 / 50 = 1 (with the unrounded figures and all 52 points, it is 52 / 52 = 1)
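The example above can be checked numerically. This sketch builds the 52 points described and computes everything with population formulas (dividing by n); the variable names are chosen here for illustration.

```python
import math

# 50 points at (5 billion, 5 billion), plus one at each extreme.
B = 1_000_000_000
points = [(5 * B, 5 * B)] * 50 + [(0, 0), (10 * B, 10 * B)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)  # 52

mean_x = sum(xs) / n
mean_y = sum(ys) / n
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

zx = [(x - mean_x) / std_x for x in xs]
zy = [(y - mean_y) / std_y for y in ys]

print(std_x / B)         # about 0.98, i.e. roughly 1 billion
print(min(zx), max(zx))  # about -5.1 and 5.1
print(sum(a * b for a, b in zip(zx, zy)) / n)  # 1.0, up to rounding
```

The two extreme points have large z-scores, but the 50 middle points contribute nothing to the sum while still counting toward n, so the average product stays at 1.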

What if we make the intermediate values far enough from the mean so that they too influence the sum?

If we make them bigger, we’ll also be increasing the standard deviations, which will make the outlier points contribute less to the sum. We can’t have it both ways!
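One way to make this tradeoff precise, offered as a sketch rather than as the argument this post has in mind, is the Cauchy-Schwarz inequality. Writing z_{x,i} and z_{y,i} for the z-scores (symbols introduced here for illustration), the squared z-scores of each variable always sum to n when the standard deviation divides by n, so:

```latex
\left| \sum_{i=1}^{n} z_{x,i}\, z_{y,i} \right|
  \le \sqrt{\sum_{i=1}^{n} z_{x,i}^2}\; \sqrt{\sum_{i=1}^{n} z_{y,i}^2}
  = \sqrt{n}\,\sqrt{n} = n
\quad\Longrightarrow\quad
|r| = \frac{1}{n} \left| \sum_{i=1}^{n} z_{x,i}\, z_{y,i} \right| \le 1
```

In other words, a few huge z-scores must be paid for by many small ones, and the sum of products can never outgrow n.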

Strength

What constitutes a strong correlation? A weak one?
