A mathematical derivation of the Law of Total Variance

Alexandre Henrique
7 min read · Apr 26, 2020



I stumbled upon an interesting theorem a couple of days ago: the Law of Total Variance. The theorem is built on top of several other concepts that are not easy to reason about, and because of that, it is not an intuitive fact. I believe that learning by teaching is extremely effective, so I'll try to explain it here, exploring along the way all the concepts I judge necessary to understand it, and hopefully addressing your questions on this matter.

Several insights can be obtained from an experiment once we have a fully specified probabilistic model, comprising a probability law and a well-defined sample space.

A nice mathematical tool arises from probabilistic models: the so-called random variable. Along with discrete and continuous random variables come the Probability Mass Function (PMF, for short) and the Probability Density Function (PDF), respectively. Once equipped with those weapons, we are capable of calculating expectations and variances.

Expectations and variances are very powerful summary statistics of a random variable, which is why they play a central role when dealing with probabilistic models.

To get to the Law of Total Variance, we'll explore some concepts along the way: variances, conditioning, expectations, and the Law of Iterated Expectations.

Understanding what Variance is

At this point, it is important to settle some intuition about the variance itself. Mathematically, given a random variable (r.v.) X, the variance is defined in terms of the expectation (mean) of X:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big]$$

This equation tells us that the variance is a quantity that measures how much the r.v. X is spread around its mean. Simply put, the variance is the average squared deviation of X from its mean. Since it is the expectation of a squared quantity, the variance is always nonnegative:

$$\mathrm{Var}(X) \geq 0$$

Using linearity of expectation, it is easy to derive an alternative equation for the variance in terms of the first and second moments of X:

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
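For completeness, here is the short derivation: expand the square inside the expectation and use linearity, remembering that E[X] is a constant:

$$
\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - E[X])^2\big] \\
&= E\big[X^2 - 2XE[X] + (E[X])^2\big] \\
&= E[X^2] - 2E[X]E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned}
$$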

To finish up this brief introduction to the concept of variance, I'd like to point out that wherever you study this topic you'll encounter some advice about how the variance itself is not an intuitive quantity. This is due to its units: suppose X represents money, then the variance of X would be in units of squared money. There is no such thing as “squared money”. Hence, the variance is an abstract mathematical summary statistic.

Not surprisingly, there is another summary statistic that circumvents this limitation: the standard deviation. The standard deviation is defined as the square root of the variance.

$$\sigma_X = \sqrt{\mathrm{Var}(X)}$$

Intuition and Context

Now let's talk about when it may be useful to calculate variances with the Law of Total Variance, and learn about the Law of Iterated Expectations, a key result in its derivation.

When to apply the Law of Total Variance

Suppose that in an experiment we are interested in analyzing some quantity described by a random variable X.

[Figure: Venn diagram of an experiment with associated r.v. X]

Suppose also that we don't have enough information about X directly, so it becomes impossible (or very hard) to obtain the variance of X.

A common strategy in probability problems is to condition the random variable of interest on the value of another random variable. Stated differently, we draw each realization y of a random variable Y from its own (unconditional) distribution, and then study X in the conditional universe where Y = y.

Often, breaking a problem into smaller problems turns out to be a better strategy than trying to solve the original problem.

Continuing with our example, imagine that the values of Y partition the sample space of X, as below.

[Figure: the random variable X sampled in the conditional universes of the random variable Y]

This is a good example of when to apply the Law of Total Variance. Among many other scenarios, the LTV is convenient when we want to calculate the variance of a random sum of random variables.
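A classic instance of that last scenario: if S = X₁ + … + X_N is a sum of i.i.d. random variables, where the number of terms N is itself random and independent of the Xᵢ, then conditioning on N and applying the LTV gives the well-known formula

$$\mathrm{Var}(S) = E[N]\,\mathrm{Var}(X) + (E[X])^2\,\mathrm{Var}(N)$$

since Var(S | N) = N Var(X) and E[S | N] = N E[X].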

Law of Iterated Expectations (LIE)

The Law of Iterated Expectations is a key theorem for building mathematical reasoning about the Law of Total Variance. Therefore, it is of the utmost importance that we understand it.

The Law of Iterated Expectations states that if Y is a random variable on the same probability space as X, then

$$E[X] = E\big[E[X \mid Y]\big]$$

Take for instance the example in which we sampled X on the values of Y. The LIE tells us that the mean of X is the mean of the individual means of X given Y for all possible values of Y.
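In the discrete case, the outer expectation unrolls into a weighted average of the conditional means, weighted by how likely each conditional universe is:

$$E[X] = \sum_{y} E[X \mid Y = y]\, P(Y = y)$$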

To give you a more intuitive sense, consider a classroom where a quiz is given. We distribute the students equally into 3 sections (Y); the sections have mean quiz scores (E[X | Y = y]) of 70, 60, and 80, respectively. What is the overall mean quiz score?

By the Law of Iterated Expectations, the overall mean of X is the average of the individual quiz means of all sections:

$$E[X] = 70 \cdot \tfrac{1}{3} + 60 \cdot \tfrac{1}{3} + 80 \cdot \tfrac{1}{3} = 70$$

Mathematical Derivation of the Law of Total Variance

Everything we’ve talked about so far converges to the equation of the Law of Total Variance. We won’t walk through a rigorous proof of it. Nevertheless, you can find one here.

Instead, we’ll use the equations and intuitive arguments we’ve seen until this point. The derivation is provided below, pointing out each mathematical argument used in the process:

$$
\begin{aligned}
\mathrm{Var}(X) &= E[X^2] - (E[X])^2 && (1) \\
E[X^2] &= E\big[E[X^2 \mid Y]\big] && (2) \\
E[X] &= E\big[E[X \mid Y]\big] && (3) \\
\mathrm{Var}(X \mid Y) &= E[X^2 \mid Y] - (E[X \mid Y])^2 && (4) \\
E[X^2 \mid Y] &= \mathrm{Var}(X \mid Y) + (E[X \mid Y])^2 && (5) \\
\mathrm{Var}(X) &= E\big[\mathrm{Var}(X \mid Y) + (E[X \mid Y])^2\big] - \big(E[E[X \mid Y]]\big)^2 && (6) \\
\mathrm{Var}(X) &= E[\mathrm{Var}(X \mid Y)] + E\big[(E[X \mid Y])^2\big] - \big(E[E[X \mid Y]]\big)^2 && (7) \\
\mathrm{Var}(X) &= E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}\big(E[X \mid Y]\big) && (8)
\end{aligned}
$$

Step (1) is the variance written in terms of moments; (2) and (3) apply the LIE to X² and to X; (4) is the definition of variance in the conditional universe Y = y; (5) rearranges (4); (6) substitutes (2), (3), and (5) into (1); (7) uses linearity of expectation; and (8) recognizes the last two terms of (7) as (1) applied to the random variable E[X | Y].

The conclusion is shown at (8):

$$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}\big(E[X \mid Y]\big)$$

This equation is powerful and must not be underestimated. I find it particularly hard to interpret each term on its right-hand side.

Interpretation of the Law of Total Variance

To understand the Law of Total Variance, let's reason about the terms on the right-hand side of the LTV equation. Note that, before the outer expectation and variance are applied, Var(X | Y) and E[X | Y] are themselves random variables: they are functions of Y. Let's explore each term:

What is E[Var(X | Y)]?

Simply put, this quantity is the average of the variance of X over all possible values of the random variable Y.

In other words: take the variance of X in each conditional space of Y = y. Then, take the average of the variances. This is called the average within-sample variance.
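In the discrete case, it unrolls just like the LIE did:

$$E[\mathrm{Var}(X \mid Y)] = \sum_{y} \mathrm{Var}(X \mid Y = y)\, P(Y = y)$$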

What is Var(E[X | Y])?

This is the trickiest one. If we have already averaged the variances within each conditional universe of X sampled on Y, why is this term still relevant?

Note that the first term, E[Var(X | Y)], only considers the average of the variances of X | Y. It does not take into account the movement of the mean itself, just the variation about each, possibly different, mean.

That’s where Var(E[X | Y]) comes in: If we treat each Y= y as a separate “treatment”, then the first term is measuring the average within-sample variance, while the second is measuring the between-sample variance.
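Back to the classroom example: the section means were 70, 60, and 80, and by the LIE the overall mean was 70. The between-sample term measures how much the section means themselves move around that overall mean:

$$\mathrm{Var}\big(E[X \mid Y]\big) = \tfrac{1}{3}(70 - 70)^2 + \tfrac{1}{3}(60 - 70)^2 + \tfrac{1}{3}(80 - 70)^2 = \tfrac{200}{3} \approx 66.7$$

To obtain the full Var(X), we would also need the within-section variances, whose average is the first term, E[Var(X | Y)].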

There's a whole discussion about this quantity and what it represents in a Mathematics Stack Exchange thread, which I found very instructive, especially this answer, which provides a visual explanation of the Law of Total Variance. I won't go over it here because we've already covered a lot, but if you haven't fully grasped the LTV yet, I strongly recommend you read it.

Takeaways

The Law of Total Variance is a powerful tool for calculating variances through conditioning. Moreover, it can provide an easier path to the task at hand, since probabilistic models are often simpler to specify in conditional universes.

The main takeaway is that the overall variance of a random variable X can be evaluated as the sum of the within-sample and between-sample variances of X sampled on another random variable Y.
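If you'd like to convince yourself numerically, here is a minimal Monte Carlo sketch in Python; the three-group setup and the particular means and standard deviations are hypothetical, chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Y picks one of 3 groups with equal probability;
# given Y = y, X is normal with that group's mean and standard deviation.
means = np.array([70.0, 60.0, 80.0])  # E[X | Y = y]
stds = np.array([5.0, 10.0, 8.0])     # sd(X | Y = y), chosen arbitrarily
p = np.array([1/3, 1/3, 1/3])         # P(Y = y)

n = 1_000_000
y = rng.choice(3, size=n, p=p)        # draw Y from its own distribution
x = rng.normal(means[y], stds[y])     # draw X conditioned on Y = y

# Left-hand side: the overall variance of X, estimated from the sample.
lhs = x.var()

# Right-hand side: E[Var(X | Y)] + Var(E[X | Y]), computed exactly.
within = np.sum(p * stds**2)                     # E[Var(X | Y)]
overall_mean = np.sum(p * means)                 # E[X] by the LIE
between = np.sum(p * (means - overall_mean)**2)  # Var(E[X | Y])
rhs = within + between

print(f"sample Var(X)             ~ {lhs:.2f}")
print(f"E[Var(X|Y)] + Var(E[X|Y]) = {rhs:.2f}")  # 63.00 + 66.67 = 129.67
```

Up to Monte Carlo noise, both printed numbers should agree (here, around 129.67).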

Lastly, I invite you to try solving some problems so that you can grasp the underlying procedure behind evaluating variances with the LTV. I recommend you try this problem, or at least watch it, to gain practical insight into the LTV.
