Bias and Variance — Cut Through the Noise

A mathematical look into where Bias and Variance come from

Surya Kocherlakota
appliedai.de
8 min read · Aug 8, 2019


Bias and Variance are amongst the more misunderstood concepts in ML, as they’re usually described using superficial explanations of under-fitting and over-fitting. In this post, we lay down the statistical groundwork to understand where they come from. The maths is thoroughly explained, so you won’t need to be an expert in Statistics to understand it.

That said, you should be familiar with the basic concepts of a probability distribution and its Expected Value (average). We’ll also assume you’re familiar with the ML ideas of regression and supervised learning. So let’s start with the problem setup:

Supervised Learning Setup

Input and Output Vectors

x is a random input vector of any dimension — e.g. the features of a house (no. of bedrooms, size of living room, etc).

y is a random output vector of any dimension — e.g. the house price. In ML, your aim is to predict y given an x.

Joint Distribution

P(x, y) is the joint probability distribution from which x, y pairs are drawn. This is the key assumption: there is some process that generates all your data, and it is represented by a probability distribution. Note that there is a different probability distribution over y for every given x, i.e. for a given x there are many possible outputs y.

In other words, for a given set of house features (same size, same no. of bedrooms, etc), there are many possible prices, each with a probability. This is the conditional distribution P(y|x).

Below is an example probability density function for P(x, y): x is just 1 feature, the house size, and y is the house price. For each feature you add to x, you need an extra dimension to visualise the graph, which is why we’re sticking to just 1 feature for now. In reality you’ll have a lot more information about a house than just its size.

The black line shows the shape of the conditional distribution P(y|x=100); conditioning on a different value of x gives a differently shaped slice. When you give your predictor a value x, it tries to predict the expectation of that conditional distribution. Note that a true conditional distribution has to integrate to 1 (the area under the black line should be 1). That’s not quite the case here, as the slice has a smaller area for less likely values of x. To get the true conditional, we’d need to normalise by dividing the slice by P(x). For now, we’re just looking at the shape given by the black line.

Dataset

Random variable D representing the dataset, consisting of N x, y pairs. D is obtained by sampling P(x, y) N times.

Model

h is a function/model which outputs a prediction ŷ upon taking an input x.

Trained Model and Learner

A is a learner function, which takes an input dataset D and outputs a trained model h_D. Think of A as some scikit-learn code that describes the parameters of a model you’d like to build.

h_D is then the trained model you get by calling .fit() on that code with your dataset D. Like any h, it outputs a prediction ŷ when given an input x.

It is important to make this distinction between the learner and the model, as the only thing you really control when building ML models is the learner function A (i.e. your sklearn code). So it’s in your interests that A has good properties such that it outputs models that work well.
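To make the learner/model distinction concrete, here is a minimal sketch in plain Python. The joint distribution inside `sample_dataset` and the names `sample_dataset` and `learner_A` are made up for illustration; only the roles of A, D and h_D come from the setup above.

```python
import random

def sample_dataset(n, rng):
    """Draw n (x, y) pairs from a made-up joint distribution P(x, y)."""
    data = []
    for _ in range(n):
        x = rng.uniform(50.0, 200.0)              # house size (assumed range)
        y = 3000.0 * x + rng.gauss(0.0, 20000.0)  # price = trend + noise (assumed)
        data.append((x, y))
    return data

def learner_A(dataset):
    """Learner A: takes a dataset D and returns a trained model h_D.

    Here A is an ordinary least-squares straight line, chosen only
    because it is short to write down.
    """
    n = len(dataset)
    mean_x = sum(x for x, _ in dataset) / n
    mean_y = sum(y for _, y in dataset) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in dataset)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in dataset)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def h_D(x):
        # The trained model: maps an input x to a prediction y-hat.
        return intercept + slope * x

    return h_D

rng = random.Random(0)
D1 = sample_dataset(30, rng)  # one draw of the random variable D
D2 = sample_dataset(30, rng)  # another draw, same size N
h_D1 = learner_A(D1)
h_D2 = learner_A(D2)

# Same learner, different datasets, different trained models.
print(h_D1(100.0), h_D2(100.0))
```

The same `learner_A` produces a different `h_D` for each dataset it sees, which is why the properties worth reasoning about belong to the learner.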

Test Error

This is the typical least squares error metric you’ll have seen. It’s telling you to take the square of the difference between the prediction of your model, h_D(x), and the actual label y for that x. The aim of your model h_D is to minimise this error across all possible x, y pairs that you could sample from P(x, y).
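Written out with the symbols defined above, the squared error for a single sample would read as follows. The notation is a reconstruction from the prose description, so treat its exact form as an assumption:

```latex
\text{Test Error}(h_D, x, y) = \bigl(h_D(x) - y\bigr)^2
```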

Let’s quickly review this, to see that these quantities are in fact random variables themselves. We started with random variables x and y, which are distributed according to P. As such, D, the result of drawing N times from P, is also a random variable. Then, you apply a learner function A to this random variable D to get another random variable: the model, h_D. Therefore, h_D must also have properties such as a mean and a variance.

Expectations of Random Variables

Note on Expected Values: Inside the expectation function (i.e. inside the square brackets), we have some function of random variables. We indicate next to the E, what distributions those random variables have.

Expected Label

Function that gives out the expectation/mean of y for an input x. This is the y you’d get on average, for that input x. It is the expected value of the conditional distribution curve you saw in the graph earlier.
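In symbols, and assuming y is continuous so that the expectation is an integral over the conditional distribution P(y|x), the expected label can be sketched as:

```latex
\bar{y}(x) = \mathbb{E}_{y \mid x}\,[\,y\,] = \int_{y} y \, P(y \mid x) \, dy
```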

In regression, where you predict some ŷ for an input x, your model cannot do any better (in terms of squared error) than simply predicting the expected label, ȳ(x). Therefore, your aim when developing a model h is to get it to mimic the outputs of ȳ as closely as possible. Of course, in ML you don’t actually have access to this function ȳ, so you try to estimate it.

Another way to think of ȳ(x) is as a collapsed version of P(x, y): instead of giving a distribution over y for any given x, it just gives the expected value of y for that x.

Think of ȳ as a ‘true model’ that your ML model tries to mimic.

Expected Test Error of a Trained Model

This tells us the Test Error you can expect from a regressor h_D on a new sample (x, y) drawn from P. Theoretically, this is the sum of errors of every possible (x, y) sample that can be drawn from P, weighted by the probability of that sample being drawn.
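As a sketch in the notation used so far (assuming continuous x and y, so the weighted sum becomes an integral), the expected test error of one trained model h_D is:

```latex
\mathbb{E}_{(x,y) \sim P}\Bigl[\bigl(h_D(x) - y\bigr)^2\Bigr]
  = \int_{x} \int_{y} \bigl(h_D(x) - y\bigr)^2 \, P(x, y) \, dy \, dx
```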

Practically, you cannot draw infinitely many samples from P, so instead we approximate this ETE as the Mean Squared Error of the model over a finite number of test samples. Be aware that you could have a model with high ETE (i.e. a bad model), but a low MSE, if you get lucky with the samples you drew from P to calculate the MSE.
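A small simulation can illustrate how the MSE on a few test samples scatters around the ETE. Everything specific here is an assumption made for the example: the joint distribution in `sample_pairs`, the fixed model `h`, and the sample sizes.

```python
import random

NOISE_STD = 20000.0  # noise level of the made-up P(x, y)

def sample_pairs(n, rng):
    """Draw n test samples from a made-up P(x, y): y = 3000 * x + noise."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(50.0, 200.0)
        pairs.append((x, 3000.0 * x + rng.gauss(0.0, NOISE_STD)))
    return pairs

def h(x):
    """A fixed, already-trained model (hypothetical coefficients)."""
    return 2800.0 * x + 10000.0

def mse(model, pairs):
    """Mean squared error of the model over a finite test set."""
    return sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs)

rng = random.Random(1)

# MSE on tiny test sets: each one is a noisy estimate of the ETE.
small_mses = [mse(h, sample_pairs(10, rng)) for _ in range(5)]

# MSE on a very large test set: close to the ETE, which for this
# made-up setup works out analytically to 3e8 + 4e8 = 7e8.
large_mse = mse(h, sample_pairs(200_000, rng))

for value in small_mses:
    print(f"MSE over 10 samples:      {value:.3e}")
print(f"MSE over 200000 samples:  {large_mse:.3e}")
```

The small-sample values can land well above or below the large-sample value, which is the "getting lucky" effect described above.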

Expected Test Error of a Learner

Very similar to the previous ETE, except we also average over all possible datasets the model could have been trained on. Note that we fix the size of the datasets to N samples.

So this is the test error you get by averaging the errors of all possible x, y samples drawn from P, and all possible datasets D drawn from Pᴺ.
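In the same notation, and writing P(D) for the probability of drawing a particular dataset D of size N, the expected test error of the learner A can be sketched as:

```latex
\mathbb{E}_{(x,y) \sim P,\; D \sim P^N}\Bigl[\bigl(h_D(x) - y\bigr)^2\Bigr]
  = \int_{D} \int_{x} \int_{y} \bigl(h_D(x) - y\bigr)^2 \, P(x, y) \, P(D) \, dy \, dx \, dD
```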

Expected Regressor

Remember that h_D is a random variable, so for a given learner A there is a distribution over the models it produces. The probability associated with each trained model h_D is the probability of having drawn its dataset D in the first place (which is the product of the probabilities of all the samples it contains, since they are drawn independently). And as with any distribution, there is an expected value to it, which we denote as ħ. You can find this expected regressor by averaging all possible trained models for a learner, weighted by those probabilities.
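As a formula, with ħ written as \bar{h} and P(D) denoting the probability of drawing dataset D, the expected regressor can be sketched as:

```latex
\bar{h} = \mathbb{E}_{D \sim P^N}\,[\,h_D\,] = \int_{D} h_D \, P(D) \, dD
```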

Realising that there is a distribution of trained models for a learner is the key point to understanding Variance in ML. It’s simply the variance of that distribution.

We have so far seen an intuition for bias and variance, by treating the model we train as a random variable (because it’s a function of our dataset, which is a random variable).

Bias is how much our expected model differs from the true model ȳ(x). Variance is how much the various possible trained models you train vary around the expected model.

Let’s first peek at the final equation and unpack the precise meaning of the terms Bias, Variance and Noise.
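In the notation built up so far (ħ written as \bar{h}, with x, y drawn from P and D drawn from Pᴺ), the standard form of the decomposition that the three terms describe is:

```latex
\underbrace{\mathbb{E}_{x,y,D}\Bigl[\bigl(h_D(x) - y\bigr)^2\Bigr]}_{\text{Expected Test Error}}
  = \underbrace{\mathbb{E}_{x,D}\Bigl[\bigl(h_D(x) - \bar{h}(x)\bigr)^2\Bigr]}_{\text{Variance}}
  + \underbrace{\mathbb{E}_{x}\Bigl[\bigl(\bar{h}(x) - \bar{y}(x)\bigr)^2\Bigr]}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_{x,y}\Bigl[\bigl(\bar{y}(x) - y\bigr)^2\Bigr]}_{\text{Noise}}
```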

Variance: On average, how much does the output of each trained model vary from the output of the average of all models for that learner?

Bias²: On average, how much does the output of the average model for the learner vary from the expected true output for its input?

Noise: On average, how much does the expected output for an input vary from its true output?

There are a lot of averages and expected values floating around here, so a good way to digest this is to carefully look at this decomposition equation and think about what each of the terms means, referring back to where the terms are introduced if needed. Remember, Bias and Variance are always properties of a Learner, not the trained model itself.

Derivation

All we’ve done so far is take the ETE of a learner A, then add and subtract ħ(x) inside the expectation. Of course, this is equivalent to just adding zero, but we need this for later steps. Also note that we’ve shortened the notation next to the E a little bit; it should be (x, y)~P, D~Pᴺ as before.
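The step just described, written out with ħ as \bar{h} and the shortened subscripts (a reconstruction from the description, so the exact layout is an assumption):

```latex
\mathbb{E}_{x,y,D}\Bigl[\bigl(h_D(x) - y\bigr)^2\Bigr]
  = \mathbb{E}_{x,y,D}\Bigl[\Bigl(\bigl(h_D(x) - \bar{h}(x)\bigr) + \bigl(\bar{h}(x) - y\bigr)\Bigr)^2\Bigr]
```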

Now we’ve used (a+b)² = a² + b² + 2ab to expand the square inside the expectation. But it turns out the 2ab term is in fact zero. Quick proof for that:
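A sketch of the expansion and of the argument for the vanishing cross term, with a = h_D(x) - \bar{h}(x) and b = \bar{h}(x) - y. It relies on the test sample (x, y) being drawn independently of the training set D, so that b does not depend on D and can be pulled out of the expectation over D:

```latex
\begin{aligned}
\mathbb{E}_{x,y,D}\Bigl[\bigl(h_D(x) - y\bigr)^2\Bigr]
  &= \mathbb{E}_{x,D}\Bigl[\bigl(h_D(x) - \bar{h}(x)\bigr)^2\Bigr]
   + \mathbb{E}_{x,y}\Bigl[\bigl(\bar{h}(x) - y\bigr)^2\Bigr]
   + 2\,\mathbb{E}_{x,y,D}\Bigl[\bigl(h_D(x) - \bar{h}(x)\bigr)\bigl(\bar{h}(x) - y\bigr)\Bigr] \\[6pt]
\mathbb{E}_{x,y,D}\Bigl[\bigl(h_D(x) - \bar{h}(x)\bigr)\bigl(\bar{h}(x) - y\bigr)\Bigr]
  &= \mathbb{E}_{x,y}\Bigl[\mathbb{E}_{D}\bigl[h_D(x) - \bar{h}(x)\bigr]\,\bigl(\bar{h}(x) - y\bigr)\Bigr] \\
  &= \mathbb{E}_{x,y}\Bigl[\bigl(\mathbb{E}_{D}[h_D(x)] - \bar{h}(x)\bigr)\bigl(\bar{h}(x) - y\bigr)\Bigr] \\
  &= \mathbb{E}_{x,y}\Bigl[\bigl(\bar{h}(x) - \bar{h}(x)\bigr)\bigl(\bar{h}(x) - y\bigr)\Bigr] = 0
\end{aligned}
```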

Let’s now expand out the square in the second term, the expectation of (ħ(x) − y)², using the same tricks: add and subtract ȳ(x), then expand. It once again turns out the last term is zero (Try proving this yourself!):
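The expansion of that second term, sketched with ħ written as \bar{h}; the proof that the final cross term vanishes is left as the exercise the text suggests:

```latex
\begin{aligned}
\mathbb{E}_{x,y}\Bigl[\bigl(\bar{h}(x) - y\bigr)^2\Bigr]
  &= \mathbb{E}_{x,y}\Bigl[\Bigl(\bigl(\bar{h}(x) - \bar{y}(x)\bigr) + \bigl(\bar{y}(x) - y\bigr)\Bigr)^2\Bigr] \\
  &= \mathbb{E}_{x}\Bigl[\bigl(\bar{h}(x) - \bar{y}(x)\bigr)^2\Bigr]
   + \mathbb{E}_{x,y}\Bigl[\bigl(\bar{y}(x) - y\bigr)^2\Bigr]
   + 2\,\mathbb{E}_{x,y}\Bigl[\bigl(\bar{h}(x) - \bar{y}(x)\bigr)\bigl(\bar{y}(x) - y\bigr)\Bigr]
\end{aligned}
```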

So we end up with this final decomposition of the Expected Test Error of learner A:

In Words

Variance: On average, how much does the output of each trained model vary from the output of the average of all models for that learner?

Bias²: On average, how much does the output of the average model for the learner vary from the expected true output for its input?

Noise: On average, how much does the expected output for an input vary from its true output?

Intuitively, if A has high variance, it shows that the predictions of the trained h_D vary quite a lot based on which dataset D it is trained on. And this is the exact thing we’d like to avoid with our models, since it shows that our learner A is too sensitive to its training data. This is overfitting.

And if A has high bias, it shows that the predictions on average are just wrong. This usually indicates that the model is not expressive enough to model our data. This is underfitting.
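The decomposition can also be checked numerically. The sketch below is only an illustration under invented assumptions: a made-up P(x, y) with a sine-shaped expected label (`true_mean`), a straight-line learner (`fit_line`), and Monte Carlo averages standing in for the true expectations. Because we invented P ourselves, ȳ(x) is known here, which is never the case with real data.

```python
import math
import random

def true_mean(x):
    """y-bar(x): known only because P(x, y) is invented for this example."""
    return math.sin(2.0 * math.pi * x)

def sample_pairs(n, rng, noise_std=0.3):
    """Draw n (x, y) pairs from the made-up joint distribution."""
    pairs = []
    for _ in range(n):
        x = rng.random()
        pairs.append((x, true_mean(x) + rng.gauss(0.0, noise_std)))
    return pairs

def fit_line(dataset):
    """Learner A: least-squares straight line; returns the trained h_D."""
    n = len(dataset)
    mx = sum(x for x, _ in dataset) / n
    my = sum(y for _, y in dataset) / n
    sxx = sum((x - mx) ** 2 for x, _ in dataset)
    sxy = sum((x - mx) * (y - my) for x, y in dataset)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

rng = random.Random(42)
N, n_datasets, n_test = 20, 200, 2000

test_pairs = sample_pairs(n_test, rng)
# One trained model per independently drawn dataset of size N.
models = [fit_line(sample_pairs(N, rng)) for _ in range(n_datasets)]

total = variance = bias_sq = noise = 0.0
for x, y in test_pairs:
    preds = [h(x) for h in models]
    h_bar = sum(preds) / n_datasets          # estimate of the expected regressor at x
    total += sum((p - y) ** 2 for p in preds) / n_datasets
    variance += sum((p - h_bar) ** 2 for p in preds) / n_datasets
    bias_sq += (h_bar - true_mean(x)) ** 2
    noise += (true_mean(x) - y) ** 2

total, variance, bias_sq, noise = (
    v / n_test for v in (total, variance, bias_sq, noise)
)

print(f"expected test error : {total:.4f}")
print(f"variance            : {variance:.4f}")
print(f"bias^2              : {bias_sq:.4f}")
print(f"noise               : {noise:.4f}")
print(f"sum of three terms  : {variance + bias_sq + noise:.4f}")
```

A straight line cannot follow a sine curve, so in this setup the bias term dominates the variance term. The two sides agree only approximately, since every expectation is estimated from a finite number of samples.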

Footnotes ††

How we represent Noise in the system

You may have seen a Gaussian distributed Noise term in other derivations of Bias/Variance (such as wikipedia). The joint distribution P(x, y) assumption that we use in fact encapsulates the noise. ‘Noise’ tries to capture the uncertainty in our system: it tells us that given an input x, you cannot be certain about the output. A joint distribution is an effective way of treating this uncertainty, since each x will have a distribution over y.

The other viewpoint uses a slightly different setup, where a deterministic function f(x) outputs y for an input x. But to indicate that each x does not always correspond to the same y, a Gaussian distributed noise term ε is added, giving y = f(x) + ε. This leads to a very similar decomposition of bias and variance. FYI: the usual justification for assuming ε is Gaussian is the Central Limit Theorem. If the noise is the sum of many small independent effects, that sum is approximately Gaussian.

How size of your dataset affects variance

If your dataset is really, really big, then your learner is unlikely to have a large variance. Conversely, training on a tiny dataset will result in high variance even if the model is pretty good. This is reflected in the expectation equation, where we explicitly state that D~Pᴺ: variance is always measured over datasets of a fixed size N.
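A quick simulation can illustrate the effect of N. As a sketch under made-up assumptions (a linear P(x, y) with Gaussian noise, a straight-line learner called `fit_line`, and the spread of predictions at a single input `x0` used as a stand-in for the variance term):

```python
import random
import statistics

def sample_pairs(n, rng):
    """Draw n (x, y) pairs from a made-up P(x, y): y = 2 * x + noise."""
    pairs = []
    for _ in range(n):
        x = rng.random()
        pairs.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))
    return pairs

def fit_line(dataset):
    """Learner A: least-squares straight line; returns the trained h_D."""
    n = len(dataset)
    mx = sum(x for x, _ in dataset) / n
    my = sum(y for _, y in dataset) / n
    sxx = sum((x - mx) ** 2 for x, _ in dataset)
    sxy = sum((x - mx) * (y - my) for x, y in dataset)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def prediction_variance(n, rng, x0=0.5, n_datasets=200):
    """Variance of h_D(x0) across models trained on datasets of size n."""
    preds = [fit_line(sample_pairs(n, rng))(x0) for _ in range(n_datasets)]
    return statistics.pvariance(preds)

rng = random.Random(7)
var_small = prediction_variance(10, rng)    # tiny datasets
var_large = prediction_variance(1000, rng)  # large datasets

print(f"variance at x0 with N=10:   {var_small:.6f}")
print(f"variance at x0 with N=1000: {var_large:.6f}")
```

The learner is identical in both runs; only the dataset size N changes, and the spread of its trained models shrinks as N grows.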

Thanks to CS4780@Cornell

The derivation and intuitions laid out in this post roughly follow this lecture taught by Kilian Weinberger. I’d also highly recommend checking out the rest of his course; it’s one of the best-explained ML courses online.

Joint Probability Distribution, showing slices conditioned at x=100 and x=128.

Thanks to Sebastian Wagner.
