Semih Gülüm
Deeper Deep Learning TR
12 min read · Mar 16, 2021


An In-Depth Look at Bias and Variance

This article was written within the scope of the studies we carried out at AdresGezgini R&D Center.

Although it may seem a little daunting, let’s start with the formulation as a first look at the subject. Then, instead of memorizing what is the way it is, let’s learn the logic behind it, so that we can say not only “yes” but also “and here’s why.” First, let’s define the concepts of variance and bias.

The error rate of the model on the training set is the bias of the model. How much worse the model performs on the test dataset than on the training dataset is the variance of that model. In other words, variance is the error caused by sensitivity to changes in the training set.

Before moving on to the formulas, let’s recall a small rule: the famous Bayes rule.

NOTE: Let’s try to explain the logic of Bayes’ rule in a small note, thinking in terms of sets. As we know, P(Y|X) means finding Y given X. As we can see from the picture below, being given X means taking all of X as our space and finding Y from there (which will be the intersection of the two sets). The same is true for P(X|Y). (This is because, given X, the space that covers all possibilities becomes X; we are only dealing with the space X.)

But something should have caught our attention here: X∩Y and Y∩X, which appear swapped in the numerators, are actually the same thing! With this hint, let’s write the equation we have in terms of it and rearrange to get P(Y|X):

And in this way, we obtained the formula at the beginning of the article by providing the Bayes Rule.
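Spelled out in symbols, the argument above is:

```latex
P(Y\mid X) = \frac{P(Y \cap X)}{P(X)}, \qquad
P(X\mid Y) = \frac{P(X \cap Y)}{P(Y)}
\;\Rightarrow\; P(X \cap Y) = P(X\mid Y)\,P(Y)
% Since Y ∩ X = X ∩ Y, substituting into the first equation gives Bayes' rule:
P(Y\mid X) = \frac{P(X\mid Y)\,P(Y)}{P(X)}
```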

Let’s get back to our topic. We are given a data set D={(x1,y1),…,(xn,yn)} drawn from the distribution P(X,Y). Note that there may not be a unique label y for a given input x in our dataset. For example, if the vector x describes a house’s features (e.g. bedrooms, square meters) and y describes its price, we can expect two houses with the same description to sell for different prices. In other words, both houses can have 2 bedrooms and an area of 100 m², but one of them is a house with a garden, security and a pool, while the other is a basement flat. These houses have the same input value according to our features x, but their prices will be very different from each other. So for any feature vector x there is a distribution over the possible labels. Now let’s make some important definitions.

The function E[·] denotes expectation.

NOTE: The integral expresses the total change, or “accumulated amount of change,” over a given interval. So by accumulating over y, weighted by its probability, we obtain the expected label.

Here by expected label ȳ(x), we mean the label you expect to get given the feature vector x.
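In symbols, this definition of the expected label is:

```latex
\bar{y}(x) \;=\; \mathbb{E}_{y\mid x}[\,y\,] \;=\; \int_{y} y \,\Pr(y \mid x)\, dy
```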

Now we draw our training set D, which has n points, from the distribution P. Let A be a function representing the algorithm, that is, the model. When we feed the data set to this algorithm, i.e. compute A(D), let’s call the result h_D.

Giving the training data to the model and displaying h_D

Let h_D take x as input and return y as output. We can compute the generalization error (measured as squared loss) of a given h_D, learned on dataset D with algorithm A, as follows:

NOTE: Here (x, y) are our real data points, and “(x,y)~P” means the expectation is taken over pairs (x, y) drawn from P.
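Written out (denoting the error of h_D by ε(h_D), a symbol chosen here for convenience), the generalization error under the squared loss is:

```latex
\epsilon(h_D) \;=\; \mathbb{E}_{(x,y)\sim P}\!\left[\big(h_D(x) - y\big)^2\right]
\;=\; \iint \big(h_D(x) - y\big)^2 \,\Pr(x,y)\, dx\, dy
```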

Here ĥ is the expected classifier: a weighted average of the hypotheses h_D, where each training set D is weighted by how likely it is to be drawn.

Here D is our set of training points and the (x, y) pairs are test points. Notice that the left side of the equation is just the squared error we always use. We have now reached the Expected Test Error. We are interested in this expression precisely because it evaluates the quality of algorithm A against a data distribution P(X, Y). This is where we see the real consequences of the algorithm we select or the model we have just built; in other words, this is where we find out what regime we are in, given our dataset drawn from the distribution P. So far we have built the expression up by combining terms; from here on we will decompose it and try to simplify it into the other equations we have. But first, since we have many unknowns, and in order not to confuse them, let’s list what the variables represent so that we can refer to them easily:

A little forward-looking reminder of what the variables represent
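Restating the definitions above in one place (the notation follows the text):

```latex
\bar{y}(x) = \mathbb{E}_{y\mid x}[y] \quad\text{(expected label)}
h_D = A(D) \quad\text{(hypothesis learned from training set } D\text{)}
\hat{h} = \mathbb{E}_{D\sim P^n}[h_D] \quad\text{(expected classifier)}
\mathbb{E}_{D\sim P^n}\,\mathbb{E}_{(x,y)\sim P}\big[(h_D(x)-y)^2\big] \quad\text{(expected test error)}
```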

As a first step, let’s add ±ĥ(x) inside our expectation. This gives us a different form of the same equality that we can manipulate freely:

Now let’s expand the square, but first let’s call the first bracket a and the second bracket b, so the expansion a²+2ab+b² appears:

If we look carefully at the middle term, 2ab, we can see that it is actually equal to 0. Let’s take a closer look at this term:

NOTE: Since x and y are the coordinates of a data point, they depend on each other but not on D.

Here ĥ(x) behaves like a constant with respect to D, so we can take it outside of E_D:

Now we are at an important point. Observe that the expectation of h_D over training sets is exactly ĥ (the same as the “expected classifier” formula given above):
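In symbols, the middle term vanishes exactly as described:

```latex
\mathbb{E}_{x,y}\,\mathbb{E}_{D}\Big[2\big(h_D(x)-\hat{h}(x)\big)\big(\hat{h}(x)-y\big)\Big]
= \mathbb{E}_{x,y}\Big[2\big(\mathbb{E}_{D}[h_D(x)]-\hat{h}(x)\big)\big(\hat{h}(x)-y\big)\Big]
= \mathbb{E}_{x,y}\Big[2\big(\hat{h}(x)-\hat{h}(x)\big)\big(\hat{h}(x)-y\big)\Big] = 0
```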

Now we only have a²+b² left. Let’s rewrite our equation:

If we look carefully at the part we denoted a² above, it is the expected squared difference between a random prediction and the mean prediction. So we obtain the VARIANCE. This is the variance of our prediction.

Now let’s apply similar operations to the second part of the equation, which we called b², where the label is subtracted from the average prediction, by inserting ±ȳ(x):

Let’s again call the brackets a and b. Our result will again be a²+2ab+b².

Again, let’s try to see that the 2ab term cancels out. For that, we’ll have to look closely at that part of the equation:

ȳ(x) is a constant with respect to this expectation, because it depends only on x. (Don’t be confused that ȳ is a function; for a fixed x it is just a number.)

Now our original formula is a²+b² again. Let’s rewrite that formula:
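In symbols, the cross term again vanishes (since E over y given x of y is exactly ȳ(x)), and the two remaining terms are the squared bias and the noise:

```latex
\mathbb{E}_{x}\,\mathbb{E}_{y\mid x}\Big[2\big(\hat{h}(x)-\bar{y}(x)\big)\big(\bar{y}(x)-y\big)\Big]
= \mathbb{E}_{x}\Big[2\big(\hat{h}(x)-\bar{y}(x)\big)\big(\bar{y}(x)-\mathbb{E}_{y\mid x}[y]\big)\Big] = 0
% hence
\mathbb{E}_{x,y}\big[(\hat{h}(x)-y)^2\big]
= \underbrace{\mathbb{E}_{x}\big[(\hat{h}(x)-\bar{y}(x))^2\big]}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_{x,y}\big[(\bar{y}(x)-y)^2\big]}_{\text{Noise}}
```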

NOTE: Because of noise, even if the model makes a “100% correct” prediction, an error value will remain, since the data points do not always sit at their expected labels. So how can we reduce noise? We can clean our data or add new features x. (The logic of adding a new feature: estimating the price of a house only from the number of bedrooms or the square meters is not enough, because houses with the same number of rooms can have very different prices. With new features, these data points can be separated from each other.)

NOTE: Looking at the bias formula, it can be seen that the bias does not depend on the training data set, but only on x.

Let’s finalize the formula and collect the terms:
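Collecting the three terms, the full decomposition reads:

```latex
\underbrace{\mathbb{E}_{D}\,\mathbb{E}_{(x,y)\sim P}\big[(h_D(x)-y)^2\big]}_{\text{Expected test error}}
= \underbrace{\mathbb{E}_{x,D}\big[(h_D(x)-\hat{h}(x))^2\big]}_{\text{Variance}}
+ \underbrace{\mathbb{E}_{x}\big[(\hat{h}(x)-\bar{y}(x))^2\big]}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_{x,y}\big[(\bar{y}(x)-y)^2\big]}_{\text{Noise}}
```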

Now we are done with the formula part. Let’s try to approach the concepts of overfitting and underfitting from a different perspective, with this formula in hand, by considering the verbal side of the subject.

Let’s try to understand the relationship between bias and variance and overfit-underfit:

Goldilocks Zone means that something is exactly where it should be.

Two quantities that measure the performance of your model are bias and variance. Let’s take a real-life example of bias. An underfit model believes its overly simple picture is good enough to classify our data, which may not be true. Imagine being shown a picture of a cat 1000 times. Now, blindfolded, whatever is shown to you the 1001st time, you are very likely to say it is a cat (high bias). We have oversimplified our assumptions.

If a classification model is underperforming, i.e. testing or training error is too high, there are several ways to improve its performance. To find out which of these many techniques is right for the situation, the first step is to identify the root of the problem.

Cornell University, Computer Science, Lecture 12

The graph above shows the training and the test errors and can be divided into two situations. In the first case (left of the graph), the training error is below the desired error threshold, but the test error is significantly higher. In the second case (right of the graph), the test error is very close to the training error, but both are above the desired tolerance of ϵ.

So how do we understand high variance or high bias? What are their “symptoms”?

Symptoms of High Variance:

  • The training error is much smaller than the test error.
  • The training error is below the ϵ tolerance.
  • The test error is above the ϵ tolerance.

→ Ways to Avoid High Variance:

  • Adding more training data: As long as we have access to more data and enough computational power to process the data, it is the simplest and most reliable way to handle variance.
  • Applying L1-L2 regularization or dropout: Reducing the variance in this way increases the bias.
  • Adding early stopping: stopping gradient descent early decreases the variance but increases the bias. This is because the model is trained less than before and is closer to underfitting.
  • Reducing the complexity of the model: The complexity of the model increases its tendency to overfit. So lowering it might be an option.
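The regularization bullet above can be sketched numerically. This is a minimal illustration, not the article’s own experiment: it assumes a toy sin(2πx) target with Gaussian noise and a hand-rolled ridge (L2) penalty lam on polynomial coefficients. Stronger regularization visibly shrinks the spread of the predictions h_D(x) across training sets:

```python
import numpy as np

rng = np.random.default_rng(1)
DEGREE, N_TRAIN, NOISE_STD = 5, 15, 0.3
x_test = np.linspace(0, 1, 30)
X_test = np.vander(x_test, DEGREE + 1)

def fit_ridge(x, y, lam):
    # Least squares with an L2 penalty lam on the polynomial coefficients
    X = np.vander(x, DEGREE + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(DEGREE + 1), X.T @ y)

def prediction_variance(lam, n_sets=300):
    # Train on many independent datasets D and measure how much h_D(x) varies
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(0, 1, N_TRAIN)
        y = np.sin(2 * np.pi * x) + rng.normal(0, NOISE_STD, N_TRAIN)
        preds[i] = X_test @ fit_ridge(x, y, lam)
    return ((preds - preds.mean(axis=0)) ** 2).mean()

v_weak = prediction_variance(1e-8)   # almost no regularization
v_strong = prediction_variance(1.0)  # strong regularization
print(f"prediction variance, weak regularization:   {v_weak:.4f}")
print(f"prediction variance, strong regularization: {v_strong:.4f}")
```

In typical runs the strongly regularized model varies far less between training sets, at the cost of extra bias from the shrunken coefficients.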

Symptoms of High Bias:

  • The training error is above the ϵ tolerance.

→ Ways to Avoid High Bias:

  • Increasing the size of the model: For example, we can increase the number of neurons or layers in the model. Thus, the training set will be better fitted.
  • Reducing L1-L2 regularization or dropout: reducing the bias in this way increases the variance.

We see that, basically, dealing with bias and variance is about dealing with overfitting and underfitting.

High Variance — High Bias: the model is inconsistent and wrong on average
Low Variance — High Bias: the model is consistent but wrong on average
High Variance — Low Bias: roughly correct on average but inconsistent
Low Variance — Low Bias: the ideal scenario; the model is consistent and accurate on average.

To give a real-life example of this table, let’s say we are throwing darts at a dartboard. If I’m a very sharp shooter (low variance) and hit the center or very close to it, that is low bias-low variance. If I haven’t lost any of my marksmanship but forgot my glasses at home and I’m nearsighted, I will keep my steady hand and group my throws near the same point, just not the center: that is high bias-low variance. As we can see, high bias means that my average distance from my target (here, the center) is large. In another scenario I have my glasses, but I can’t be called a very good marksman: my throws land a little to the right, a little to the left, a little up and a little down, yet on average they are around the center I am aiming for. This is low bias-high variance. In the last scenario I am a bad shot and I forgot my glasses, so under the effect of myopia I throw as if aiming at a different place than I intended. There is no order in my throws and I am far from my target: high bias-high variance.

NOTE: Finding the right balance between the bias and the variance of the model is called the Bias-Variance Trade-Off.

→ Overfit models may also have low variance in some cases, so this is not a law of nature.

As more and more parameters are added to a model, its complexity increases and its bias decreases continuously. For example, the more polynomial terms we add to a linear regression, the more complex the resulting model becomes. As model complexity grows, the bias of the model decreases and the variance increases. Because of this relationship, we speak of a “trade-off”: if we push one to zero, the other becomes very large. However, this is not a weakness but rather a strength. If we recall the formulas of all three terms, each is a squared (hence non-negative) quantity, and one of them will tend to dominate the other two. Our aim is to find the balance between them.
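The whole decomposition can also be checked numerically. The sketch below is my own toy setup, not from the article: a sin(2πx) target, Gaussian noise of known standard deviation, and np.polyfit as the learning algorithm A. It estimates the variance, bias² and noise terms by retraining on many datasets, and compares their sum to the directly measured expected test error:

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD, N_TRAIN, N_SETS, DEGREE = 0.3, 20, 500, 3

def f(x):
    # Noise-free target: plays the role of the expected label ybar(x)
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50)

# Train the same algorithm A (np.polyfit) on many datasets D ~ P^n
preds = np.empty((N_SETS, x_test.size))
for i in range(N_SETS):
    x = rng.uniform(0, 1, N_TRAIN)
    y = f(x) + rng.normal(0, NOISE_STD, N_TRAIN)
    preds[i] = np.polyval(np.polyfit(x, y, DEGREE), x_test)  # h_D(x)

h_hat = preds.mean(axis=0)                   # expected classifier
variance = ((preds - h_hat) ** 2).mean()     # E[(h_D - hhat)^2]
bias_sq = ((h_hat - f(x_test)) ** 2).mean()  # E[(hhat - ybar)^2]
noise = NOISE_STD ** 2                       # E[(ybar - y)^2]

# Expected test error measured directly on fresh noisy labels
y_test = f(x_test) + rng.normal(0, NOISE_STD, (N_SETS, x_test.size))
total = ((preds - y_test) ** 2).mean()

print(f"variance={variance:.4f}  bias^2={bias_sq:.4f}  noise={noise:.4f}")
print(f"sum={variance + bias_sq + noise:.4f}  measured total={total:.4f}")
```

Up to Monte Carlo error, the sum of the three estimated terms matches the directly measured expected test error, which is exactly what the derivation above promises.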

It is a long article, but I think it is indispensable in terms of understanding the subject in detail. I hope you also agree with me. Stay well!

References

https://www.youtube.com/watch?v=bUI8ovd07uI

https://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12LinearRegression.pdf


Data Scientist at Accenture || Data Science MSc. Student at Sabanci University || Writing articles on Data Science & Deep Learning