Calculation of Bias & Variance in Python

Nallaperumal
Published in Analytics Vidhya
10 min read · Apr 3, 2021

Bias-Variance Decomposition Demystified

Source: Image by G T on Unsplash

In machine learning, the performance of any model can be characterized in terms of bias and variance.

In supervised machine learning an algorithm learns a model from training data.
Y=f(X) + E
The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X).
The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.
Alongside f(X) there is also an error term, denoted by “E”.

The prediction error of any machine learning algorithm can be broken down into three parts:

1. Model Bias
2. Model Variance
3. Irreducible Error

The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

To generalize,

Error (Model) = Bias² + Variance + Irreducible Error (for squared-error loss, the bias term enters as its square)

Image by Author
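To make the decomposition concrete, here is a minimal simulation sketch using only NumPy. Everything in it is an assumption made for illustration: the "true" function f(x) = sin(πx), the noise level sigma, and the choice of a straight-line model. It draws many training sets from Y = f(X) + E, fits the model on each one, and compares the measured squared error with the sum of squared bias, variance and irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" mapping function, chosen only for illustration
    return np.sin(np.pi * x)

sigma = 0.3                      # standard deviation of the noise term E
x_test = np.linspace(-1, 1, 50)  # fixed test inputs
n_rounds, n_train = 500, 30

preds = np.empty((n_rounds, x_test.size))
sq_err = np.empty((n_rounds, x_test.size))
for r in range(n_rounds):
    # Draw a fresh training set from Y = f(X) + E
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, sigma, n_train)
    # Fit a straight line (a deliberately simple, high-bias model)
    coefs = np.polyfit(x_tr, y_tr, deg=1)
    preds[r] = np.polyval(coefs, x_test)
    # Noisy test targets, used to measure the total error
    y_te = f(x_test) + rng.normal(0, sigma, x_test.size)
    sq_err[r] = (y_te - preds[r]) ** 2

total_error = sq_err.mean()
bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
variance = preds.var(axis=0).mean()
irreducible = sigma ** 2

print(f"total error               : {total_error:.4f}")
print(f"bias^2 + variance + noise : {bias_sq + variance + irreducible:.4f}")
```

The two printed numbers should agree up to simulation noise. In a real problem f and sigma are unknown, which is why the terms can only be estimated.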

First let us try to understand what these terms mean; then we will examine bias and variance with the help of a sample dataset to see how they are calculated in practice.

Before diving into the concept, let me illustrate it with a simple analogy.

Consider an example of a student going to attend a math exam.

Source: Photo by javier trueba on Unsplash

The task is to study the book, or the relevant chapters/modules, for the exam.

If the student only skims over the topics and chapters instead of actually studying them, he/she will not be able to perform well in the exam.

Consider another scenario in which the student memorizes all the contents and problems in the chapters instead of actually understanding the concepts, theories and their implications. In this case, if a question comes up that is not in the book but is based on the same concepts, the student will not be able to tackle it.

So in the above case, studying the book is the task (the model expectation). Skimming corresponds to underfitting and memorizing corresponds to overfitting.

Under fitting — Image by Author
Over fitting — Image by Author

What must be the ideal scenario for the model then?

Ideally, the model must be able to generalize, i.e., in the above case, we must be able to solve any problem by studying the concepts, theories and approach rather than by skimming or memorizing the problems.

The example above is just a layman's explanation.

Let us now try to understand the details with respect to an actual model.

Differentiating Bias & Variance

Irreducible Error:

In short,

Model Error = Reducible Error + Irreducible Error

Reducible error is the part that we can improve. It is the quantity that we reduce while the model is learning on a training dataset, and we try to get it as close to zero as possible.

Irreducible error is the part that we cannot control, as mentioned at the beginning. It can arise for various reasons, such as statistical noise in the data.

It is also a reminder that no model can be perfect.

Bias vs Variance Trade off

The bias and variance of a model are always connected.

We generally prefer models with low bias and low variance, but in practice achieving both is the greatest challenge. It can also be viewed as the goal of any machine learning problem.

Bias and variance tend to move in opposite directions: as a model becomes more flexible, its bias typically goes down while its variance goes up.

Informally, Bias ∝ 1/Variance (a rule of thumb, not an exact law)

The above relationship is referred to as the trade-off. It is helpful when choosing a model and its configuration.
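The trade-off can be made visible with a small sketch. The setup below is an assumption for illustration only: a synthetic target sin(πx), Gaussian noise, and polynomial fits of increasing degree standing in for "more flexible models". The helper name bias_variance is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Assumed true function, for illustration only
    return np.sin(np.pi * x)

def bias_variance(degree, n_rounds=300, n_train=30, sigma=0.3):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    x_test = np.linspace(-0.9, 0.9, 50)
    preds = np.empty((n_rounds, x_test.size))
    for r in range(n_rounds):
        # New noisy training set in every round
        x_tr = rng.uniform(-1, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x_tr, y_tr, deg=degree), x_test)
    bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 5)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}  variance={v:.4f}")
```

On this toy problem the squared bias shrinks and the variance grows as the degree goes up, which is the pattern the trade-off describes.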

How to calculate the bias-variance trade-off for any algorithm on a given dataset?

It is quite hard to calculate the actual bias and variance for a predictive modeling problem, because we do not know the true mapping function.

Even though the bias-variance trade-off is mostly a conceptual tool, we can actually estimate bias and variance in some cases.

The mlxtend library, developed by Dr. Sebastian Raschka, provides a function named bias_variance_decomp() that helps us estimate the bias and variance of various models over many bootstrap samples.

The mlxtend library has gained many capabilities recently, and the calculation of bias and variance is one of them.

We will look at an example of a regression model and of a classification model for the bias-variance trade-off.

Note that mlxtend now also supports estimating bias and variance for TensorFlow/Keras models; this is covered in this post as well.

Image credits : Wolfgang Rottman on Unsplash

For this, we must first install the mlxtend library with pip (pip install mlxtend).

For TensorFlow/Keras support, the expected versions of mlxtend and TensorFlow are:

  1. mlxtend v0.18.0 or greater
  2. TensorFlow v2.4.1 or greater

Overview of what we are going to see in this post.

Image by Author

Let us also try to understand the function used inside the library.

The same function is used for all the cases in the aforementioned diagram.

We will take a look at the function bias_variance_decomp:

bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss='0-1_loss', num_rounds=200, random_seed=None, **fit_params)

estimator : A classifier or regressor object or class implementing both a fit and a predict method, similar to the scikit-learn API.

X_train : expects an array, shape=(num_examples, num_features)

A training dataset for drawing the bootstrap samples to carry out the bias-variance decomposition.

y_train : expects an array, shape=(num_examples)

Targets (class labels, continuous values in case of regression) associated with the X_train examples.

X_test : expects an array, shape=(num_examples, num_features)

The test dataset for computing the average loss, bias, and variance.

y_test : expects an array, shape=(num_examples)

Targets (class labels, continuous values in case of regression) associated with the X_test examples.

loss : str (default='0-1_loss')

Loss function for performing the bias-variance decomposition. Currently allowed values are 'mse' [for regression] and '0-1_loss' [for classifiers].

num_rounds : int (default=200)

Number of bootstrap rounds for performing the bias-variance decomposition.

random_seed : int (default=None)

Random seed for the bootstrap sampling used for the bias-variance decomposition.

fit_params : additional parameters

Additional parameters to be passed to the .fit() function of the estimator when it is fit to the bootstrap samples (this is included in recent mlxtend versions).

Returns

avg_expected_loss, avg_bias, avg_var : returns the average expected loss, average bias, and average variance (all floats), where the average is computed over the data points in the test set.
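To get a feel for what the three returned numbers mean under the 'mse' loss, here is a NumPy-only sketch of a bootstrap decomposition. It is my own illustrative version, not mlxtend's source code: the helper names bootstrap_bias_variance_mse and linear_fit_predict are hypothetical, and the data is synthetic.

```python
import numpy as np

def bootstrap_bias_variance_mse(fit_predict, X_train, y_train, X_test, y_test,
                                num_rounds=200, random_seed=None):
    """Hypothetical helper (not mlxtend's code): bootstrap estimate of the
    average expected loss, bias and variance under squared-error loss."""
    rng = np.random.default_rng(random_seed)
    n = X_train.shape[0]
    all_pred = np.empty((num_rounds, y_test.shape[0]))
    for i in range(num_rounds):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        all_pred[i] = fit_predict(X_train[idx], y_train[idx], X_test)
    main_pred = all_pred.mean(axis=0)              # main prediction E[y_hat]
    avg_expected_loss = ((all_pred - y_test) ** 2).mean()
    avg_bias = ((main_pred - y_test) ** 2).mean()  # squared bias term
    avg_var = ((all_pred - main_pred) ** 2).mean()
    return avg_expected_loss, avg_bias, avg_var

def linear_fit_predict(X_tr, y_tr, X_te):
    # Ordinary least squares with an intercept column, using NumPy only
    A = np.c_[np.ones(len(X_tr)), X_tr]
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.c_[np.ones(len(X_te)), X_te] @ w

# Synthetic data, made up purely for this illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

loss, bias, var = bootstrap_bias_variance_mse(
    linear_fit_predict, X_train, y_train, X_test, y_test,
    num_rounds=100, random_seed=123)
print(f"loss={loss:.4f}  bias={bias:.4f}  variance={var:.4f}")
```

Because the noisy test targets are used in place of the unknown true function, the noise ends up inside the bias term here, and the loss equals bias + variance exactly.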

I. Calculation of Bias & variance (For Regression):

Let us consider the Boston housing dataset for our regression problem.

From the output of the calculation above, it is evident that the average expected loss = Bias + Variance (for the 'mse' loss, the bias reported by mlxtend is the squared bias term). We can also see that the MSE calculated with scikit-learn is almost equal to the average expected loss.

After regularization using LASSO,

It can be observed that the bias has been reduced after regularization, there is a slight increase in variance, and the total average error has also been brought down.

Now that we have seen it in practice, let us also look at the mathematics behind it.

For regression models, the bias-variance decomposition splits the squared loss into three terms: variance, bias and noise [this part is the same for classifiers as well].

Ignoring the noise term,

Let us see the value of Bias & Variance

Target function: y = f(x)
Predicted target function: ŷ = f̂(x) = h(x)
Squared loss: S = (y − ŷ)²
Expectation: E[ŷ], taken over training sets
The main prediction for the squared error loss is simply the average of the predictions, E[ŷ].

The above table is a reference taken from the mlxtend GitHub repository.
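Using the notation defined above and still ignoring the noise term, the decomposition of the squared loss can be written out as follows. This is the standard derivation, obtained by adding and subtracting E[ŷ] inside the square; the cross term vanishes because E[E[ŷ] − ŷ] = 0.

```latex
\begin{aligned}
S &= (y - \hat{y})^2 \\
E[S] &= E\big[(y - E[\hat{y}] + E[\hat{y}] - \hat{y})^2\big] \\
     &= (y - E[\hat{y}])^2 + E\big[(E[\hat{y}] - \hat{y})^2\big]
        + 2\,(y - E[\hat{y}])\,E\big[E[\hat{y}] - \hat{y}\big] \\
     &= \underbrace{(y - E[\hat{y}])^2}_{\text{Bias}^2}
        + \underbrace{E\big[(E[\hat{y}] - \hat{y})^2\big]}_{\text{Variance}}
\end{aligned}
```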

II. Calculation of Bias & variance (For Classifiers):

For classifiers, we are going to use the same library. The only difference is the loss function: here we are going to use the 0–1 loss function.

What is 0–1 loss?

Let us say you have a binary classification problem (0 or 1) and, hypothetically, your dataset has 20 rows. Suppose that after classification with any algorithm, e.g. naive Bayes, 15 rows are predicted correctly and 5 are misclassified. Each correctly predicted row is marked as “0” and each misclassified row is marked as “1”; the average of these marks is called the 0–1 loss.

In the above case the 0–1 loss is 5/20 = 0.25, i.e. 25%.
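The arithmetic above can be checked in a few lines. The labels below are made up so that exactly 5 of the 20 predictions are wrong.

```python
import numpy as np

# Made-up labels for 20 rows; the first 5 predictions are flipped on purpose
y_true = np.array([0, 1] * 10)
y_pred = y_true.copy()
y_pred[:5] = 1 - y_pred[:5]

# 1 marks a misclassified row, 0 marks a correct one
per_row_loss = (y_pred != y_true).astype(int)
zero_one_loss = per_row_loss.mean()

print(per_row_loss.sum(), "misclassified out of", per_row_loss.size)
print("0-1 loss:", zero_one_loss)
```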

Image by Author

Note: For the 0–1 loss, the mode of the predictions is used as the main prediction E[ŷ].

The Bias and variance for 0–1 loss are as follows.

Bias is 1 if the main prediction does not agree with the true label y and 0 otherwise:

Image taken from mlxtend

The Variance of the 0–1 loss is defined as the probability that the predicted label does not match the main prediction:

Variance = P(ŷ ≠ E[ŷ])

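Loss = Bias + Variance when the main prediction is correct (Bias = 0). When the main prediction is wrong (Bias = 1), variance can actually lower the loss, so for the 0–1 loss the sum Bias + Variance is an upper bound on the expected loss rather than an exact identity.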
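A tiny hand-made example helps to see how these definitions behave. The prediction matrix below is hypothetical: 5 bootstrap rounds (rows) for 4 test points (columns) with binary labels. It computes the main prediction as the mode, and then the bias, variance and loss per test point.

```python
import numpy as np

# Hypothetical predictions: 5 bootstrap rounds (rows) x 4 test points (columns)
all_pred = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
])
y_test = np.array([0, 1, 0, 1])

# Main prediction: the most frequent (mode) label for each test point
main_pred = np.array([np.bincount(col).argmax() for col in all_pred.T])

bias = (main_pred != y_test).astype(int)         # 1 where the mode is wrong
variance = (all_pred != main_pred).mean(axis=0)  # P(y_hat != main prediction)
loss = (all_pred != y_test).mean(axis=0)         # expected 0-1 loss per point

print("main prediction:", main_pred)
print("bias           :", bias)
print("variance       :", variance)
print("loss           :", loss)
```

For these binary labels, the loss equals the variance wherever the bias is 0, and equals 1 − variance wherever the bias is 1. Averaged over the four points, the loss (0.45) is therefore smaller than bias + variance (0.65).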

Let us see an example for illustration — consider the Iris dataset,

After loading the dataset, let us find the error (loss) using the mlxtend library. We will also see how certain models (e.g. Random Forest) are able to reduce variance.

First let us try with Decision Tree,

If you observe here, after pruning, the variance of the model has reduced to some extent while the bias remains the same.

In the above, we observe that the total expected loss is close to the sum of Bias + Variance, and that pruning has some effect in reducing the variance.

Random Forest models are generally known to reduce variance (i.e. reduce overfitting). Let us also look at the results from an RF model.

There is a significant difference in variance: it has reduced drastically.

It can be observed from the above that RF actually helps in reducing the variance of the model.

If we tune the hyperparameters, for example using GridSearchCV with k-fold cross-validation, we might reduce the variance even further.

From the above, we can see that RF helps in reducing overfitting, or in other words, helps in reducing the variance of the model.

Let us also try to look into KNN

A KNN model with a low k value usually has high variance and low bias; as k increases, the variance decreases and the bias increases.

Let us try to examine that by using the same Iris dataset.

It can be observed that the Bias is relatively high [for k=3] compared to the variance. And the expected loss is more than the RF model.

Now let us plot the train and test scores for various values of k.

With respect to this dataset, we can observe that for low values of k the training score is higher than the test score; on the other hand, for higher values of k the test score is better than the training score. The bias-variance trade-off comes into the picture when selecting the optimal k value.
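The scores behind such a plot could be computed along these lines. The split, the seed and the range of k values are my assumptions, and the plotting itself is left out; the dictionary can be passed to any plotting library.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

scores = {}
for k in range(1, 22, 2):  # odd values of k from 1 to 21
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:2d}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```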

Let us also examine how the loss, bias and variance behave for various values of k in kNN.

It can be seen that for this particular dataset, the bias increases as k increases.

III. Calculation of Bias & variance (In TensorFlow/Keras):

As mentioned before, mlxtend supports Keras/TensorFlow models only from v0.18.0 onwards, together with TensorFlow ≥ 2.4.1.

Let us look at the same Boston housing dataset for our case,

MSE value calculated for the above

Let us calculate the loss, bias and variance using the mlxtend.

After running the model, the loss, bias and variance for the same are listed above.

It is also recommended to use the same number of training epochs for the bootstrap rounds as you would use on the original training set, so that proper convergence is achieved.

Key points to remember

Parametric or linear machine learning algorithms often have a high bias but a low variance. Some examples of parametric algorithms are linear regression, logistic regression and LDA. Here, more assumptions are made about the form of the target function. Higher bias often leads to underfitting of the model.

Ways to overcome underfitting:

a. Try a more complex model (one that makes fewer assumptions about the form of the target function).

b. Add features with higher predictive power (by performing feature engineering).

c. Add more training data if possible.

d. Remove noise from the data.

Non-parametric or non-linear machine learning algorithms often have a low bias but a high variance. Some examples of non-parametric algorithms are decision trees, kNN and SVM. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. High variance often leads to overfitting of the model.

Ways to overcome overfitting:

a. For decision trees, try to prune the tree if it is growing large.

b. For SVM, try to modify the C value, or use a linear kernel instead of RBF.

c. For kNN, try to find an optimal k value (a low k value implies overfitting, while a very high k value leads to underfitting).

d. Try regularization techniques.

e. Use cross-validation or hold back a validation set to check how well the model generalizes, and consider reducing the number of features.

Summary

To summarize, in this post we have seen what model bias, variance and irreducible error are.

We have also seen what the trade-off between bias and variance is and how it plays out for certain models, and how to calculate both bias and variance for regression models and classifiers by using the mlxtend library.

Code base used can be found here.
