Calculation of Bias & Variance in Python
Bias-Variance Decomposition Demystified
The performance of any machine learning model can be characterized in terms of bias and variance.
In supervised machine learning an algorithm learns a model from training data.
Y=f(X) + E
The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X).
The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate.
Along with f(X) there is an error term, denoted by E.
The prediction error of any machine learning algorithm can be broken down into three parts:
1. Model Bias
2. Model Variance
3. Irreducible Error
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
To generalize (for squared-error loss, where the bias enters as its square),
Error (Model) = Bias² + Variance + Irreducible Error
First let us understand what these terms mean; then we will estimate bias and variance on sample datasets to see how the calculation works in practice.
Before diving into the concept, let me illustrate it with a simple analogy.
Consider a student preparing for a math exam.
The task is to study the book, or the relevant chapters and modules, for the exam.
If the student only skims the topics and chapters instead of actually studying them, he or she will not perform well in the exam. This is like a model that underfits.
In another scenario, the student memorizes every chapter and worked problem instead of understanding the concepts, the theory and their implications. If the exam then asks a question that is not in the book but is based on the same concepts, that student will not be able to solve it. This is like a model that overfits.
So in the above case, studying the book is the task (what we expect of the model).
What must be the ideal scenario for the model then?
The ideal is a model that generalizes. In the example above, that means being able to solve any problem by studying the concepts, theory and approach rather than by skimming or memorizing the problems.
The example above is only a layman's explanation.
Let us now look at the details in terms of an actual model.
Irreducible Error:
In short,
Model Error = Reducible Error + Irreducible Error
Reducible error is the part that we can improve. It is the quantity that decreases as the model learns from a training dataset, and we try to get it as close to zero as possible.
Irreducible error is the part we cannot control, as mentioned at the beginning. It can arise for various reasons, such as statistical noise in the data.
It is also a reminder that no model can be perfect.
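To make the idea of irreducible error concrete, here is a minimal sketch on simulated data. The target function (a sine), the noise level sigma and the sample size are all assumptions chosen for illustration. It uses the true function f itself as the "model", so the reducible error is zero, and shows that the mean squared error still sits at roughly the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)


def f(x):
    """Assumed true mapping function (illustrative only)."""
    return np.sin(x)


sigma = 0.3  # assumed standard deviation of the noise term E
x = rng.uniform(-3, 3, size=100_000)
y = f(x) + rng.normal(0.0, sigma, size=x.size)  # Y = f(X) + E

# Use the true function itself as the model: nothing is left to reduce.
mse_perfect_model = np.mean((y - f(x)) ** 2)

print(f"MSE of the true function: {mse_perfect_model:.4f}")
print(f"Noise variance sigma^2:   {sigma ** 2:.4f}")
```

The two printed numbers should be close to each other; the small gap is sampling error from the finite simulation.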
Bias vs Variance Trade off
The bias and variance of a model are always connected.
We generally prefer models with low bias and low variance, but in practice achieving both at once is the main challenge; it can be seen as a goal of any machine learning problem.
Bias and variance tend to move in opposite directions:
as a model becomes more flexible its bias usually falls and its variance usually rises, and vice versa. (This is a tendency, not a strict Bias ∝ 1/Variance proportionality.)
This relationship is referred to as the trade-off. It is helpful when choosing a model and its configuration.
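The trade-off can be seen in a small simulation where the true function is known. The sketch below is an illustration under assumed settings (a sine target, Gaussian noise, polynomial fits of degree 1, 3 and 5 via np.polyfit); none of these choices come from a real dataset. For each degree it fits many independent training sets and measures the squared bias and the variance of the predictions.

```python
import numpy as np

rng = np.random.default_rng(42)


def f(x):
    """Assumed true mapping function (illustrative only)."""
    return np.sin(x)


sigma = 0.3      # assumed noise level
n_train = 30     # points per simulated training set
n_sets = 300     # number of simulated training sets
x_test = np.linspace(-2.5, 2.5, 50)


def bias_variance_for_degree(degree):
    """Fit a polynomial of the given degree on many training sets."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x_train = rng.uniform(-3, 3, size=n_train)
        y_train = f(x_train) + rng.normal(0.0, sigma, size=n_train)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[i] = np.polyval(coefs, x_test)
    main_pred = preds.mean(axis=0)                   # average prediction
    bias_sq = np.mean((main_pred - f(x_test)) ** 2)  # squared bias
    variance = np.mean(preds.var(axis=0))            # spread across sets
    return bias_sq, variance


tradeoff = {d: bias_variance_for_degree(d) for d in (1, 3, 5)}
for degree, (bias_sq, variance) in tradeoff.items():
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup the straight line has the largest squared bias and the smallest variance, while the degree-5 polynomial shows the reverse pattern.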
How to calculate the bias-variance trade-off for any algorithm on a given dataset?
The actual bias and variance of a predictive modeling problem are hard to calculate, because we do not know the true mapping function.
Even though the bias-variance trade-off is mostly a conceptual tool, we can estimate it in some cases.
The mlxtend library by Sebastian Raschka provides a function named bias_variance_decomp() that helps us estimate the bias and variance of various models over many bootstrap samples.
The mlxtend library has gained many capabilities in recent releases, the calculation of bias and variance being one of them.
We will look at an example of a regression model and a classification model for Bias vs Variance Trade off.
Note that mlxtend can now also estimate the bias and variance of TensorFlow/Keras models; this is covered in this post as well.
For this we must install the mlxtend library (pip install mlxtend).
For TensorFlow/Keras support the expected versions of mlxtend and TensorFlow are:
- mlxtend v0.18.0 or greater
- TensorFlow v2.4.1 or greater
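One way to check these minimum versions from Python is sketched below. It relies only on the standard library; the helper names version_tuple and meets_minimum are made up for this illustration, and the version parsing is deliberately simple (it keeps only the leading digits of the first three components).

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn '0.18.0' or '2.4.1rc0' into (0, 18, 0) or (2, 4, 1)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '1.2'
        parts.append(0)
    return tuple(parts)


def meets_minimum(package, minimum):
    """Return True if the package is installed at or above the minimum."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


for package, minimum in [("mlxtend", "0.18.0"), ("tensorflow", "2.4.1")]:
    print(f"{package} >= {minimum}: {meets_minimum(package, minimum)}")
```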
Overview of what we are going to see in this post.
Let us also try to understand the function provided by the library.
We will take a look at the function bias_variance_decomp:
bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss='0-1_loss', num_rounds=200, random_seed=None, **fit_params)
estimator : a classifier or regressor object or class implementing fit and predict methods similar to the scikit-learn API.
X_train : array, shape=(num_examples, num_features)
A training dataset for drawing the bootstrap samples to carry out the bias-variance decomposition.
y_train : array, shape=(num_examples)
Targets (class labels, or continuous values in case of regression) associated with the X_train examples.
X_test : array, shape=(num_examples, num_features)
The test dataset for computing the average loss, bias, and variance.
y_test : array, shape=(num_examples)
Targets (class labels, or continuous values in case of regression) associated with the X_test examples.
loss : str (default='0-1_loss')
Loss function for performing the bias-variance decomposition. Currently allowed values are 'mse' (for regression) and '0-1_loss' (for classifiers).
num_rounds : int (default=200)
Number of bootstrap rounds for performing the bias-variance decomposition.
random_seed : int (default=None)
Random seed for the bootstrap sampling used for the bias-variance decomposition.
fit_params : additional parameters
Additional parameters to be passed to the .fit() function of the estimator when it is fit to the bootstrap samples (available in recent mlxtend versions).
Returns
avg_expected_loss, avg_bias, avg_var : the average expected loss, average bias, and average variance (all floats), where the average is computed over the data points in the test set.
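To show what a bootstrap-based decomposition of this kind involves, here is a rough NumPy-only sketch for the squared-error case. It is an illustration of the general idea and not mlxtend's source code; the LeastSquares class, the decompose_mse function and the simulated data are all hypothetical names and values introduced for this example.

```python
import numpy as np


class LeastSquares:
    """Hypothetical minimal estimator with fit/predict (ordinary least squares)."""

    def fit(self, X, y):
        design = np.column_stack([np.ones(len(X)), X])
        self.coef_, *_ = np.linalg.lstsq(design, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.coef_


def decompose_mse(estimator, X_train, y_train, X_test, y_test,
                  num_rounds=200, random_seed=None):
    """Estimate average loss, squared bias and variance over bootstrap rounds."""
    rng = np.random.RandomState(random_seed)
    all_pred = np.zeros((num_rounds, len(y_test)))
    for i in range(num_rounds):
        # Draw a bootstrap sample (with replacement) from the training set.
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        all_pred[i] = estimator.fit(X_train[idx], y_train[idx]).predict(X_test)
    main_pred = all_pred.mean(axis=0)  # main prediction per test point
    avg_expected_loss = np.mean((all_pred - y_test) ** 2)
    avg_bias = np.mean((main_pred - y_test) ** 2)  # squared bias
    avg_var = np.mean((all_pred - main_pred) ** 2)
    return avg_expected_loss, avg_bias, avg_var


# Simulated data (assumption): noisy sine curve, 150 training and 50 test rows.
data_rng = np.random.RandomState(0)
X = data_rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + data_rng.normal(0.0, 0.3, size=200)

avg_expected_loss, avg_bias, avg_var = decompose_mse(
    LeastSquares(), X[:150], y[:150], X[150:], y[150:],
    num_rounds=200, random_seed=1)
print(f"loss={avg_expected_loss:.4f}  bias={avg_bias:.4f}  var={avg_var:.4f}")
```

Because the noisy test targets are used in place of the unknown true function, the noise is absorbed into the bias term, and the loss equals bias plus variance up to floating-point error.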
I. Calculation of Bias & variance (For Regression):
Let us consider the Boston housing dataset for our regression problem.
After pruning the features using LASSO,
Having seen the calculation in practice, let us also look at the mathematics behind it.
For regression models, the bias-variance decomposition splits the squared-error loss into three terms: variance, bias and noise (the same three terms appear for classifiers).
Ignoring the noise term,
let us see how bias and variance are defined.
Target function: y = f(x)
Predicted target function: ŷ = f̂(x) = h(x)
Squared loss: S = (y − ŷ)²
The main prediction for the squared-error loss is simply the average of the predictions, E[ŷ], where the expectation is taken over training sets.
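With these definitions, the decomposition of the squared loss (noise ignored, so y is treated as fixed) can be written out as a short derivation. This is the standard textbook form, using the same symbols as above:

```latex
\begin{aligned}
\text{Bias}^2 &= \left(E[\hat{y}] - y\right)^2 \\
\text{Variance} &= E\!\left[\left(E[\hat{y}] - \hat{y}\right)^2\right] \\
E\!\left[(y - \hat{y})^2\right]
  &= E\!\left[\left(y - E[\hat{y}] + E[\hat{y}] - \hat{y}\right)^2\right] \\
  &= \left(y - E[\hat{y}]\right)^2
     + E\!\left[\left(E[\hat{y}] - \hat{y}\right)^2\right]
     + 2\left(y - E[\hat{y}]\right) E\!\left[E[\hat{y}] - \hat{y}\right] \\
  &= \text{Bias}^2 + \text{Variance}
\end{aligned}
```

The cross term vanishes because E[E[ŷ] − ŷ] = E[ŷ] − E[ŷ] = 0.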
II. Calculation of Bias & variance (For Classifiers):
For classifiers we use the same library; the only difference is the loss function. Here we use the 0–1 loss.
What is 0–1 loss?
Suppose you have a binary classification problem (labels 0 or 1) and, hypothetically, a dataset of 20 rows. After classifying with any algorithm, e.g. naive Bayes, suppose 15 rows are predicted correctly and 5 are misclassified. Each correctly predicted row is assigned a loss of 0 and each misclassified row a loss of 1; this is the 0–1 loss.
The average 0–1 loss in this case is 5/20 = 0.25, i.e. 25%.
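The same arithmetic in a few lines of plain Python. The labels below are made up for illustration: 20 rows, of which 5 predictions are flipped.

```python
# Hypothetical labels: 20 rows, 10 of each class.
y_true = [0] * 10 + [1] * 10

# Copy the labels and flip 5 of them to simulate misclassifications.
y_pred = list(y_true)
for i in (2, 7, 11, 15, 19):
    y_pred[i] = 1 - y_pred[i]

# 0 for a correct prediction, 1 for a misclassification.
per_row_loss = [int(t != p) for t, p in zip(y_true, y_pred)]
zero_one_loss = sum(per_row_loss) / len(per_row_loss)

print(sum(per_row_loss), "misclassified out of", len(per_row_loss))
print("0-1 loss:", zero_one_loss)  # 0.25
```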
Note: For the 0–1 loss, the main prediction E[ŷ] is defined as the mode of the predictions.
The bias and variance for the 0–1 loss are as follows.
Bias is 1 if the main prediction does not agree with the true label y, and 0 otherwise:
Bias = 1 if E[ŷ] ≠ y, else 0
The variance of the 0–1 loss is defined as the probability that the predicted label does not match the main prediction:
Variance = P(ŷ ≠ E[ŷ])
Loss = Bias + Variance (exact when the main prediction is correct, i.e. Bias = 0; where the main prediction is wrong, variance can lower the loss instead)
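These definitions can be checked by hand for a single test point. In the sketch below, the list of predictions stands for the labels a classifier produced for one test point across ten bootstrap rounds; the values and the function name decompose_zero_one are made up for illustration.

```python
from collections import Counter


def decompose_zero_one(predictions, true_label):
    """Main prediction (mode), bias, variance and loss for one test point."""
    main_prediction = Counter(predictions).most_common(1)[0][0]
    bias = int(main_prediction != true_label)
    variance = sum(p != main_prediction for p in predictions) / len(predictions)
    loss = sum(p != true_label for p in predictions) / len(predictions)
    return main_prediction, bias, variance, loss


# Hypothetical predictions for one test point over ten bootstrap rounds.
rounds = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

case_correct = decompose_zero_one(rounds, true_label=1)
case_wrong = decompose_zero_one(rounds, true_label=0)

print(case_correct)  # (1, 0, 0.2, 0.2): loss = bias + variance
print(case_wrong)    # (1, 1, 0.2, 0.8): loss = bias - variance
```

The second case shows why, for the 0–1 loss, the three averaged numbers do not have to add up exactly on every dataset.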
Let us see an example for illustration using the Iris dataset.
After loading the dataset, let us estimate the error (loss) using the mlxtend library; we will also see how certain models (e.g. random forest) are able to reduce variance.
First, let us try a decision tree:
In the above, we observe that the total expected loss is approximately the sum of bias and variance, and that pruning has some effect on the variance.
Random forest models are generally known to reduce variance (that is, to reduce overfitting). Let us also look at the results from a random forest model.
It can be observed from the above that RF helps reduce the variance of the model.
If we tune the hyperparameters, for example with GridSearchCV and k-fold cross-validation, we might reduce the variance even further.
The results above suggest that RF helps reduce overfitting, or in other words the variance of the model.
Let us also look at kNN.
A kNN model with a low value of k usually has high variance and low bias; as k increases, the variance decreases and the bias increases.
Let us examine that using the same Iris dataset.
Now let us look at the train and test performance for various values of k.
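The numbers behind such a plot can be computed as in the sketch below; the split, the seed and the chosen values of k are assumptions for this example, and plotting is left out to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, shuffle=True, stratify=y)

scores = {}
for k in (1, 3, 5, 15, 45):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Accuracy on the data the model has seen versus unseen data.
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:>2}  train accuracy={scores[k][0]:.3f}  "
          f"test accuracy={scores[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbour, so the training accuracy is at or very near 1.0; a gap between training and test accuracy is a sign of overfitting.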
For various values of k in kNN, let us also examine how the loss, bias and variance behave.
III. Calculation of Bias & variance (In TensorFlow/Keras):
As mentioned before, mlxtend supports Keras/TensorFlow models only from v0.18.0 onward, together with TensorFlow 2.4.1 or later.
Let us use the same Boston housing dataset for this case.
Let us calculate the loss, bias and variance using mlxtend.
It is also recommended to use the same number of training epochs in each bootstrap round that you would use when training on the original training set, so that the model converges properly.
Key points to remember
Parametric or linear machine learning algorithms often have high bias but low variance. Examples of parametric algorithms are linear regression, logistic regression and LDA. These make more assumptions about the form of the target function. Higher bias often leads to underfitting.
Ways to overcome underfitting:
a. Try a more complex model (one that makes fewer assumptions).
b. Add features with higher predictive power (feature engineering).
c. Add more training data if possible.
d. Remove noise from the data.
Non-parametric or non-linear machine learning algorithms often have low bias but high variance. Examples of non-parametric algorithms are decision trees, kNN and SVM. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. High variance often leads to overfitting.
Ways to overcome overfitting:
a. For decision trees, prune the tree if it grows too large.
b. For SVM, try adjusting the C value, or use a linear kernel instead of RBF.
c. For kNN, look for an optimal k (a low k implies overfitting, a very high k leads to underfitting).
d. Try regularization techniques.
e. Use cross-validation or hold back a validation set to detect overfitting; adding more training data, rather than more features, also helps.
Summary
To summarize, in this post we have seen what model bias, variance and irreducible error are.
We have also seen what the trade-off between bias and variance is and how certain models help manage it, and how to estimate both bias and variance for regression models and classifiers using the mlxtend library.
Code base used can be found here.