“It is better to be approximately right than precisely wrong.” - Warren Buffett
Let’s say you are planning an event. You estimate the cost of hosting it based on data from the past few years. After the event, you calculate the actual cost. Did you make an accurate prediction earlier? That’s unlikely! If the answer was yes, you either got very lucky or you are lying ;) But was your prediction good enough? Sure, that’s a possibility!
How did you know that your prediction was acceptable? The deviation of the prediction from the actual cost was quite narrow, wasn’t it? In the ML world, this deviation between the actual and predicted values is called the error.
Error = Irreducible error + Reducible error
Suppose some information is missing from your historical data (missing variables): say, the marketing cost for the event was never captured at all. The deviation in your prediction caused by this is the irreducible error. It occurs because of features that were not captured. In other words, irreducible error arises from the inherent error in the measurement system and cannot be controlled or reduced even by building good ML models. Think of the measurement error in a device that measures weight, blood pressure, etc.
Let’s talk about what can be reduced/controlled …
We build an ML model to estimate the cost of the event. What is expected from this model? The model is fed some input (training) data, based on which it is expected to make acceptable predictions on unseen data. Hence, the quality of the data used for model development determines the result (garbage in, garbage out). For the sake of this discussion, let’s assume that the training data is of good quality.
The predictor variables (the venue, the number of guests, the speakers, the cost of catering, etc.) are used to estimate the cost of the event. Suppose, while building the model, you drop all the variables except ‘catering’. That is, the cost of the event is estimated ONLY from the cost associated with catering. Do you think this will be an efficient model?
An oversimplified model such as this, which fails to capture the true underlying relationship between the predictors and the response variable (the cost of hosting the event), is said to have high bias. Our model above was biased towards the variable ‘catering’. Such a model will perform poorly on the training data as well as on unseen data. Models with high bias tend to underfit the data.
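To make this concrete, here is a minimal sketch (assuming NumPy is available; the data is synthetic and purely illustrative) of a high-bias model: the true relationship is quadratic, but we fit a straight line, much like estimating the event cost from catering alone.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x**2 + rng.normal(0, 0.3, size=x.shape)  # true signal is non-linear

# Degree-1 fit: too simple to capture the curvature (high bias, underfits)
linear = np.polyfit(x, y, deg=1)
mse_linear = np.mean((y - np.polyval(linear, x)) ** 2)

# Degree-2 fit: matches the underlying relationship
quadratic = np.polyfit(x, y, deg=2)
mse_quadratic = np.mean((y - np.polyval(quadratic, x)) ** 2)

print(mse_linear, mse_quadratic)
```

Notice that the straight line does badly even on the data it was trained on; no amount of extra training data fixes that, because the model family itself is too restrictive.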
Bias comes from simplifying or erroneous assumptions the model makes about the training data, so that the target becomes easier to predict.
On the other hand, what about taking into account every possible variation in the training data to build the model?
Let me give you an analogy. Before taking an exam, you memorize everything in the book, word by word. What is the chance that you will perform well? — when the questions are exactly as given in your book. What happens if you are expected to apply logic? We all know the answer to that, don’t we? ;)
Similarly, a model that learns every variation in the training data becomes complex, performs remarkably well on the training data, but fails to perform nearly as well on the test data. Models such as this, which are sensitive to small variations in the data, are said to have high variance, and they tend to overfit the data.
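The exam analogy can be sketched the same way (again assuming NumPy; the data and degrees are illustrative): a polynomial with enough degrees of freedom to pass through every noisy training point aces the training set but does poorly on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-3, 3, 12)
y_train = x_train**2 + rng.normal(0, 1.0, size=x_train.shape)
x_test = np.linspace(-2.8, 2.8, 50)
y_test = x_test**2 + rng.normal(0, 1.0, size=x_test.shape)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    c = np.polyfit(x_train, y_train, degree)
    train = np.mean((y_train - np.polyval(c, x_train)) ** 2)
    test = np.mean((y_test - np.polyval(c, x_test)) ** 2)
    return train, test

train_simple, test_simple = train_test_mse(2)     # matches the true relationship
train_complex, test_complex = train_test_mse(11)  # memorises all 12 training points
```

The complex model “memorises the book word by word”: its training error is nearly zero, yet its test error exceeds that of the simple model.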
Variance is the change in the estimate of the target when the training data changes.
Bias vs Variance
The above image illustrates the underfit, overfit and desired models in regression (estimating the value of a continuous variable) and classification (classifying the input into labelled outputs).
Reducible Error = Bias² + Variance
The total reducible error is the sum of the squared bias and the variance. The two move in opposite directions: as model complexity grows, bias decreases while variance increases, and vice versa.
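This decomposition can be estimated numerically. The sketch below (assuming NumPy; the ground-truth function and settings are made up for illustration) repeatedly draws fresh training sets, fits a simple and a complex model, and measures the squared bias and the variance of each model’s prediction at one fixed point.

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = np.sin  # hypothetical ground-truth relationship
x0, n_train, n_repeats = 1.0, 20, 500

def predictions_at_x0(degree):
    """Fit a degree-`degree` polynomial on a fresh noisy training set,
    predict at x0, and repeat many times."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        x = rng.uniform(-3, 3, n_train)
        y = true_f(x) + rng.normal(0, 0.3, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

results = {}
for degree in (1, 9):
    p = predictions_at_x0(degree)
    results[degree] = {"bias_sq": (p.mean() - true_f(x0)) ** 2,
                       "variance": p.var()}
# Degree 1 shows high squared bias; degree 9 shows higher variance.
```

The simple model is consistently wrong in the same direction (bias), while the complex model’s predictions swing with every new training set (variance).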
What’s the trade off?
With high bias and low variance, the prediction error is high on both the training and the test data (Underfitting).
With high variance and low bias, the prediction error on the training data is very low, while the error on the test data is quite high (Overfitting).
On the one hand, we would like our model not to miss the relevant features and interesting patterns in the training data. On the other hand, we do not want it to over-interpret the outliers and irregularities in the data.
The ideal scenario would be low bias and low variance. In practice, however, you cannot minimise both at once. One of the main objectives of an ML model is to generalise well to unseen data, and from the above it is clear that both underfitting (high bias) and overfitting (high variance) defeat that purpose. The trade-off is the sweet spot in between, where bias, variance and model complexity are at optimal levels.
How to achieve Bias-Variance trade-off?
Reducing Bias Error
Increasing the number of features (predictors) used to estimate the target will reduce bias. More features allow the model to better capture the relationship between the predictors and the response variable.
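As a sketch of this (assuming NumPy; the cost formula and numbers are invented for illustration), compare a least-squares model that uses only catering with one that also sees the guest count:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
catering = rng.uniform(10, 50, n)
guests = rng.uniform(20, 200, n)
# Hypothetical ground truth: cost depends on catering AND guest count
cost = 5 * catering + 2 * guests + rng.normal(0, 5, n)

# One-feature model: least squares on catering alone (biased)
X1 = np.column_stack([np.ones(n), catering])
b1, *_ = np.linalg.lstsq(X1, cost, rcond=None)
mse_one = np.mean((cost - X1 @ b1) ** 2)

# Two-feature model: catering + guests captures the full relationship
X2 = np.column_stack([np.ones(n), catering, guests])
b2, *_ = np.linalg.lstsq(X2, cost, rcond=None)
mse_two = np.mean((cost - X2 @ b2) ** 2)
```

The one-feature model systematically mis-estimates the cost because the effect of guests is invisible to it; adding the missing predictor removes that bias.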
Reducing Variance Error
Increasing the number of training samples will reduce variance. More samples improve the signal-to-noise ratio in the data. Intuitively, this leans on the law of large numbers: as the sample size increases, the sample becomes more representative of the population, thereby reducing variance.
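A quick simulation makes this visible (assuming NumPy; data and model are illustrative): fit the same model on many small training sets and many large ones, and compare how much the prediction at a fixed point fluctuates.

```python
import numpy as np

rng = np.random.default_rng(3)

def prediction_variance(n_train, n_repeats=500):
    """Variance of a quadratic fit's prediction at x=1.0
    across many freshly drawn training sets of size n_train."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        x = rng.uniform(-3, 3, n_train)
        y = x**2 + rng.normal(0, 1.0, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, 2), 1.0)
    return preds.var()

var_small = prediction_variance(15)
var_large = prediction_variance(150)
# With ten times the data, the prediction is far more stable.
```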
While the above two methods serve as the first-line treatment for achieving the bias-variance trade-off, the following are a few other ways to reach optimal bias and variance.
- Fit the model with the best model parameters.
- Tune the hyperparameters.
- Use Cross Validation, ensemble, bagging, boosting techniques.
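As a small taste of the last point, here is a minimal k-fold cross-validation sketch using only NumPy (the data and candidate degrees are illustrative): the validation error, averaged across folds, lets us compare models of different complexity on data they were not trained on.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 60)
y = x**2 + rng.normal(0, 1.0, 60)

def cv_mse(degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)           # everything outside the fold
        c = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((y[fold] - np.polyval(c, x[fold])) ** 2))
    return float(np.mean(errors))

scores = {d: cv_mse(d) for d in (1, 2, 10)}
best_degree = min(scores, key=scores.get)
# The underfitting degree-1 model is clearly penalised by its validation error.
```

Libraries such as scikit-learn offer ready-made utilities for this, but the idea is exactly what the loop above does: hold out a slice of the data, fit on the rest, and score on the held-out slice.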
More on these techniques next time!