How to Explain the Bias-Variance Trade-off in a Statistical Way
The Bias-Variance Trade-off is usually invoked to explain the accuracy vs. overfitting trade-off. However, that explanation is not rigorous, and it is hard to see how overfitting relates to variance. Here is my version of explaining it. Of course, there is not much innovation here, and many people have probably thought about it the same way.
Assume there exists a True Model f that, given an input observation, gives the exact output we would observe in the real world.
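In symbols, writing x for an input observation and y for the observed output (notation added here just for concreteness):

$$y = f(x)$$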
Then, we try to estimate f by f' with some method M:
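$$f' = M(D)$$

where D is shorthand (introduced here) for the dataset on hand: f' is simply whatever model the method M produces when given D.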
Of course, f' is not the perfect model and there is some error. Note that the error here is not necessarily a number; it is, more philosophically, an "error idea" that represents the difference between the two models. How do we quantify that? Let us not, and simply denote that error by epsilon. So, in other words,
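$$f = f' + \epsilon$$

(This just restates the previous sentence symbolically: epsilon stands for whatever separates the estimated model f' from the true model f, without committing to how it is measured.)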
Now, assume we have a superpopulation and can draw infinitely many datasets from it, each with the same quality and quantity as our current dataset on hand. Then we will have an infinite number of estimates f' and errors. Let us index each f' and its error by some integer i.
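Writing D_i for the i-th hypothetical dataset, the collection of estimates and errors looks like:

$$f'_i = M(D_i), \qquad \epsilon_i = f - f'_i, \qquad i = 1, 2, 3, \dots$$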
Bias is the expected value of this error. Since we have infinitely many errors, their sample mean converges in probability to the expected value (this is just the law of large numbers). Therefore, when we say we have a smaller bias, it means we have a smaller average error of the estimated model, given M and a dataset of similar quality to the one on hand.
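In the notation above (the averaging form on the right is the law-of-large-numbers statement restated):

$$\text{Bias} = E[\epsilon] = E[f - f'] \approx \frac{1}{n}\sum_{i=1}^{n} \epsilon_i \quad \text{for large } n$$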
Since we assume a fixed quality of dataset, the best (and essentially the only) way to reduce bias is to add more complexity to the model. So M becomes more complex. That, in turn, leads to higher variance.
Variance is the expected squared deviation of f' from its own average, i.e. how far f' tends to land from E(f'). In English, it measures how variable f' is across datasets. Note that Var does not equal the expected squared error, because it is not always true that E(f') = f. When M is more complex, it is easier to overfit a dataset. Therefore, on average, the estimated models from a more complex M will be centered closer to the true model, but with higher fluctuations around that center.
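In the same notation, treating the true model f as fixed:

$$\text{Var} = E\big[(f' - E[f'])^2\big]$$

which coincides with the expected squared error E[epsilon^2] only in the special case E(f') = f, i.e. when the bias is zero.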
With simple algebra, one can see that total error = bias squared + variance, where total error means the expected squared error. The bias is squared so that both terms are on the same scale, which is what makes it meaningful to add them.
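Sketching that algebra, by adding and subtracting E[f']:

$$E[\epsilon^2] = E\big[(f - f')^2\big] = E\Big[\big((f - E[f']) + (E[f'] - f')\big)^2\Big]$$

$$= \big(f - E[f']\big)^2 + E\big[(E[f'] - f')^2\big] = \text{Bias}^2 + \text{Var}$$

The cross term 2(f - E[f']) E[E[f'] - f'] drops out because E[E[f'] - f'] = 0.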
In conclusion, given the same quality of dataset, when we reduce the bias we often increase the variance as well. The task is to reduce the total error, so in practice we need to strike a balance between variance and bias. This usually means finding the right level of model complexity.
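To make the trade-off concrete, below is a minimal simulation sketch in Python (my own illustrative setup, not part of the argument above). The "superpopulation" is noisy samples of a sine curve, the method M is polynomial least squares of a given degree, and bias squared and variance are estimated by refitting on many independently drawn datasets; all names and parameter values here (true_model, bias_variance, noise_sd, and so on) are hypothetical choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_model(x):
    # The "True Model" f, assumed here to be a sine curve purely for illustration.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=500, n_points=20, noise_sd=0.3):
    """Fit a polynomial of the given degree (the method M) to many independent
    datasets and estimate bias^2 and variance of the fit, averaged over a test grid."""
    x_test = np.linspace(0.0, 1.0, 101)
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # One hypothetical dataset D_i: same size and noise level every time,
        # mimicking "same quality and quantity" draws from the superpopulation.
        x = rng.uniform(0.0, 1.0, n_points)
        y = true_model(x) + rng.normal(0.0, noise_sd, n_points)
        coefs = np.polyfit(x, y, degree)        # f'_i = M(D_i)
        preds[i] = np.polyval(coefs, x_test)    # f'_i evaluated on the test grid
    mean_pred = preds.mean(axis=0)              # estimate of E[f']
    bias_sq = np.mean((mean_pred - true_model(x_test)) ** 2)   # average Bias^2
    variance = np.mean(preds.var(axis=0))                      # average Var
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}  variance = {v:.4f}  total = {b2 + v:.4f}")
```

On a run like this, one should typically see the low-degree fit with high bias squared and low variance, the high-degree fit the other way around, and some intermediate degree minimizing the sum, which is exactly the balance described above.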