Understanding Bias and Variance in ML once and for all
The bias/variance tradeoff is one of the first things most textbooks cover when introducing machine learning, data science, or artificial intelligence. The Wikipedia page is not much help either: it's jam-packed with terminology that takes time to unravel. I'm going to try to explain these fundamental ideas and how they apply across many different ML problems.
Bias comes from the assumptions a learning algorithm makes about the data. Say I have two bags of marbles, and I only get a chance to look into one of them. In my quick glimpse, I saw mostly blue marbles, even though I know there are both blue and red marbles in the bag. So when someone asks me what's in the second bag, I reply, "blue marbles." This is an example of underfitting the data: I saw blue and red, but I simplified to just blue. My bias was too strong, so I reported only one color.
Variance comes from looking at the data a little too carefully. Back in our scenario, suppose I instead reply to the person asking about the second bag, "there are light blue, dark blue, midnight blue, sunrise red, and apple red marbles." That's too much detail. I am overfitting the data: I am memorizing each individual data point. My answer is too noisy to generalize from, because I was so specific about every marble's color.
The sweet spot is to say there are red and blue marbles in the bag: enough detail to be accurate, and enough simplicity to generalize.
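To make the marble analogy concrete in code, here is a minimal sketch of the same tradeoff using polynomial regression. It is an illustration I am adding, not something from the original post: a very low-degree polynomial underfits noisy data (high bias, like answering "blue marbles"), a very high-degree one overfits it (high variance, like listing every shade), and a moderate degree sits in between. The data, degrees, and noise level are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples drawn around a smooth underlying function (a sine curve).
x_train = np.sort(rng.uniform(0, 3, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 3, 200))
y_test = np.sin(x_test) + rng.normal(0, 0.2, 200)

def mse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then measure mean squared error on train and test sets.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# Degree 1 underfits (high bias), degree 15 overfits (high variance),
# degree 3 is closer to the sweet spot.
for d in (1, 3, 15):
    train_err, test_err = mse(d)
    print(f"degree {d:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

Running this, the degree-1 model has high error everywhere (it oversimplified, like "blue marbles"), while the degree-15 model drives training error very low yet does worse on fresh data (it memorized the noise, like naming every shade).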