Supervised machine learning can be summarized as shown below.
Training data is fed to an algorithm, which derives a target function. Test data is then fed into this target function to get a prediction.
As an example, simple linear regression follows the equation h(X1) = b0 + b1*X1.
The algorithm uses the training data to derive the coefficients b0 and b1. Plugging these coefficient values into the equation gives the target function.
Once the coefficients are derived and the target function is formulated, test data is passed to the target function to get a prediction.
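The derivation of b0 and b1 can be sketched in a few lines of Python. This is a minimal illustration using the ordinary least-squares closed form, not necessarily what any particular library does internally. The function names and the toy data (which lie exactly on y = 1 + 2x) are assumptions made for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize squared error for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def make_target_function(b0, b1):
    """Plug the derived coefficients into h(x) = b0 + b1 * x."""
    def h(x):
        return b0 + b1 * x
    return h


# Toy training data (assumed for illustration): exactly y = 1 + 2x.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_simple_linear_regression(train_x, train_y)
h = make_target_function(b0, b1)

# Test data is passed to the target function to get a prediction.
test_x = 6.0
prediction = h(test_x)
print(b0, b1, prediction)
```

Here `h` plays the role of the target function: it is built once from the training data and then applied to values it has not seen.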
The graph below is a representation of simple linear regression. Blue stars represent training data and the red star represents test/validation data.
The trend line in red shows the target function's values across the feature X1. It is simply the straight line whose coefficients are chosen to minimize the loss, measured here with the root mean square error (RMSE).
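For reference, RMSE is the square root of the average squared difference between actual and predicted values. A small sketch, with a hypothetical helper name and made-up numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [2.0, 4.0, 6.0]
predicted = [2.0, 5.0, 8.0]      # off by 0, 1 and 2 respectively
loss = rmse(actual, predicted)   # sqrt((0 + 1 + 4) / 3)
print(loss)
```

Because the square root is monotonic, the line that minimizes RMSE is the same line that minimizes the mean squared error, which is why a least-squares fit is commonly used to find it.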
Simple linear regression is a simple algorithm that rests on several assumptions. The biggest one is that the training data follows a straight line. Under this assumption, the algorithm might fail to account for some of the data points (e.g., the two blue stars in the top left), effectively treating them as noise or outliers. The error introduced by such simplifying assumptions is called Bias.
Such assumptions keep the algorithm simple, and a generalization (the straight line) is found easily. But because the model is too simple to capture some of the data points, this situation is called Underfitting.
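Underfitting is easy to reproduce on data that does not follow a straight line at all. The sketch below fits a line to five points taken from y = x^2 and measures the error on the very data it was trained on. The toy data and the helper names are assumptions chosen for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Curved training data (assumed): y = x^2, which no straight line can follow.
train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
train_y = [x ** 2 for x in train_x]

b0, b1 = fit_line(train_x, train_y)
train_pred = [b0 + b1 * x for x in train_x]
train_rmse = rmse(train_y, train_pred)
print(b0, b1, train_rmse)
```

On this data the best straight line is flat (slope 0, intercept 2), so the training RMSE stays well above zero. The model is too simple for the pattern even on data it has already seen, which is the signature of high Bias.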
How do we reduce Bias?
How about considering every data point instead of assuming they follow a straight-line trend?
This makes the algorithm very complex and results in something like the curve below. The problem with this approach is that there is no generalization. The algorithm has considered each and every data point and tried to match its predicted values to the actual Y values. It has learned the training data too closely, noise included, and cannot generalize. Such cases are called Overfitting.
That means when it is given the test data (red star), its prediction is unreliable, because it has no general trend to follow and would change considerably if the training data changed even slightly. This sensitivity to the training data is called Variance.
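A rough way to see overfitting in code is to compare a straight line with a curve forced through every training point. The sketch below uses Lagrange interpolation as a stand-in for "a very complex algorithm". That choice, the helper names, and the toy data (the trend y = x with alternating noise of +0.5 and -0.5) are all assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return lambda x: b0 + b1 * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


def rmse(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Training data: trend y = x with alternating noise of +0.5 / -0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
# Held-out test points lying on the underlying trend.
test_x = [0.5, 4.5]
test_y = [0.5, 4.5]

line = fit_line(train_x, train_y)
curve = fit_interpolant(train_x, train_y)

line_train_rmse = rmse(train_y, [line(x) for x in train_x])
curve_train_rmse = rmse(train_y, [curve(x) for x in train_x])
line_test_rmse = rmse(test_y, [line(x) for x in test_x])
curve_test_rmse = rmse(test_y, [curve(x) for x in test_x])

print("train:", line_train_rmse, curve_train_rmse)
print("test: ", line_test_rmse, curve_test_rmse)
```

The interpolating curve reproduces every training point exactly, yet it is far off on the two held-out points. The straight line is slightly off on the training data but much closer on the held-out points.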
Now if we redraw the summary shown in Fig 1, the result is as shown in Fig 5.
Oh yes, we did not talk about Irreducible Error. This kind of error is introduced at the data source level. Suppose one of the data sources is an IoT device that is not working well. It might send data with a lot of noise. No choice of algorithm can remove such errors, which is why they are called Irreducible Errors.
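To make the noisy-device idea concrete, the sketch below simulates readings from a sensor and scores predictions that come from the true relationship itself. The relationship y = 1 + 2x, the noise level of 0.5, and all names are hypothetical values picked for the example.

```python
import math
import random


def true_relationship(x):
    # Hypothetical noise-free relationship, assumed for illustration.
    return 1.0 + 2.0 * x


rng = random.Random(0)   # fixed seed so the run is repeatable
noise_sd = 0.5           # assumed standard deviation of the sensor noise

xs = [i / 100 for i in range(1000)]
readings = [true_relationship(x) + rng.gauss(0.0, noise_sd) for x in xs]

# Even predictions from the true relationship miss the noisy readings.
perfect_predictions = [true_relationship(x) for x in xs]
squared_errors = [(r - p) ** 2 for r, p in zip(readings, perfect_predictions)]
noise_floor_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(noise_floor_rmse)
```

The RMSE lands near the assumed noise level rather than at zero, even though the "model" here is the true relationship. That remaining error comes from the data source, not from the algorithm.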
Below is a summary of the different kinds of errors.
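For squared-error loss, these three kinds of error are commonly combined into a single decomposition. The notation is an assumption introduced here: f is the true underlying function, the data is generated as y = f(x) + ε with zero-mean noise ε of variance σ², h is the learned target function, and the expectation is taken over different training sets and over the noise.

```latex
\mathbb{E}\big[(y - h(x))^2\big]
  = \underbrace{\big(\mathbb{E}[h(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(h(x) - \mathbb{E}[h(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first two terms depend on the algorithm, while the last one depends only on the data source.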
Based on what we have seen so far, it looks like:
- Simple algorithms have high Bias and low Variance
- Complex algorithms have low Bias and high Variance
If we were to plot this, it would look something like Fig 7.
Simple algorithms like Linear Regression and Logistic Regression tend to have high Bias but low Variance.
Complex algorithms like Decision Tree, KNN, and SVM tend to have low Bias but high Variance.
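Bias and Variance can also be estimated empirically by retraining each kind of model on many noisy versions of the same training set and looking at its predictions at one fixed point. The sketch below is a small simulation under stated assumptions: the true function is x^2, the noise is Gaussian with standard deviation 1, a straight line stands in for the simple algorithm, and an interpolating polynomial stands in for the complex one (it is not a Decision Tree, KNN, or SVM). All names are hypothetical.

```python
import random


def true_f(x):
    # Assumed true underlying function for the simulation.
    return x ** 2


def fit_line(xs, ys):
    """Simple model: least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return lambda x: b0 + b1 * x


def fit_interpolant(xs, ys):
    """Complex model: Lagrange polynomial through every training point."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


def bias_variance(fit, xs, x0, noise_sd, trials, rng):
    """Estimate squared bias and variance of the prediction at x0."""
    preds = []
    for _ in range(trials):
        # Fresh noisy training targets for every trial.
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(fit(xs, ys)(x0))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance


train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
rng = random.Random(42)  # fixed seed so the run is repeatable

simple_bias_sq, simple_var = bias_variance(fit_line, train_x, 0.5, 1.0, 2000, rng)
complex_bias_sq, complex_var = bias_variance(fit_interpolant, train_x, 0.5, 1.0, 2000, rng)

print("simple :", simple_bias_sq, simple_var)
print("complex:", complex_bias_sq, complex_var)
```

In this setup the straight line shows the larger squared bias and the smaller variance, while the interpolating polynomial shows the opposite, matching the pattern described above.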
How do we trade off Bias and Variance? How do we get the best of both worlds (simple and complex algorithms)?
That’s for the next article!